Creating rich customer understanding based on clickstream data

According to Wikipedia, a clickstream is the recording of the parts of the screen a computer user clicks on while web browsing or using another software application. Traditionally, companies have used clickstream data to learn how customers act on their website, to understand what works well there, and to optimize their sales conversion funnel.

Many e-commerce companies are already skilled and experienced at optimizing their sales funnel with this data. A new, interesting use case for clickstream data is building a rich customer understanding. This requires collecting a detailed set of behavioral data at the level of the individual customer. In addition to the data collected from the company's own web sites, it may also be enriched with behavioral data from mobile applications, social media and bought media.

Customers use many different devices in their interactions with the company, which makes gathering the data and processing it into one customer-centric model even more challenging. Achieving the most comprehensive customer understanding requires that sessions from different devices are linked to the same customer.
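One way to picture the linking of sessions from different devices is as a grouping problem: whenever two identifiers are seen together (for example a cookie-id and a login in the same session), they are merged into one customer. The following is a minimal sketch of that idea using a union-find structure. The class name, the method names and the identifier formats are hypothetical and chosen only for illustration; a production solution would also need to handle shared devices and consent.

```python
from collections import defaultdict


class IdentityGraph:
    """Groups identifiers (cookie, mobile, login) that belong to one customer."""

    def __init__(self):
        self._parent = {}

    def _find(self, identifier):
        # Register unseen identifiers as their own group.
        self._parent.setdefault(identifier, identifier)
        root = identifier
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression keeps later lookups fast.
        while self._parent[identifier] != root:
            next_id = self._parent[identifier]
            self._parent[identifier] = root
            identifier = next_id
        return root

    def observe(self, identifier):
        """Record an identifier that has not been linked to anything yet."""
        self._find(identifier)

    def link(self, id_a, id_b):
        """Record that two identifiers were seen together, e.g. at login."""
        root_a, root_b = self._find(id_a), self._find(id_b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def customers(self):
        """Return the groups of identifiers, one set per inferred customer."""
        groups = defaultdict(set)
        for identifier in list(self._parent):
            groups[self._find(identifier)].add(identifier)
        return list(groups.values())


graph = IdentityGraph()
graph.link("cookie:abc", "login:42")  # desktop browser session after login
graph.link("mobile:xyz", "login:42")  # mobile app session after login
graph.observe("cookie:def")           # anonymous visitor, never logged in
```

In this example the desktop cookie and the mobile identifier end up in the same group because both were seen with the same login, while the anonymous cookie stays on its own.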

[Figure: clickstream data]

Business analytics, such as attribution modeling, can be built on the collected data, and it enables customer-level analytics and customer-focused communications. We have discussed the business benefits and challenges of targeted marketing in more detail in our previous blog posts.
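To make attribution modeling more concrete, here is a minimal sketch of two simple and widely used rules: last-touch attribution, which gives all credit to the final channel before a conversion, and linear attribution, which splits the credit evenly. The function name, the input shape and the channel names are assumptions made for this example, not a description of any particular tool.

```python
from collections import defaultdict


def attribute(journeys, model="last_touch"):
    """Split each conversion's value across the channels that preceded it.

    journeys: list of (touchpoints, conversion_value) pairs, where
    touchpoints is the ordered list of channels before the conversion.
    """
    credit = defaultdict(float)
    for touchpoints, value in journeys:
        if not touchpoints:
            continue  # no channel to credit
        if model == "last_touch":
            credit[touchpoints[-1]] += value
        elif model == "linear":
            share = value / len(touchpoints)
            for channel in touchpoints:
                credit[channel] += share
        else:
            raise ValueError(f"unknown model: {model}")
    return dict(credit)


# Hypothetical customer journeys built from clickstream data.
journeys = [
    (["display", "email", "search"], 90.0),
    (["email"], 30.0),
]
last_touch = attribute(journeys, model="last_touch")
linear = attribute(journeys, model="linear")
```

Comparing the two results shows how strongly the choice of model changes the picture: under last-touch, display advertising receives no credit at all, whereas the linear model gives it a third of the first conversion.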

How is this done safely with cloud-based and open source solutions?

A properly built and configured cloud-based environment is as secure as, or even more secure than, an on-premises or other local solution for storing data. In all solutions, both network-level and application-level data security must be taken care of, and the security responsibilities of both the service provider and the company using the service must be understood.

However, due to privacy concerns and regulatory restrictions, it is often more straightforward to start utilizing cloud platforms with data that does not contain strong customer identifiers. In many cases a weaker customer identifier (cookie-id, mobile-id, or another surrogate-id) can be used. Clickstream data is often collected using cloud-based web analytics solutions, which makes it an ideal candidate for storing and processing in a cloud environment - even for companies that have some concerns about what data they should and could store in the cloud.

A functional analytics solution in practice, using an Amazon Web Services and open source stack:

1.    Data is typically collected as it occurs with a browser-based web analytics solution.
Browser-based web analytics solutions require that their JavaScript code is executed when a web page is loaded. The necessary JavaScript tag(s) are added to the web site to collect the different events generated by the customer. Good free web analytics solutions for collecting behavioral data include Snowplow and Google Analytics.

2.    Depending on the tool selected for data collection, there might be a need to code a small custom solution that extracts the collected data through API queries. This custom solution writes the extracted data to an AWS Simple Storage Service (S3) bucket, which is cost-effective storage for raw clickstream data.

3.    Data is processed and refined to the required level using Elastic MapReduce (EMR), the scalable Hadoop solution provided by AWS, and it is stored in another S3 bucket and/or a database (e.g. RDS or Redshift, depending on the need).

4.    If there is a need for near real-time data processing with multiple data sources, or there are multiple backend solutions processing data, the data collection solutions can send the events to an event queue implemented with AWS Kinesis or Apache Kafka. From such an event queue, multiple background applications can pick up and process the events simultaneously.

5.    The data can be analyzed with advanced analytics tools (e.g. open source R) and traditional SQL queries, either from a database set up for the processed data or directly by utilizing Spark or Hive installed on the EMR cluster.

6.    Visualizations and metrics can be generated from the data by integrating, for example, Pentaho's open source solution or a commercial product on top of it, or by building visualizations directly with HTML5.

7.    Different kinds of actions can be triggered based on the collected behavioral data. These include sending targeted e-mails or utilizing the generated customer profiles with real-time personalization solutions.
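As an illustration of the kind of refinement done in step 3, the sketch below groups raw page-view events into sessions per visitor. It is plain single-machine Python meant only to show the logic; on EMR the same idea would typically be expressed as a Spark or Hive job. The event layout (visitor id, timestamp, page) and the 30-minute inactivity timeout are assumptions made for this example.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity threshold


def sessionize(events, timeout=SESSION_TIMEOUT):
    """Group (visitor_id, timestamp, page) events into per-visitor sessions."""
    sessions = []
    # Sort by visitor and then by time so each visitor's events are contiguous.
    ordered = sorted(events, key=itemgetter(0, 1))
    for visitor_id, visitor_events in groupby(ordered, key=itemgetter(0)):
        current = []
        previous_time = None
        for _, timestamp, page in visitor_events:
            # A gap longer than the timeout closes the current session.
            if previous_time is not None and timestamp - previous_time > timeout:
                sessions.append({"visitor_id": visitor_id, "pages": current})
                current = []
            current.append(page)
            previous_time = timestamp
        sessions.append({"visitor_id": visitor_id, "pages": current})
    return sessions


# Hypothetical raw events keyed by a weak identifier (cookie-id).
events = [
    ("cookie:abc", datetime(2015, 1, 1, 10, 0), "/home"),
    ("cookie:abc", datetime(2015, 1, 1, 10, 5), "/product"),
    ("cookie:abc", datetime(2015, 1, 1, 12, 0), "/home"),
    ("cookie:def", datetime(2015, 1, 1, 10, 1), "/home"),
]
sessions = sessionize(events)
```

Here the first visitor produces two sessions, because almost two hours pass between the second and third page view, and the second visitor produces one.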

[Figure: clickstream blog architecture]
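The small custom solution mentioned in step 2, which writes extracted data to S3, could look roughly like the sketch below. It serializes a batch of events as newline-delimited JSON and stores it under a date-partitioned key. The key layout, the function names and the bucket name are hypothetical; the only AWS call used is `put_object` on the S3 client, which is passed in as a parameter so that the logic can be tried without AWS credentials.

```python
import json
from datetime import date


def raw_events_key(day, batch_id, prefix="raw/clickstream"):
    """Build a date-partitioned object key (this layout is an assumption)."""
    return f"{prefix}/dt={day.isoformat()}/batch-{batch_id}.jsonl"


def upload_events(s3_client, bucket, day, batch_id, events):
    """Write a batch of events to one S3 object as newline-delimited JSON."""
    body = "\n".join(json.dumps(event, sort_keys=True) for event in events)
    key = raw_events_key(day, batch_id)
    # put_object is the boto3 S3 client call; any object offering the same
    # method (such as a test double) can be used in its place.
    s3_client.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
    return key
```

With boto3 installed and credentials configured, the client would be created with `boto3.client("s3")` and passed in as `s3_client`. Partitioning the keys by date is one common choice, because it lets later processing jobs read a single day of raw data without scanning the whole bucket.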

The analytics engine described above is quick to set up (4-6 weeks of professional work to become operational). It is a very cost-effective way to produce a standardized platform for collecting, storing, and analyzing clickstream and other behavioral data.