
How do you use sequential data to refine navigational analysis?


When it comes to analysing behavioural web analytics data, analysts often work with aggregates: visit duration, number of page views, bounce rate… These variables make it possible to describe each navigation sequence in a very structured way. They are easy to represent for analysis purposes, or to use in a machine learning environment for example. 

In this article, we provide suggestions for processing navigation data at a somewhat more refined, less aggregated scale, and with new information. The aim is to exploit the sequences of events constituting the visits in a quantitative and fully automated way. More specifically, we look at page sequences over the course of visits, bearing in mind that the approach is transposable to sequences of events, products, etc. 

The first step is to define the objectives of sequence data processing and then to outline a procedure for the automatic extraction of sequences of interest. 

Objectives: identify and exploit sub-sequences of interest 

As Data Scientists at AT Internet, our role is to design solutions that automate the extraction of information and assist our users in their analyses. Since our data model allows us to define objectives (financial or otherwise) on websites, we decided to use this conversion objective as the variable of interest. In this context, we aim to: 

  • Identify navigation sub-sequences that are related to the conversion rate and highlight these relationships.  
  • Annotate the presence of such sub-sequences in the customer journeys and use them as predictive variables in a machine learning model. 

For example: a website records a conversion rate of 3%, but we find that visits to pages A, B and then C generate a conversion rate of 14%. The sequence A>B>C has an associated conversion rate higher than the average conversion rate. It can therefore be considered as a sub-sequence of interest. 

Data sets and modelling 

Identifying data sets 

First of all, we need to give some context to our navigation sequence data: 

  • Each sequence is annotated as successful or unsuccessful. 
  • We work with sequences of pages visited, for example: [Home > Product1 > Conditions of sale] (sequences are deliberately truncated once a conversion page is reached, so as not to introduce bias). 

In the end, you get a dataset like this one: 

user | sequence                                           | conversion
123  | Homepage > Product1 > Privacy > Contact            | false
456  | Product 1 > Product 2 > Product 1 > Return Policy  | true
789  | Product 2                                           | false
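To make this concrete, here is a minimal Python sketch (the `contains_subsequence` helper and the toy values are ours, purely for illustration) that annotates each visit with the presence of a candidate sub-sequence and compares the conversion rate with and without it:

```python
def contains_subsequence(visit, pattern):
    """True if `pattern` appears in `visit` in order (not necessarily contiguously)."""
    it = iter(visit)
    return all(page in it for page in pattern)

# Toy dataset: (sequence of pages, did the visit convert?)
visits = [
    (["Homepage", "Product1", "Privacy", "Contact"], False),
    (["Product1", "Product2", "Product1", "ReturnPolicy"], True),
    (["Product2"], False),
]

candidate = ["Product1", "ReturnPolicy"]   # a candidate sub-sequence

with_it = [conv for seq, conv in visits if contains_subsequence(seq, candidate)]
without_it = [conv for seq, conv in visits if not contains_subsequence(seq, candidate)]

print(sum(with_it) / len(with_it))        # conversion rate when the pattern is present
print(sum(without_it) / len(without_it))  # conversion rate when it is absent
```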

Such data sets are often imbalanced: the non-converting group is usually in the majority, frequently representing more than 95% of visitors. This is a problem commonly encountered in machine learning (conversion, fraud detection, churn, …) and could be the subject of an entire article by itself! 

Modelling and evaluation of sequences of interest 

What are the objectives? 

A sequence of interest is defined by two criteria: 

  • Its presence must affect the conversion rate (positively or negatively). 
  • It must be sufficiently present for its impact to be significant across the dataset. 

Quantitative evaluation 

In science, it is best practice to reason quantitatively, so we’ll start by focusing on a metric that will be used to evaluate candidate navigation sequences. In this way, we will be able to rank them and retain only the most useful ones. 

In information theory, there is a metric that quantifies uncertainty: entropy. If we let Y ∈ {0,1} be a random variable indicating whether there has been a conversion, the entropy H(Y) quantifies how random this event is on a scale of 0 to 1 (using a base-2 logarithm). For a binary variable, entropy is simply defined as follows: 

H(Y) = −[P(Y=0) log P(Y=0) + P(Y=1) log P(Y=1)]    (1) 

where P(Y=1) means: “the probability that Y equals 1”, i.e. the probability that a conversion takes place. We can estimate this probability with the observed conversion rate. 

A simple graph helps to interpret this: 

The further the probability of the outcome deviates from equilibrium (0.5), the less uncertainty there is about the outcome (so far, logical!). 
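For readers who prefer code to curves, here is a minimal sketch of the binary entropy of Equation (1), using a base-2 logarithm so that the scale runs from 0 to 1:

```python
import math

def binary_entropy(p):
    """Entropy H(Y), in bits, of a binary variable with P(Y=1) = p."""
    if p in (0.0, 1.0):          # a certain outcome carries no uncertainty
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in (0.03, 0.14, 0.5, 0.95):
    print(f"P(conversion) = {p:.2f}  ->  H(Y) = {binary_entropy(p):.3f}")
# Uncertainty is maximal at p = 0.5 and falls towards 0 as p approaches 0 or 1.
```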

What interests us is the impact of the presence of a given sequence of pages (e.g. A>B>C) on the realisation of Y. Writing X for the presence of A>B>C in the visit (1 if present, 0 if not), we can define the conditional entropy:

H(Y|X) = P(X=1) H(Y|X=1) + P(X=0) H(Y|X=0)    (2) 

where H(Y|X=1) means: “the entropy of Y when X equals 1”, i.e. the uncertainty that remains about the act of conversion once we know that the sequence A>B>C is included in the sequence of pages visited. 

Conditional entropy now allows us to quantify the uncertainty on the outcome of Y when the outcome of X is known. 

We have now brought together the two ingredients needed to calculate mutual information:

I(X,Y) = H(Y) − H(Y|X)    (3) 

We therefore read that I(X,Y) is the difference between the uncertainty of the outcome of Y (will the user convert?) and the uncertainty of Y when X is known (will the user convert knowing that they have carried out the sequence A>B>C?). So, this is precisely the information that X carries about Y. 

It is important to note that, in addition to meeting our need for information extraction, mutual information also depends on the frequency of appearance of sequences (see Equation 2) and on the conversion rate when they are present or absent (see Equations 1 and 2), as required in the previous section. 
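As a rough, self-contained sketch (variable names are ours), Equations (1) to (3) can be estimated from an annotated dataset as follows, with X the presence of a candidate sub-sequence and Y the conversion flag:

```python
import math

def entropy(p):
    """Binary entropy in bits for a probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def mutual_information(x, y):
    """I(X, Y) for two equal-length lists of booleans.

    x[i]: the candidate sub-sequence is present in visit i.
    y[i]: visit i converted.
    """
    n = len(y)
    p_x1 = sum(x) / n                     # P(X = 1): frequency of the pattern
    p_y1 = sum(y) / n                     # P(Y = 1): overall conversion rate

    # Conversion rates conditioned on the presence / absence of the pattern.
    y_given_x1 = [yi for xi, yi in zip(x, y) if xi]
    y_given_x0 = [yi for xi, yi in zip(x, y) if not xi]

    h_y = entropy(p_y1)                                                # Equation (1)
    h_y_given_x = (p_x1 * entropy(sum(y_given_x1) / len(y_given_x1))
                   + (1 - p_x1) * entropy(sum(y_given_x0) / len(y_given_x0)))  # Equation (2)
    return h_y - h_y_given_x                                           # Equation (3)

# Toy example: the pattern is present in half of the visits and perfectly
# linked to conversion, so it carries 1 bit of information about Y.
x = [True, True, False, False]
y = [True, True, False, False]
print(mutual_information(x, y))
```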

Information theory was founded by Claude Shannon in the late 1940s and was initially applied in the field of telecommunications. Today it is present in many algorithms used for artificial intelligence applications. Its foundations are presented here in a very simplified way, but a solid mathematical theory underpins the concepts presented. This video (in French with subtitles) details the origins of this exciting field. 

Candidate sequences 

Now that we are fully equipped to rank sequences by usefulness, we need to define a strategy for generating candidate sequences. The search space quickly becomes vast: for a website with 15 pages, considering only navigation sequences of at most 20 pages, there are more than 15^20 ≈ 3×10^23 different sequences to test, which is about the number of stars in the observable universe – quite a few… 

The choice of our generation of candidate sequences is guided by two needs:   

  • The generation must be based on simple algorithms, so that it leaves room for future iterations and can be presented to less technical stakeholders. 
  • It must use algorithms available in big data frameworks, in order to guarantee its generalisation to all AT Internet clients, whose data volumes vary considerably. 

For these two reasons we have chosen the PrefixSpan algorithm, which can be explained in a few lines with a diagram and whose implementation is available in Apache Spark MLlib. 

PrefixSpan is a sequential pattern mining algorithm: in our sequence dataset, it is about identifying the most frequent sub-sequences. The algorithm has two parameters: 

  • The minimum support: the minimum frequency a sub-sequence must reach across the dataset to be considered frequent. 
  • The maximum length of the sub-sequences searched for, in order to limit exploration. 
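For indication only, here is what a minimal call to the Spark MLlib implementation could look like in PySpark (the page names come from the example table above, and the thresholds are placeholders rather than recommendations):

```python
from pyspark.sql import SparkSession
from pyspark.ml.fpm import PrefixSpan

spark = SparkSession.builder.getOrCreate()

# One row per visit; each visit is a sequence of itemsets,
# and here every itemset contains a single page.
visits = spark.createDataFrame(
    [
        ([["Homepage"], ["Product1"], ["Privacy"], ["Contact"]],),
        ([["Product1"], ["Product2"], ["Product1"], ["ReturnPolicy"]],),
        ([["Product2"]],),
    ],
    ["sequence"],
)

prefix_span = PrefixSpan(
    minSupport=0.2,       # minimum fraction of visits that must contain a pattern
    maxPatternLength=5,   # maximum length of the mined sub-sequences
    sequenceCol="sequence",
)

# Returns a DataFrame of frequent sub-sequences with their support counts.
prefix_span.findFrequentSequentialPatterns(visits).show(truncate=False)
```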

PrefixSpan guarantees the extraction of all frequent sub-sequences (in the sense of minimum support) – for this purpose it explores the tree of possible sub-sequences. The diagram below is an example with pages named a, b, c, …, z. A path from the root of the tree to a node therefore represents a sub-sequence of visited pages. 

As mentioned above, it is unthinkable to explore this tree entirely. This is where PrefixSpan comes into play – by reducing the cost of exploration using two strategies: 

  • Pruning of the tree by the anti-monotonicity rule: if the A>B>C sub-sequence does not reach the frequency threshold, no sequence of which it is the prefix can reach it either. For example, A>B>C>D can be at most as frequent as A>B>C, since the latter is included in the former. There is therefore no need to explore the sequences for which A>B>C is the prefix, so we can prune all the branches that follow and reduce the computation. 
  • The projection of candidate suffixes: at each step, only the suffixes (the parts of the sequences that follow the current prefix) that are still candidates are kept. For example, if we have just processed the sequence A>B>C, we eliminated at the first step all the sub-sequences not beginning with A, then all those not beginning with A>B, then all those not beginning with A>B>C: the set of sequences to be tested quickly shrinks. 

These are the broad outlines of the algorithm; you can find more detailed information in the original article. 
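To make prefix projection and anti-monotonic pruning more tangible, here is a toy, deliberately unoptimised Python sketch of the idea; it is our own simplification, not the paper's or Spark's implementation:

```python
def prefixspan(sequences, min_support, max_length):
    """Return (pattern, support) pairs found in at least `min_support` sequences."""
    results = []

    def project(db, item):
        # Keep, for each sequence, only the suffix following the first
        # occurrence of `item`; sequences without `item` are dropped.
        return [seq[seq.index(item) + 1:] for seq in db if item in seq]

    def mine(prefix, db):
        if len(prefix) == max_length:
            return
        # Support of every page that could extend the current prefix.
        counts = {}
        for seq in db:
            for item in set(seq):
                counts[item] = counts.get(item, 0) + 1
        for item, support in counts.items():
            if support >= min_support:              # anti-monotonicity: infrequent extensions are pruned
                pattern = prefix + [item]
                results.append((pattern, support))
                mine(pattern, project(db, item))    # recurse on the projected (suffix) database

    mine([], sequences)
    return results

# Toy run: patterns of length <= 3 present in at least 2 of the 3 visits.
db = [["A", "B", "C", "D"], ["A", "C", "B"], ["B", "A", "C"]]
print(prefixspan(db, min_support=2, max_length=3))
```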

Summing up 

We set ourselves an objective: to identify sub-sequences that are informative about the act of conversion. 

We quantified the notion of an informative sub-sequence using mutual information. 

We have a routine for generating candidate sequences, with the guarantee that they are frequent: PrefixSpan. This step requires a minimum frequency of appearance and, optionally, a maximum sequence length. 

The process of extracting informative sequences therefore consists of generating candidate sequences, then accepting or rejecting them according to the mutual information criterion, or alternatively ranking them by the amount of information they provide. 

Analytics benefits 

From a descriptive point of view 

This work makes it possible to develop an advanced sequence exploration product to complement the navigation module. An analyst could filter the sequences according to a minimum frequency of occurrence and analyse the effect of such sequences on the conversion rate. Alternatively, they could keep only the sequences associated with the highest or lowest conversion rates and take action to promote or avoid them. Finally, by setting conversion and frequency thresholds, any AT Internet user could quickly identify the key sequences in their visitors' journeys. 

From a predictive point of view 

As part of a project to identify hesitant shoppers, we began by developing a model for learning purchasing behaviour based on the simple metrics described at the beginning of this article: page views, visit time, number of visits, etc. Then, as the model was iterated, we injected variables indicating the presence of sequences selected as informative using the procedure described above. The addition of these variables improved our prediction metrics by 5 to 10% depending on the case, and they were therefore retained in our modelling. 
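Below is a highly simplified sketch of this kind of feature injection; the chosen sub-sequences, feature values and scikit-learn model are purely illustrative and do not describe our production setup:

```python
from sklearn.linear_model import LogisticRegression

def contains_subsequence(visit, pattern):
    """True if `pattern` appears in `visit` in order (not necessarily contiguously)."""
    it = iter(visit)
    return all(page in it for page in pattern)

# Aggregated metrics per visit: [page views, visit time (s), number of visits].
base_features = [
    [4, 180.0, 1],
    [4, 95.0, 3],
    [1, 10.0, 1],
    [6, 240.0, 2],
]
visits = [
    ["Homepage", "Product1", "Privacy", "Contact"],
    ["Product1", "Product2", "Product1", "ReturnPolicy"],
    ["Product2"],
    ["Homepage", "Product1", "Product2", "ReturnPolicy", "Contact"],
]
converted = [0, 1, 0, 1]

# Sub-sequences previously selected as informative (illustrative values).
informative_patterns = [["Product1", "ReturnPolicy"], ["Homepage", "Contact"]]

# Append one binary indicator per informative sub-sequence to each feature vector.
features = [
    base + [int(contains_subsequence(seq, pat)) for pat in informative_patterns]
    for base, seq in zip(base_features, visits)
]

model = LogisticRegression().fit(features, converted)
print(model.predict_proba(features)[:, 1])   # predicted conversion probabilities
```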

To go even further 

An informed reader will notice that this presentation takes a few shortcuts. To avoid overloading the technical aspects, a few problems have been left aside, such as: 

  • Class imbalance means that sequences specific to converting users carry little weight in the calculation of PrefixSpan support. For example, if the conversion rate is only 1% and a discriminating sequence is present only in the buyer group, its frequency will weigh at most 1% in the minimum-support calculation (which was closer to 20% in our uses, as a quality/quantity trade-off for the sequences) and the sequence will be filtered out. It is advisable to carry out the extraction separately for the two groups (navigation sequences with and without conversion).  
  • As sequences are extracted using a univariate criterion (only one sequence is considered at a time), it is common for similar sequences or sequences bearing the same information to be extracted. Post-processing by looking for co-occurrences may be necessary to reduce information redundancy. 
  • Finally, a paper co-authored by Amazon and Siemens suggests a sequence extraction procedure that incorporates the selection step into sequence generation (based on a different algorithm: BIDE), which makes it possible to prune the tree of possible sequences even more efficiently. This approach seems promising, but it was not initially explored because of its implementation cost. 