User Interest Modeling: Part 1 - The Evolution of Techniques


Tags
Data Science
Date Published
June 27, 2024
💡
Based on the problem definition here, the objective of a recommender is to find the relevance between a user and an item given a specific context, so understanding users, items and context is crucial for the success of a recommender. In the series Embedding in Recommender, we mostly focused on understanding items. Now let’s talk about users, and specifically focus on how to understand users’ interests.
Blogs from the series User Interest Modelling

Understanding Users in Recommender Systems

Of course, data and features are the key to understanding users!

Data Related to Users

There are two main types of data related to users:

  1. Static data
    1. Demographic data: country, gender, age and so on
    2. Data related to the status of users: whether the user is a first-time or returning user, etc.
  2. User-item interaction data
    1. Historical interactions with items, which can reflect users’ long-term interests
    2. Real-time interactions with items, which can represent users’ short-term interests

In this series, we’ll mainly focus on user-item interaction data.

From User-Item Interaction Data to Users’ Interests

There are two categories of information in user-item interaction data:

  1. Interaction types
  2. Sequential information of these interactions

Different Interaction Types Represent Different Intentions

Take e-commerce as an example: making a booking represents a strong positive signal, while making a click is a relatively weak positive signal. At the same time, an impression without a click is a weak negative signal.

Intentions of various user-item interactions on the Booking.com platform

User-item interactions can mostly be represented as sequence data, and much important information is embedded in the sequential structure.

For example, for user A from an e-commerce platform:

  1. Short-term interest: In the last 30 minutes or 1 day, user A clicked on many types of laptops, with more clicks on Apple ones → user A might want to buy a laptop and might have a preference for Apple laptops
  2. Long-term interest: Considering user A’s orders in the last year, most of the orders are electronics and books → this user might have a long-term interest in electronics and books
  3. Evolution of interest: This user’s clicks on laptops decreased every week over the last 4 weeks → this user might have lost interest in laptops or have bought one somewhere else

How to Represent User-Item Interaction and Users’ Interests

To help the model learn efficiently, we first need to represent this data.

User-Item Interaction Data

Normally, in user-item interaction data, instead of the item_id, every item can be represented with

  1. Static info of the item, such as the item name, the category/sub-category the item belongs to, and so on
  2. The type and timestamp of the user-item interaction
  3. An item embedding learned with embedding techniques

In this case, each item_id in the user-item interaction data is mapped to the combination of the above three categories of data.
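Below is a minimal Python sketch of this mapping. The field names and toy values are purely illustrative and not taken from any specific production system.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class InteractionEvent:
    """One user-item interaction, with the item_id replaced by richer information.

    All field names here are illustrative, not from a specific production system.
    """
    item_name: str            # static info: item name
    category: str             # static info: category / sub-category
    interaction_type: str     # e.g. "click", "booking", "impression"
    timestamp: int            # unix timestamp of the interaction
    item_embedding: List[float] = field(default_factory=list)  # learned item embedding

# a toy interaction sequence for one user
user_history = [
    InteractionEvent("MacBook Air", "electronics/laptop", "click", 1_719_400_000, [0.12, -0.3, 0.8]),
    InteractionEvent("ThinkPad X1", "electronics/laptop", "click", 1_719_400_120, [0.10, -0.2, 0.7]),
    InteractionEvent("Python Crash Course", "books/programming", "booking", 1_719_300_000, [-0.5, 0.4, 0.1]),
]
```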

Users’ Interests

Most of the time, a user embedding is extracted from user-item interaction data to represent users’ short- and long-term interests. It’s also an option to use several user embeddings to represent a user’s diverse interests.

Preliminary: from Pooling to Transformer

Since user-item interaction data is usually in the form of sequential data, techniques from NLP are often borrowed. Let’s first check some preliminary knowledge.

We’ll only review the knowledge necessary to understand the next section; for a detailed explanation, I recommend checking Chapter 11, Attention Mechanisms and Transformers, from the book Dive Into Deep Learning.

Pooling & Attention

In computer vision and natural language processing, pooling and attention share the same high-level goal of capturing important information.

Pooling

The most important function of pooling is to reduce the spatial dimensions of the feature map, thereby reducing computational complexity and preventing overfitting. There are different types of pooling mechanisms:

  1. Min pooling
  2. Max pooling
  3. Average pooling
  4. Weighted pooling
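Here is a small NumPy sketch of these pooling variants applied to a toy set of item embeddings; the embeddings and weights are made up for illustration.

```python
import numpy as np

# a toy "sequence" of 4 item embeddings, each of dimension 3
embeddings = np.array([
    [0.2, 0.1, -0.4],
    [0.5, -0.3, 0.2],
    [0.1, 0.4, 0.0],
    [-0.2, 0.2, 0.3],
])

min_pooled = embeddings.min(axis=0)       # min pooling
max_pooled = embeddings.max(axis=0)       # max pooling
avg_pooled = embeddings.mean(axis=0)      # average pooling

# weighted pooling: e.g. give more recent interactions larger weights
weights = np.array([0.1, 0.2, 0.3, 0.4])  # weights sum to 1 here
weighted_pooled = weights @ embeddings    # (4,) @ (4, 3) -> (3,)

print(min_pooled, max_pooled, avg_pooled, weighted_pooled, sep="\n")
```

Whatever the variant, pooling collapses a variable-length set of vectors into one fixed-size vector.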

Attention

Attention is a mechanism that allows a neural network to focus on the most relevant parts of the input data when making predictions.

Given a query $q$ and $m$ tuples of keys and values $\{(k_1, v_1), (k_2, v_2), \dots, (k_m, v_m)\}$, attention is defined as

$$\text{Attention}(q, k, v) = \sum_{i=1}^{m}\text{softmax}\big(a(q, k_i)\big)\, v_i$$

where softmax is used to get normalized and non-negative weights, and $a(q, k_i)$ is the attention scoring function. There are mainly two types of attention scoring functions:

  1. Dot product attention (or scaled dot product attention)
  2. Additive attention
Attention Mechanism (image from Dive into Deep Learning)
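As a concrete illustration of the formula above, here is a small NumPy sketch of attention with the scaled dot-product scoring function; the query, keys and values are random toy data.

```python
import numpy as np

def softmax(x):
    x = x - x.max()                 # numerical stability
    e = np.exp(x)
    return e / e.sum()

def scaled_dot_product_attention(q, K, V):
    """Attention(q, K, V) = sum_i softmax(a(q, k_i)) * v_i
    with the scaled dot product a(q, k_i) = q . k_i / sqrt(d)."""
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)     # a(q, k_i) for every key, shape (m,)
    weights = softmax(scores)       # normalized, non-negative attention weights
    return weights @ V              # weighted sum of the values

# one query and m = 3 key-value pairs, all of dimension 4
q = np.array([0.1, 0.3, -0.2, 0.5])
K = np.random.randn(3, 4)
V = np.random.randn(3, 4)
print(scaled_dot_product_attention(q, K, V))
```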

Sequence-to-Sequence

For the task of machine translation, it’s common to adopt an encoder-decoder architecture for sequence-to-sequence learning based on two RNNs.

  1. The RNN encoder generates a fixed-shape context variable and feeds it to the decoder.
  2. The decoder generates the output based on this fixed-shape context variable and the previously generated tokens.
Encoder-decoder Architecture (image from Dive into Deep Learning)

Sequence-to-Sequence with Attention

In the sequence-to-sequence model, the encoder summarizes information from the input sequence into a fixed-shape context variable. However, when coping with long sentences, there might not be enough space in this context variable (hidden state) to include all the important information. Consequently, the decoder may fail to translate long and complex sentences.

To solve the above issue, the attention mechanism is introduced to the sequence-to-sequence model. The key idea is: instead of a fixed-shape context variable, all hidden states from the encoder are fed to the decoder. At each decoding step, an attention mechanism is applied to these hidden states to aggregate and extract the relevant important information.

Sequence-to-sequence with Attention (image from Dive into Deep Learning)

Transformer

The Transformer was originally invented for machine translation, and it has an encoder-decoder structure, the same as that of the sequence-to-sequence model. The main changes are:

  1. It completely removes the structures common in sequential models: recurrent layers and convolutional layers.
  2. Instead, positional embeddings and multi-head self-attention are used to capture the sequential property and the dependencies among the input data. Below is the structure of the Transformer.
Architecture of Transformer

Encoder-only Transformer, Decoder-only Transformer and Encoder-decoder Transformer

In practice, the Transformer can be used in different forms. Take large-scale pre-trained language models as an example:

  1. Encoder-only Transformer: BERT
  2. Encoder-decoder Transformer: T5
  3. Decoder-only Transformer: GPT

Overall, each type of Transformer has its own strengths:

  1. The encoder-only Transformer is primarily designed to understand the input sequence, and it can only generate a fixed-length representation. It excels at tasks like text classification, token prediction and sequence labelling.
  2. The decoder-only Transformer is designed to generate text in an autoregressive manner, which allows it to produce coherent and contextually relevant output sequences. This makes decoder-only Transformers, such as GPT, well-suited for tasks like open-ended text generation, dialogue systems, and language modelling.

The Evolution of User Interest Modeling Techniques

Before Deep Learning: Feature Engineering

Before deep learning, to learn users’ interests, features were created from historical or real-time interactions by counting the number of various types of interactions (see the sketch after this list):

  1. Long-term interest: number of bookings of a product or category of products
  2. Short-term interest: number of clicks in the last 10 minutes or in the same session
  3. Evolution of interests: number of clicks/bookings/time spent on a product or category of products in the last 1 day/7 days/1 month/3 months/1 year
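Here is a minimal sketch of such counting-based features; the log format, time windows and categories are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

# hypothetical interaction log: (timestamp, interaction_type, category)
log = [
    (datetime(2024, 6, 27, 10, 0), "click", "laptop"),
    (datetime(2024, 6, 27, 10, 5), "click", "laptop"),
    (datetime(2024, 6, 27, 10, 7), "click", "book"),
    (datetime(2024, 6, 1, 9, 0), "booking", "laptop"),
    (datetime(2024, 3, 15, 9, 0), "booking", "book"),
]

now = datetime(2024, 6, 27, 10, 10)

def count_interactions(log, interaction_type, window):
    """Count interactions of a given type per category inside a time window."""
    return Counter(
        category
        for ts, itype, category in log
        if itype == interaction_type and now - ts <= window
    )

short_term = count_interactions(log, "click", timedelta(minutes=10))   # short-term interest
long_term = count_interactions(log, "booking", timedelta(days=365))    # long-term interest
print(short_term)   # Counter({'laptop': 2, 'book': 1})
print(long_term)    # Counter({'laptop': 1, 'book': 1})
```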

Concatenation & Pooling → Aggregating All Information

As mentioned before, since deep learning became popular, items can be represented as embeddings. Then we can feed the item embeddings of clicked items directly to the model and let the model learn the patterns itself: the item embeddings are concatenated and fed into the model.

However, most of the time there are many user-item interactions, so to accelerate training, it’s better to extract and keep only the most important information with pooling. Besides sum/average/min pooling, weighted sum pooling is also an option, where the weights are decided by when the user-item interaction happened.

Below are two real-world examples:

The Wide & Deep Learning (WDL) model from Google (2016), where the embeddings of installed apps and impression apps are concatenated
The model from the candidate generation stage of YouTube (2016), in which average pooling is applied to the embeddings of watched videos and search tokens
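Below is a minimal NumPy sketch in the spirit of the YouTube candidate-generation model: watched-video and search-token embeddings are average-pooled, then concatenated with static user features to form a fixed-size input vector for the feed-forward layers. All dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical learned embeddings: 50 watched videos and 10 search tokens, dim 8 each
watched_video_embs = rng.normal(size=(50, 8))
search_token_embs = rng.normal(size=(10, 8))

# average pooling turns variable-length histories into fixed-size vectors
watched_pooled = watched_video_embs.mean(axis=0)   # (8,)
search_pooled = search_token_embs.mean(axis=0)     # (8,)

# static user features (already numeric), e.g. geography, age, gender
user_features = rng.normal(size=(4,))

# concatenation gives a fixed-size input vector for the feed-forward layers
model_input = np.concatenate([watched_pooled, search_pooled, user_features])
print(model_input.shape)   # (20,)
```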

Target Attention → Focusing on Relevant Information

📌
While users are checking items, intuitively, the historical behaviors related to the candidate item contribute the most to users’ feedback: while checking laptops, a user’s previous interactions with laptops, rather than with books, drive the decision making.

With the pooling and concatenation approaches mentioned before, most of the time all items are treated equally, but as mentioned above, they actually contribute differently to decision making.

To solve this issue, target attention is introduced: attention weights are calculated based on the relevance between the candidate item and each historical item, and the item embeddings are then aggregated based on these attention weights.

Deep Interest Network (DIN) from Alibaba (2018) is a typical example.
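A minimal sketch of target attention is shown below. For simplicity it uses dot-product scoring, whereas DIN itself learns the scoring function with a small MLP; the embeddings are random toy data.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def target_attention(candidate_emb, history_embs):
    """Aggregate historical item embeddings according to their relevance
    to the candidate item (dot-product scoring for simplicity)."""
    scores = history_embs @ candidate_emb          # relevance of each historical item
    weights = softmax(scores)                      # attention weights
    return weights @ history_embs                  # user interest w.r.t. this candidate

rng = np.random.default_rng(1)
history_embs = rng.normal(size=(20, 8))            # 20 historical items, dim 8
laptop_candidate = rng.normal(size=(8,))
book_candidate = rng.normal(size=(8,))

# the same history produces different user representations for different candidates
print(target_attention(laptop_candidate, history_embs))
print(target_attention(book_candidate, history_embs))
```

Note how the same history yields a different user representation for each candidate item, which is exactly the point of target attention.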

Sequential Model with Attention → Extracting Interest Evolution

Although the attention mechanism captures users’ interests better than concatenation and pooling, sequential information is still not considered. Then sequential models with attention were introduced:

Deep Interest Evolution Network (DIEN) from Alibaba (2019)
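Below is a much-simplified PyTorch sketch of the idea behind DIEN: a GRU models how interests evolve along the behavior sequence, and target attention over the GRU states focuses on the parts relevant to the candidate item. DIEN itself additionally uses an auxiliary loss and an attention-gated GRU (AUGRU); all dimensions here are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterestEvolutionSketch(nn.Module):
    """A simplified sketch in the spirit of DIEN: GRU + target attention."""

    def __init__(self, emb_dim: int, hidden_dim: int):
        super().__init__()
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.query_proj = nn.Linear(emb_dim, hidden_dim)

    def forward(self, history_embs, candidate_emb):
        # history_embs: (batch, seq_len, emb_dim), candidate_emb: (batch, emb_dim)
        states, _ = self.gru(history_embs)                  # (batch, seq_len, hidden)
        query = self.query_proj(candidate_emb)              # (batch, hidden)
        scores = torch.einsum("bsh,bh->bs", states, query)  # relevance per time step
        weights = F.softmax(scores, dim=1)                  # (batch, seq_len)
        return torch.einsum("bs,bsh->bh", weights, states)  # evolved interest vector

model = InterestEvolutionSketch(emb_dim=8, hidden_dim=16)
interest = model(torch.randn(2, 30, 8), torch.randn(2, 8))
print(interest.shape)   # torch.Size([2, 16])
```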

Transformer → Learning Users’ Interests More Efficiently

Compared to a sequential model with an attention mechanism, the Transformer can achieve the same goal, but more efficiently.

As mentioned above, the Transformer can be used in different ways. Since in user interest modeling the purpose is mainly to extract important information from user-item interaction data, the encoder-only Transformer is used the most. However, depending on the use case, the encoder-decoder Transformer and adapted versions of the encoder-only Transformer are also adopted:

  1. Most of the time, we expect the model to consider the relationship between the candidate item and the historical items. There are two options:
    1. Encoder-decoder Transformer, where the embedding of the candidate item is used as the query of the decoder and the embeddings of the historical items are used as the query, key and value of the encoder. Below is the Deep Interest Transformer from JD.com, an e-commerce platform in China:
    2. Adapted encoder-only Transformer
      • Target attention from models such as DIN
      • Self-attention functioning as target attention: the target item is treated as the last item of the historical interaction sequence, and the combined sequence is used as the query, key and value of the encoder. In this case, the relationships between the candidate item and the historical items, and among the historical items themselves, can be learned together. Below is the Behavior Sequence Transformer (BST) from Alibaba (a minimal sketch of this idea follows after this list):
  2. Due to the computation cost, there are use cases where users’ interests should be extracted offline and the resulting information is then used online. In this case, while extracting user interests, the candidate item is not available, and a standard encoder-only Transformer is a good option:
    1. PinnerFormer from Pinterest:
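Here is a minimal PyTorch sketch of the BST-style option described above: the target item is appended to the behavior sequence, positional embeddings are added, and a Transformer encoder learns candidate-history and history-history relationships together. Layer sizes and the pooling at the end are illustrative choices, not the exact BST configuration.

```python
import torch
import torch.nn as nn

class BSTStyleEncoder(nn.Module):
    """Minimal sketch of the BST idea: encoder-only Transformer over
    the behavior sequence with the target item appended as the last item."""

    def __init__(self, emb_dim: int = 32, max_len: int = 21, nhead: int = 4):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, emb_dim)
        layer = nn.TransformerEncoderLayer(d_model=emb_dim, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, history_embs, target_emb):
        # history_embs: (batch, seq_len, emb_dim), target_emb: (batch, emb_dim)
        seq = torch.cat([history_embs, target_emb.unsqueeze(1)], dim=1)  # target as last item
        positions = torch.arange(seq.size(1), device=seq.device)
        seq = seq + self.pos_emb(positions)                              # positional information
        encoded = self.encoder(seq)                                      # (batch, seq_len + 1, emb_dim)
        return encoded.mean(dim=1)                                       # pooled representation for an MLP head

model = BSTStyleEncoder()
out = model(torch.randn(2, 20, 32), torch.randn(2, 32))
print(out.shape)   # torch.Size([2, 32])
```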

Diverse Interests Modeling

📌
We can always find diverse interests in user-item interaction data: in one click sequence, a user might have interactions with products from several different categories: clothes, sports and food.

The methods explained before will, most of the time, create a single embedding to represent a user’s interests. However, as mentioned above, users have diverse interests, and one embedding might not be enough to represent all of them. So a new idea was proposed: represent a user’s interests with multiple embeddings.

Techniques such as the capsule routing mechanism and self-attention are adopted to achieve this goal:

Multi-Interest Network with Dynamic Routing (MIND) from Alibaba (2019)
Controllable Multi-Interest Framework for Recommendation (ComiRec) from Alibaba
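Below is a minimal NumPy sketch of self-attentive multi-interest extraction in the spirit of ComiRec-SA: K attention heads over the behavior sequence produce K interest embeddings. The weight matrices are random here, whereas in a real model they are learned; all sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_interest_self_attention(H, num_interests=4, hidden=16, seed=0):
    """Each of the K attention heads attends to a different part of the
    behavior sequence H (seq_len x emb_dim) and yields its own interest embedding."""
    rng = np.random.default_rng(seed)
    seq_len, emb_dim = H.shape
    W1 = rng.normal(size=(hidden, emb_dim))
    W2 = rng.normal(size=(num_interests, hidden))
    A = softmax(W2 @ np.tanh(W1 @ H.T), axis=1)   # (K, seq_len) attention weights
    return A @ H                                  # (K, emb_dim) interest embeddings

behavior_sequence = np.random.randn(50, 8)        # 50 interactions, dim-8 item embeddings
interests = multi_interest_self_attention(behavior_sequence)
print(interests.shape)                            # (4, 8): four interest embeddings
```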

Life-long User Interest Modeling

More User Behavior, Better Model

In many scenarios, there are plenty of user actions. For example, in e-commerce, a user can generate hundreds of actions each day.

The data below, from Alibaba, shows statistics of sequential user behavior data and the corresponding model performance.


Challenges

However, there are challenges in modelling long user behavior sequences:

  1. System Latency and Storage Cost: The system latency and storage cost increase approximately linearly with the length of the user behavior sequence, making it difficult to handle very long sequences.
  2. Noise in Long Histories: There is a lot of noise and irrelevant information in extremely long user behavior histories, which can degrade the performance of sequential user interest modeling.

Solutions

To address these challenges, two types of solutions have been developed:

  1. More powerful models with more efficient architectures: MIMN from Alibaba
  2. Retrieval-based solutions: retrieve the most relevant historical behaviors first, then apply learning algorithms
    1. Two-stage user behavior retrieval: SIM from Alibaba, SDIM from Meituan
    2. End-to-end user behavior retrieval: ETA from Alibaba

In terms of the first type of solution, an example is the Multi-channel User Interest Memory Network (MIMN) from Alibaba, where a memory-based architecture is utilized to capture user interests. MIMN is the first industrial solution that can model sequential user behavior data with length scaling up to 1000.

Architecture of MIMN from Alibaba (2019)

For the retrieval-based solutions, examples are the Search-based Interest Model (SIM) from Alibaba and Sampling-based Deep Interest Modeling (SDIM) from Meituan. The main difference is how the most important items are retrieved in the first stage. SIM can model sequential user behavior data with length scaling up to 54,000!

SDIM from Meituan (2022)

Recently, it was found that two-stage user behavior retrieval suffers from divergent targets and outdated embeddings. End-to-end learning is a better solution, and End-to-end Target Attention (ETA) from Alibaba is a good example.

ETA from Alibaba (2021)
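To give a flavor of how such retrieval can be made cheap, below is a minimal NumPy sketch of SimHash-based retrieval, the locality-sensitive hashing idea used by ETA (and, in a different form, by SDIM): item embeddings are hashed into short binary signatures, and Hamming distance replaces expensive dot products when selecting the top-k relevant behaviors. All sizes are illustrative, and the real models learn the embeddings end to end.

```python
import numpy as np

rng = np.random.default_rng(0)

def simhash_signatures(embs, hyperplanes):
    """Binary SimHash signatures: the sign pattern of each embedding
    projected onto a set of random hyperplanes."""
    return (embs @ hyperplanes.T > 0).astype(np.int8)    # (n, num_bits)

# a long behavior sequence (e.g. thousands of items) and one candidate item
history_embs = rng.normal(size=(10_000, 16))
candidate_emb = rng.normal(size=(16,))

hyperplanes = rng.normal(size=(32, 16))                  # 32 hash bits
history_sig = simhash_signatures(history_embs, hyperplanes)
candidate_sig = simhash_signatures(candidate_emb[None, :], hyperplanes)[0]

# Hamming distance replaces expensive dot products when retrieving the
# top-k historical behaviors most relevant to the candidate item
hamming = (history_sig != candidate_sig).sum(axis=1)
top_k = np.argsort(hamming)[:100]                        # indices of retrieved behaviors
print(top_k.shape)                                       # (100,)
```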

Summary

To wrap up,

  1. Lots of information is encoded in user-item interaction data: users’ short- and long-term interests, diverse interests, and the evolution of interests.
  2. Since user-item interaction data is sequential data, many techniques are borrowed from NLP. And currently, just as in NLP, the Transformer is the most popular technique for learning users’ interests.
  3. In industry, there is a trend towards including more user-item interaction data to learn more aspects of users’ interests.

Reference