- Understanding Users in Recommender Systems
- Data Related to Users
- From User-Item Interaction Data to Users’ Interests
- How to Represent User-Item Interaction and Users’ Interests
- Preliminary: from Pooling to Transformer
- Pooling & Attention
- Sequence-to-Sequence
- Sequence-to-Sequence with Attention
- Transformer
- The Evolution of User Interest Modeling Techniques
- Before Deep Learning: Feature Engineering
- Concatenation & Pooling → Aggregating All Information
- Target Attention → Focusing on Relevant Information
- Sequential Model with Attention → Extracting Interest Evolution
- Transformer → Learning Users’ Interests More Efficiently
- Diverse Interests Modeling
- Life-long User Interest Modeling
- Summary
- Reference
Understanding Users in Recommender Systems
Of course, data and features are the key to understanding users!
Data Related to Users
There are two main types of data related to users:
- Static data
- Demographic data: country, gender, age, and so on
- Data related to the status of users: whether they are first-time or returning users, and so on
- User-item Interaction data
- Historical interactions with items, which can reflect users’ long-term interests
- Real-time interactions with items, which can represent users’ short-term interests
In this series, we’ll mainly focus on user-item interaction data.
From User-Item Interaction Data to Users’ Interests
There are two main categories of information in user-item interaction data:
- Interaction types
- Sequential information of these interactions
Different Interaction Types Represent Different Intentions
Take e-commerce as an example: making a booking represents a strong positive signal, while making a click is a relatively weak positive signal. At the same time, an impression without a click is a weak negative signal.
User-item interactions can mostly be represented as sequence data, and a lot of important information is embedded in the sequential structure.
For example, user A from an e-commerce platform:
- Short-term interest: In the last 30 minutes or 1 day, user A clicked many types of laptops, with more clicks on Apple laptops → user A might want to buy a laptop and might prefer Apple
- Long-term interest: Considering user A’s orders in the last year, most of the orders are electronics and books → this user might have a long-term interest in electronics and books
- Evolution of interest: This user’s clicks on laptops decreased every week over the last 4 weeks → this user might have lost interest in laptops or bought one somewhere else
How to Represent User-Item Interaction and Users’ Interests
To help the model learn efficiently, we first need to represent these data.
User-Item Interaction Data
Normally, in user-item interaction data, instead of the item_id alone, every item can be represented with:
- Static info of the item, such as the item name, the category/sub-category the item belongs to, and so on
- The type and timestamp of the user-item interaction
- An item embedding learnt with embedding techniques
In this case, each item_id in the user-item interaction data is mapped to the combination of the above three categories of data.
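To make this concrete, here is a minimal sketch, assuming PyTorch; the feature set, vocabulary sizes, and the InteractedItemEncoder class are hypothetical, not from any specific paper. It maps one interacted item to the concatenation of a static-feature embedding, interaction-type/recency features, and a learned item embedding:

```python
import torch
import torch.nn as nn

class InteractedItemEncoder(nn.Module):
    """Maps one user-item interaction to a dense vector:
    [static item features | interaction features | learned item embedding]."""
    def __init__(self, n_items=100_000, n_categories=500, n_action_types=4, dim=16):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, dim)            # learned item embedding
        self.category_emb = nn.Embedding(n_categories, dim)   # static info: category
        self.action_emb = nn.Embedding(n_action_types, dim)   # click / cart / booking ...

    def forward(self, item_id, category_id, action_type, hours_since_action):
        # hours_since_action is a continuous recency feature (log-scaled)
        recency = torch.log1p(hours_since_action).unsqueeze(-1)
        return torch.cat([
            self.category_emb(category_id),   # static item info
            self.action_emb(action_type),     # interaction type
            recency,                          # interaction timestamp feature
            self.item_emb(item_id),           # learned embedding
        ], dim=-1)

# one historical interaction: item 42, category 7, action "click" (=0), 3 hours ago
enc = InteractedItemEncoder()
vec = enc(torch.tensor([42]), torch.tensor([7]), torch.tensor([0]), torch.tensor([3.0]))
print(vec.shape)  # torch.Size([1, 49]) -> 16 + 16 + 1 + 16
```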
Users’ Interests
Most of the time, a user embedding is extracted from user-item interaction data to represent the user’s short-term and long-term interests. It’s also an option to use several user embeddings to represent a user’s diverse interests.
Preliminary: from Pooling to Transformer
Since user-item interaction data is usually in the form of sequential data, techniques from NLP are often borrowed. Let’s first check some preliminary knowledge.
We’ll only review the knowledge necessary to understand the next section; for a detailed explanation, I recommend Chapter 11, Attention Mechanisms and Transformers, from the book Dive into Deep Learning.
Pooling & Attention
In computer vision and natural language processing, pooling and attention share the same high-level goal of capturing important information.
Pooling
The most important function of pooling is to reduce the spatial dimensions of the feature map, thereby reducing computational complexity and preventing overfitting. There are different types of pooling mechanisms:
- Min pooling
- Max pooling
- Average pooling
- Weighted pooling
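As a quick illustration, here is a minimal sketch (assuming PyTorch) of max, average, and weighted pooling applied to a sequence of item embeddings; the recency-based weights are just an example:

```python
import torch

# 5 historical items, each represented by an 8-dim embedding
item_embs = torch.randn(5, 8)

max_pooled = item_embs.max(dim=0).values   # keep the strongest signal per dimension
avg_pooled = item_embs.mean(dim=0)         # treat every item equally

# weighted pooling: e.g. more recent items get larger weights
weights = torch.softmax(torch.tensor([0.1, 0.2, 0.5, 1.0, 2.0]), dim=0)
weighted_pooled = (weights.unsqueeze(-1) * item_embs).sum(dim=0)

print(max_pooled.shape, avg_pooled.shape, weighted_pooled.shape)  # all torch.Size([8])
```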
Attention
Attention is a mechanism that allows a neural network to focus on the most relevant parts of the input data when making predictions.
Given a query $\mathbf{q}$ and $m$ key-value pairs $(\mathbf{k}_1, \mathbf{v}_1), \ldots, (\mathbf{k}_m, \mathbf{v}_m)$, attention is defined as

$$\mathrm{Attention}(\mathbf{q}) = \sum_{i=1}^{m} \mathrm{softmax}\big(a(\mathbf{q}, \mathbf{k}_i)\big)\,\mathbf{v}_i$$

where softmax is used to get the normalized, non-negative weights and $a(\mathbf{q}, \mathbf{k}_i)$ is the attention scoring function. There are mainly two types of attention scoring functions:
- Dot product attention (or scaled dot product attention)
- Additive attention
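Here is a minimal sketch (assuming PyTorch) of the scaled dot-product variant, following the definition above:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """q: (d,), k: (m, d), v: (m, d_v) -> weighted sum of values."""
    scores = k @ q / math.sqrt(q.shape[-1])   # scoring function a(q, k_i)
    weights = torch.softmax(scores, dim=-1)   # normalized, non-negative weights
    return weights @ v, weights

q = torch.randn(8)        # query
k = torch.randn(5, 8)     # 5 keys
v = torch.randn(5, 8)     # 5 values
out, w = scaled_dot_product_attention(q, k, v)
print(out.shape, w.sum())  # torch.Size([8]), weights sum to 1
```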
Sequence-to-Sequence
For the task of machine translation, it’s common to adopt an encoder-decoder architecture for sequence-to-sequence learning based on two RNNs.
- The RNN encoder generates a fixed-shape context variable and feeds it to the decoder.
- The decoder generates the output based on this fixed-shape context variable and the previously generated tokens.
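A minimal sketch of the encoder side (assuming PyTorch): a GRU reads the source sequence, and its final hidden state serves as the fixed-shape context variable passed to the decoder:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 1000, 32, 64
embedding = nn.Embedding(vocab_size, emb_dim)
encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)

src = torch.randint(0, vocab_size, (1, 12))     # one source sentence of 12 tokens
enc_outputs, h_n = encoder(embedding(src))      # enc_outputs: (1, 12, 64)
context = h_n[-1]                               # fixed-shape context variable: (1, 64)
# the decoder (another GRU) would be initialized with / conditioned on `context`
print(context.shape)
```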
Sequence-to-Sequence with Attention
In the sequence-to-sequence model, the encoder essentially summarizes the information from the input sequence into a fixed-shape context variable. However, when coping with long sentences, there might not be enough space in this context variable (hidden state) to hold all the important information. Consequently, the decoder may fail to translate long and complex sentences.
To solve this issue, the attention mechanism is introduced into the sequence-to-sequence model. The key idea is: instead of a fixed-shape context variable, all hidden states from the encoder are fed to the decoder. At each decoding step, an attention mechanism is applied to these hidden states to aggregate and extract the relevant information.
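And a minimal sketch of the key idea: at each decoding step, the decoder’s current hidden state acts as the query, and a step-specific context vector is computed as an attention-weighted sum of all encoder hidden states (dot-product scoring is assumed here for simplicity):

```python
import math
import torch

enc_outputs = torch.randn(12, 64)   # all encoder hidden states (12 source positions)
dec_hidden = torch.randn(64)        # decoder hidden state at the current step (query)

scores = enc_outputs @ dec_hidden / math.sqrt(64)   # relevance of each source position
weights = torch.softmax(scores, dim=-1)
context_t = weights @ enc_outputs                   # step-specific context vector, (64,)
# context_t is then combined with the decoder input/state to predict the next token
```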
Transformer
The Transformer was originally invented for machine translation, and it has an encoder-decoder structure, the same as the sequence-to-sequence model. The main changes are:
- It completely removes the structures commonly used in sequential models: recurrent layers and convolutional layers.
- Instead, positional embeddings and multi-head self-attention are used to capture the sequential property and the dependencies among the input data. Below is the structure of the Transformer.
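A minimal sketch (assuming PyTorch) of the two ingredients that replace recurrence: positional embeddings added to the input, and multi-head self-attention where the sequence attends to itself:

```python
import torch
import torch.nn as nn

seq_len, dim, n_heads = 10, 64, 4
x = torch.randn(1, seq_len, dim)                 # input token/item embeddings

pos_emb = nn.Embedding(seq_len, dim)             # learned positional embeddings
positions = torch.arange(seq_len).unsqueeze(0)   # (1, seq_len)
x = x + pos_emb(positions)                       # inject order information

self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
out, attn_weights = self_attn(x, x, x)           # query = key = value = x
print(out.shape, attn_weights.shape)             # (1, 10, 64), (1, 10, 10)
```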
Encoder-only Transformer, Decoder-only Transformer and Encoder-decoder Transformer
In reality, the Transformer can be used in different forms. Take large-scale pre-trained language models as an example:
- Encoder-only Transformer: BERT
- Encoder-decoder Transformer: T5
- Decoder-only Transformer: GPT
Overall, each type of Transformer has its own strengths:
- The encoder-only Transformer is primarily designed to understand the input sequence, and it can only generate fixed-length representations. It excels at tasks like text classification, token prediction, and sequence labelling.
- The decoder-only Transformer is designed to generate text in an autoregressive manner, which allows it to produce coherent and contextually relevant output sequences. This makes decoder-only Transformers, such as GPT, well suited for tasks like open-ended text generation, dialogue systems, and language modelling.
The Evolution of User Interest Modeling Techniques
Before Deep Learning: Feature Engineering
Before deep learning, to learn users’ interests, features were created from historical or real-time interactions by counting the number of various types of interactions, for example:
- Long-term interest: number of bookings of a product or product category
- Short-term interest: number of clicks in the last 10 minutes or within the same session
- Evolution of interests: number of clicks/bookings/time spent on a product or product category in the last 1 day/7 days/1 month/3 months/1 year
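As an illustration, here is a minimal sketch (assuming pandas and a hypothetical interaction log) of how such count features could be computed:

```python
import pandas as pd

# hypothetical interaction log: one row per user-item interaction
logs = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2],
    "category": ["laptop", "laptop", "book", "phone", "phone"],
    "action":   ["click", "booking", "click", "click", "click"],
    "ts": pd.to_datetime(["2024-05-01", "2024-05-20", "2024-05-21",
                          "2024-05-21", "2024-05-22"]),
})

now = pd.Timestamp("2024-05-22")

# long-term interest: number of bookings per user and category
long_term = (logs[logs.action == "booking"]
             .groupby(["user_id", "category"]).size()
             .rename("n_bookings"))

# shorter-term interest: number of clicks in the last 7 days
recent = logs[(logs.action == "click") & (logs.ts >= now - pd.Timedelta(days=7))]
short_term = recent.groupby(["user_id", "category"]).size().rename("n_clicks_7d")

print(long_term, short_term, sep="\n")
```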
Concatenation & Pooling → Aggregating All Information
As mentioned before, since deep learning became popular, items can be represented as embeddings. We can then feed the embeddings of clicked items directly to the model and let the model learn the patterns itself: the item embeddings are concatenated and fed into the model.
However, most of the time there are so many user-item interactions that, to accelerate training, it’s better to extract and keep only the most important information with pooling. Besides sum/average/min pooling, weighted-sum pooling is also an option, where the weights are determined by when each user-item interaction happened.
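A minimal sketch (assuming PyTorch, with a made-up exponential time decay) of weighted-sum pooling over a user’s clicked-item embeddings, whose result is then concatenated with other features and fed into the model:

```python
import torch

item_embs = torch.randn(20, 32)    # embeddings of 20 clicked items
hours_ago = torch.rand(20) * 72    # how long ago each click happened

# exponential time decay: recent clicks get larger weights
weights = torch.softmax(-hours_ago / 24.0, dim=0)
user_interest = (weights.unsqueeze(-1) * item_embs).sum(dim=0)   # (32,)

other_features = torch.randn(16)   # e.g. static user features
model_input = torch.cat([user_interest, other_features])         # fed into the MLP
print(model_input.shape)                                         # torch.Size([48])
```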
Below are two real-world examples:
Target Attention → Focusing on Relevant Information
With the pooling and concatenation approaches mentioned before, all items are treated equally most of the time, but actually, as mentioned above, they contribute differently to decision making.
To solve this issue, target attention is introduced: attention weights are calculated based on the relevance between the candidate item and each historical item, then the item embeddings are aggregated based on these attention weights.
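A minimal sketch of target attention in this spirit (DIN itself uses a small MLP as the attention unit; a simplified dot-product score is used here instead):

```python
import math
import torch

hist_embs = torch.randn(20, 32)    # embeddings of 20 historical items
candidate = torch.randn(32)        # embedding of the candidate item

# relevance of each historical item w.r.t. the candidate item
scores = hist_embs @ candidate / math.sqrt(32)
weights = torch.softmax(scores, dim=0)

# user interest w.r.t. THIS candidate: weighted sum of historical item embeddings
user_interest = weights @ hist_embs   # (32,)
# concatenated with the candidate embedding and other features for CTR prediction
```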
Sequential Model with Attention → Extracting Interest Evolution
Although, compared to concatenation and pooling, the attention mechanism captures users’ interests better, sequential information is still not considered. So sequential models with attention are introduced:
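A minimal sketch of the combination (real models such as DIEN add an attention-updated GRU and an auxiliary loss; only the GRU-plus-target-attention skeleton is kept here): a GRU runs over the behavior sequence to model interest evolution, then target attention is applied to its hidden states:

```python
import math
import torch
import torch.nn as nn

emb_dim, hidden_dim, seq_len = 32, 64, 20
gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)

behavior_embs = torch.randn(1, seq_len, emb_dim)   # one user's behavior sequence
candidate = torch.randn(hidden_dim)                # candidate item in the hidden space

hidden_states, _ = gru(behavior_embs)              # (1, seq_len, hidden_dim)
scores = hidden_states[0] @ candidate / math.sqrt(hidden_dim)
weights = torch.softmax(scores, dim=0)
interest = weights @ hidden_states[0]              # evolution-aware interest, (64,)
```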
Transformer → Learning Users’ Interests More Efficiently
Compared to sequential models with attention, the Transformer can achieve the same goal but more efficiently, since self-attention processes the whole sequence in parallel instead of step by step.
As mentioned above, the Transformer can be used in different ways. Since the purpose of user interest modeling is mainly to extract important information from user-item interaction data, the encoder-only Transformer is used most often. However, depending on the use case, the encoder-decoder Transformer and adapted versions of the encoder-only Transformer are also adopted:
- Most of the time, we expect the model to consider the relationship between the candidate item and the historical items. There are two options:
- An adapted encoder-only Transformer
- Target attention from models such as DIN
- Due to the computation cost, there are use cases where users’ interests should be extracted offline and then used online. In this case, no candidate-item information is available while extracting user interests, and the standard encoder-only Transformer is a good option, as sketched below:
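A minimal sketch (assuming PyTorch) of the second option: a standard encoder-only Transformer runs over the behavior sequence offline, without any candidate-item information, and its outputs are pooled into a single user embedding that can be stored and reused online:

```python
import torch
import torch.nn as nn

dim, seq_len = 64, 50
encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

behavior_embs = torch.randn(1, seq_len, dim)   # item embeddings + positional embeddings
encoded = encoder(behavior_embs)               # (1, seq_len, dim)
user_embedding = encoded.mean(dim=1)           # pooled offline user embedding, (1, dim)
# user_embedding is stored and simply looked up at serving time
```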
Diverse Interests Modeling
The methods explained before mostly create a single embedding to represent a user’s interests. However, as mentioned above, users have diverse interests, and one embedding might not be enough to represent all of them. So a new idea was proposed: represent a user’s interests with multiple embeddings.
Techniques such as the capsule routing mechanism and self-attention are adopted to achieve this goal:
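A minimal sketch of the self-attention route (the capsule-routing variant is not shown): K learnable attention heads each produce one interest embedding from the same behavior sequence. The class and dimensions are hypothetical:

```python
import torch
import torch.nn as nn

class MultiInterestExtractor(nn.Module):
    """Extracts K interest embeddings from one behavior sequence via self-attention."""
    def __init__(self, dim=32, hidden=64, n_interests=4):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, n_interests)
        )

    def forward(self, behavior_embs):            # (seq_len, dim)
        scores = self.attn(behavior_embs)        # (seq_len, K): one column per interest
        weights = torch.softmax(scores, dim=0)   # normalize over the sequence
        return weights.T @ behavior_embs         # (K, dim): K interest embeddings

extractor = MultiInterestExtractor()
interests = extractor(torch.randn(20, 32))
print(interests.shape)                           # torch.Size([4, 32])
```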
Life-long User Interest Modeling
More User Behavior, Better Model
In many scenarios, there are plenty of user actions. For example, in e-commerce, there can be hundreds of user actions each day.
The data below, from Alibaba, shows statistics of sequential user behavior data and the corresponding model performance.
Challenges
However, there are challenges in modelling long user behavior sequences:
- System Latency and Storage Cost: The system latency and storage cost increase approximately linearly with the length of the user behavior sequence, making it difficult to handle very long sequences.
- Noise in Long Histories: There is a lot of noise and irrelevant information in extremely long user behavior histories, which can degrade the performance of sequential user interest modeling.
Solutions
To address these challenges, two types of solutions have been developed:
- More powerful models with more efficient architectures: MIMN from Alibaba
- Retrieval-based solutions: retrieve the most relevant historical behaviors first, then apply learning algorithms
- Two-stage user behavior retrieval: SIM from Alibaba, SDIM from Meituan
- End-to-end user behavior retrieval: ETA from Alibaba
In terms of the first type of solution, an example is the Multi-channel user Interest Memory Network (MIMN) from Alibaba, where a memory-based architecture is utilized to capture user interests. MIMN is the first industrial solution that can model sequential user behavior data with length scaling up to 1,000.
For the retrieval-based solutions, examples are the Search-based Interest Model (SIM) from Alibaba and Sampling-based Deep Interest Modeling (SDIM) from Meituan. The main difference is how the most important items are retrieved in the first stage. SIM can model sequential user behavior data with length scaling up to 54,000!
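As an illustration of the two-stage idea, here is a minimal sketch in the spirit of SIM’s hard-search variant, where the first stage retrieves only the behaviors sharing the candidate item’s category, and target attention is then applied to the much shorter sub-sequence; the numbers are made up:

```python
import math
import torch

# a very long behavior history: 10,000 (item embedding, category_id) pairs
hist_embs = torch.randn(10_000, 32)
hist_cats = torch.randint(0, 500, (10_000,))

candidate_emb = torch.randn(32)
candidate_cat = 42

# stage 1 (hard search): keep only behaviors in the candidate's category
mask = hist_cats == candidate_cat
sub_embs = hist_embs[mask]                 # typically a few dozen items

# stage 2: target attention on the retrieved sub-sequence
scores = sub_embs @ candidate_emb / math.sqrt(32)
weights = torch.softmax(scores, dim=0)
long_term_interest = weights @ sub_embs    # (32,)
```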
Recently, it has been found that two-stage user behavior retrieval suffers from divergent targets and outdated embeddings. End-to-end learning is a better solution, and End-to-end Target Attention (ETA) from Alibaba is a good example.
Summary
To wrap up,
- A lot of information is encoded in user-item interaction data: users’ short-term and long-term interests, diverse interests, and the evolution of interests.
- Because user-item interaction data is sequential data, many techniques are borrowed from NLP. Currently, just as in NLP, the Transformer is the most popular technique for learning users’ interests.
- In industry, the trend is to include more user-item interaction data to learn more aspects of users’ interests.