User Interest Modeling: Part 2 - Case Studies from Alibaba and Pinterest

Tags
Data Science
Date Published
June 30, 2024
💡
Let’s check some real-world industry practices from Alibaba and Pinterest!
Blogs from the series User Interest Modelling

Alibaba - User Interest Modeling in Pre-ranking and Ranking

Preliminary

Related Works from Alibaba

User interest modeling is one of the most important research domains at Alibaba, and many influential works such as DIN and DIEN come from there. Overall, these works mainly come from two teams:

  1. Display advertising team:
    1. 2017, Deep Interest Network (DIN)
    2. 2018, Deep Interest Evolution Network (DIEN)
    3. 2019, Multi-channel user Interest Memory Network (MIMN)
    4. 2020, Search-based Interest Model (SIM)
  2. Recommendation team:
    1. 2019, Behavior Sequential Transformer (BST)
    2. 2019, Multi-Interest Network with Dynamic Routing (MIND)
    3. 2019, Deep Session Interest Network (DSIN)
    4. 2020, Controllable Multi-Interest Framework for Recommendation (ComiRec)
    5. 2021, End-to-end Target Attention (ETA)

Today, we’ll focus on the recommendation system at Taobao.com, the online shopping website owned by Alibaba Group.

Recommendation System at Taobao.com

Instead of a two-stage recommender, a four-stage recommender is used at Taobao.


Today, we’ll mainly focus on the user interest modeling techniques in the pre-ranking and ranking stages.

User Interest Modeling in Pre-ranking

In the pre-ranking stage, there are mainly two options for the model architecture:

  1. Two-tower architecture, where the item tower is computed offline and the user tower is computed in real time
  2. Three-tower architecture, where, in addition to the standard item and user towers of the two-tower architecture, a user-item interaction tower is computed offline

At Taobao.com, the three-tower architecture is adopted:

  1. Long-term interest is learned in the user-item interaction tower, based on users’ click data over the last two years
    1. Only relevant user behaviors are used: based on the category of the target item, only user behaviors from the same category are kept. Overall, there are 100+ categories
    2. At most 10K user behaviors can be included
  2. Short-term interest is learned in the user tower, based on users’ real-time behavior sequence

In addition, to reduce computation cost, the information of the target (candidate) item can’t be used while learning users’ long- and short-term interests, so self-attention is used instead of target attention.
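
To make the trade-off concrete, below is a minimal PyTorch sketch (the module names, dimensions, and mean-pooling choice are my assumptions, not Taobao’s implementation) contrasting self-attention pooling, which ignores the candidate item and can therefore be precomputed and cached, with target attention, which uses the candidate item as the query and must run at scoring time.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionPooling(nn.Module):
    """Pre-ranking style: summarize the behavior sequence WITHOUT the candidate item,
    so the resulting user interest vector can be computed once and cached."""
    def __init__(self, dim, num_heads=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, behavior_seq):            # behavior_seq: (B, T, D)
        out, _ = self.attn(behavior_seq, behavior_seq, behavior_seq)
        return out.mean(dim=1)                  # (B, D) candidate-independent user vector

class TargetAttentionPooling(nn.Module):
    """Ranking style: the candidate (target) item is the query, so the summary
    depends on the item being scored and must be recomputed per candidate."""
    def __init__(self, dim):
        super().__init__()
        self.scale = dim ** 0.5

    def forward(self, candidate, behavior_seq): # candidate: (B, D), behavior_seq: (B, T, D)
        scores = torch.einsum("bd,btd->bt", candidate, behavior_seq) / self.scale
        weights = F.softmax(scores, dim=-1)     # relevance of each behavior to the target item
        return torch.einsum("bt,btd->bd", weights, behavior_seq)
```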


User Interest Modeling in Ranking

In the ranking stage, there are fewer items, more computation resources are available, and the requirement on prediction accuracy is much higher. In this case:

  1. While learning both long-term and short-term interests, the information of the target (candidate) item is involved earlier, and target attention is used
  2. For long-term interest, to achieve more accurate results, it’s better to
    1. Include as much of users’ historical behavior data as possible
    2. Learn these patterns end-to-end in a unified model

In the ranking model, to reduce the computational complexity of learning from the long-term sequence, a similarity function is learned to select the most relevant historical behaviors.
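
The exact similarity function is not detailed here, so the following is only a hedged sketch of the general idea: learn projections of the target item and each historical behavior, score their similarity, and keep only the top-K behaviors before running target attention. All names, dimensions, and the value of K are assumptions.

```python
import torch
import torch.nn as nn

class TopKBehaviorSelector(nn.Module):
    """Select the K historical behaviors most relevant to the target item,
    shrinking a very long sequence (e.g. thousands of clicks) before attention."""
    def __init__(self, dim, k=50):
        super().__init__()
        self.k = k
        self.query_proj = nn.Linear(dim, dim)   # projects the target item
        self.key_proj = nn.Linear(dim, dim)     # projects each historical behavior

    def forward(self, target_item, long_seq):   # target_item: (B, D), long_seq: (B, L, D)
        q = self.query_proj(target_item)                        # (B, D)
        k = self.key_proj(long_seq)                             # (B, L, D)
        sim = torch.einsum("bd,bld->bl", q, k)                  # (B, L) learned similarity
        _, idx = sim.topk(self.k, dim=-1)                       # top-K most relevant behaviors
        idx = idx.unsqueeze(-1).expand(-1, -1, long_seq.size(-1))
        return long_seq.gather(1, idx)                          # (B, K, D) reduced sequence
```

The reduced (B, K, D) sequence can then go through the same kind of target-attention module used for the short-term behaviors, at a fraction of the original cost.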


Pinterest - User Interest Modeling in Ranking

Preliminary: How to Represent Pins?

Pins are the visual items at Pinterest. Each pin in a user’s sequence is represented by combining the data below (a sketch follows this list):

  1. PinSage embedding, which is an aggregation of visual, text, and engagement information for a Pin
  2. Action type, which is mapped to a trainable embedding
  3. Timestamp:
    1. the time since the latest action a user has taken
    2. the time gap between actions
  4. Duration: the logarithm of the duration is used
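
As a rough illustration of how these pieces could be combined into one feature vector per engaged pin, here is a hedged PyTorch sketch; the dimensions, the action-type vocabulary size, and the exact time transformations are assumptions rather than Pinterest’s implementation.

```python
import torch
import torch.nn as nn

class PinActionFeaturizer(nn.Module):
    """Concatenate PinSage embedding, action-type embedding, time features,
    and log-duration into a single vector per engaged pin."""
    def __init__(self, num_action_types=20, action_dim=16):
        super().__init__()
        self.action_emb = nn.Embedding(num_action_types, action_dim)  # trainable action-type embedding

    def forward(self, pinsage_emb, action_type, action_ts, prev_action_ts, request_ts, duration):
        # pinsage_emb:    (B, T, 256) pretrained PinSage embeddings
        # action_type:    (B, T) integer action ids
        # action_ts:      (B, T) timestamp of each action (seconds)
        # prev_action_ts: (B, T) timestamp of the previous action
        # request_ts:     (B, 1) time of the current request
        # duration:       (B, T) dwell time in seconds
        time_since_action = torch.log1p((request_ts - action_ts).clamp(min=0).float()).unsqueeze(-1)
        time_gap = torch.log1p((action_ts - prev_action_ts).clamp(min=0).float()).unsqueeze(-1)
        log_duration = torch.log1p(duration.clamp(min=0).float()).unsqueeze(-1)
        action = self.action_emb(action_type)                         # (B, T, 16)
        return torch.cat([pinsage_emb, action, time_since_action, time_gap, log_duration], dim=-1)
```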

Realtime-batch Hybrid Ranking Model: Pinnability


Beyond the overall architecture, a few notes on how the two interest modules are trained (a sketch of how their outputs are combined follows this list):

  1. PinnerFormer is trained separately, and its outputs, namely user embeddings, are updated incrementally: embeddings are only recomputed for users who have engaged on Pinterest since the last update
  2. The short-term interest module, TransAct, is trained together with Pinnability and is retrained twice a week to keep online metrics stable
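
Here is a hedged sketch of the hybrid idea, under my own naming and dimensions (the real Pinnability model is a much richer multi-task ranker): the long-term user embedding is precomputed offline and simply looked up at serving time, while the short-term feature is produced in real time from recent actions; both are combined with other ranking features.

```python
import torch
import torch.nn as nn

class HybridRankingSketch(nn.Module):
    """Combine a batch-computed long-term user embedding with a real-time
    short-term interest feature and other ranking features."""
    def __init__(self, long_dim, short_dim, other_dim, hidden_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(long_dim + short_dim + other_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, long_term_user_emb, short_term_feat, other_feats):
        # long_term_user_emb: (B, long_dim)  precomputed offline, refreshed incrementally
        # short_term_feat:    (B, short_dim) computed at request time from recent actions
        # other_feats:        (B, other_dim) remaining ranking features
        x = torch.cat([long_term_user_emb, short_term_feat, other_feats], dim=-1)
        return self.mlp(x).squeeze(-1)        # engagement logit
```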

Long Term Interest Module: PinnerFormer

As mentioned above, in Pinnability, the user embedding learned offline by PinnerFormer is used to represent users’ long-term interest.

An approach similar to embedding-based candidate generation is adopted to learn these user embeddings (a sketch of the loss follows this list):

  1. Label: instead of a next-item prediction task, the goal is to predict users’ positive engagements over the next 14 days. The underlying assumption is that the actions a user takes over 14 days sufficiently represent the user’s long-term interests
    1. Positive engagements refer to actions such as repins, closeups, and clicks
  2. Negative sampling strategy: a mix of two kinds of negatives is used:
    1. Hard negatives: in-batch negatives
    2. Soft negatives: random negatives
  3. Loss function: sampled softmax with a logQ correction
Training of PinnerFormer
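
Here is a hedged sketch of what a sampled softmax with logQ correction can look like; the variable names and the assumption that each negative’s log sampling probability (neg_log_q) is available are mine, not the paper’s exact formulation.

```python
import torch
import torch.nn.functional as F

def sampled_softmax_logq_loss(user_emb, pos_item_emb, neg_item_emb, neg_log_q):
    """
    user_emb:     (B, D) user embeddings from the sequence encoder
    pos_item_emb: (B, D) embedding of the positively engaged pin
    neg_item_emb: (N, D) sampled negatives (in-batch and/or random)
    neg_log_q:    (N,)   log probability of sampling each negative; subtracting it
                         from the logit (logQ correction) keeps popular items,
                         which are sampled more often, from being over-penalized.
    """
    pos_logit = (user_emb * pos_item_emb).sum(dim=-1, keepdim=True)      # (B, 1)
    neg_logit = user_emb @ neg_item_emb.t() - neg_log_q.unsqueeze(0)     # (B, N) corrected logits
    logits = torch.cat([pos_logit, neg_logit], dim=-1)                   # positive is column 0
    labels = torch.zeros(user_emb.size(0), dtype=torch.long, device=user_emb.device)
    return F.cross_entropy(logits, labels)
```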

Short Term Interest Module: TransAct

TransAct Module

Compared to a standard encoder-only transformer, TransAct makes several adaptations (a combined sketch follows this list):

  1. Positional encoding is not used, because offline experiments showed that position information is ineffective
  2. Self-attention functions as target attention
    1. To learn the relationship between the candidate item and user-action features, the candidate embedding is concatenated with every user action
  3. Random time window mask:
    1. The model became too responsive to recent user actions (e.g., after a user engages with a Pin from the food category, the next refresh of the Homefeed page is filled with more food Pins), which is not a desired user experience.
    2. To solve the above issue, a random time window mask is applied to regularize model training and reduce overfitting
  4. Transformer output compression:
    1. The transformer output is fed to DCNv2 for feature crossing, which might result in excessive time complexity, so it’s a good option to compress the transformer output
    2. Only the first K columns are taken and concatenated with the max pooling of all transformer outputs
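
The sketch below combines the adaptations above in PyTorch: the candidate embedding concatenated to every action (so self-attention effectively acts as target attention), a random time-window mask applied only during training, and output compression that keeps the first K output columns plus a max pooling over all positions. The layer sizes, the 24-hour masking window, and K are assumptions, not the published configuration.

```python
import random
import torch
import torch.nn as nn

class TransActSketch(nn.Module):
    def __init__(self, action_dim, cand_dim, hidden_dim=64, num_heads=2, num_layers=2, k=8):
        super().__init__()
        self.k = k
        self.input_proj = nn.Linear(action_dim + cand_dim, hidden_dim)
        layer = nn.TransformerEncoderLayer(hidden_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)     # note: no positional encoding

    def forward(self, action_seq, action_age, candidate):
        # action_seq: (B, T, action_dim) featurized user actions
        # action_age: (B, T) seconds elapsed since each action
        # candidate:  (B, cand_dim) embedding of the item being scored
        cand = candidate.unsqueeze(1).expand(-1, action_seq.size(1), -1)
        x = self.input_proj(torch.cat([action_seq, cand], dim=-1))  # candidate concatenated to every action

        mask = None
        if self.training:
            # Random time-window mask: hide actions newer than a random cutoff so the
            # model does not over-react to the very latest engagements.
            cutoff = random.uniform(0, 24 * 3600)
            mask = action_age < cutoff                               # True = position is masked out

        out = self.encoder(x, src_key_padding_mask=mask)             # (B, T, hidden_dim)
        first_k = out[:, : self.k, :].flatten(1)                     # first K output columns
        pooled = out.max(dim=1).values                               # max pooling over all positions
        return torch.cat([first_k, pooled], dim=-1)                  # compact feature fed to DCNv2
```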

Summary

From these industry practices, we can see that:

  1. The transformer is the main technique for modeling users’ interests, but depending on business scenarios and constraints, different variations are adopted
  2. It’s common practice to learn both users’ long-term and short-term interests
  3. Different techniques are adopted to balance computation cost and prediction accuracy
    1. Self-attention vs. target attention
    2. Offline pre-computation vs. real-time computation
  4. There is a trend toward learning from more user behaviors, but different techniques are needed to keep the learning tractable
    1. Retrieve the top-K relevant behaviors first, then apply the transformer
    2. Compress the output of the transformer

Reference