Embedding in Recommender - Part 2: How is Embedding Combined with Recommender

Tags: Data Science
Date Published: April 14, 2024
In part one, we gave a high-level overview of embedding techniques. To easily follow the industry practices in future blogs, we also need a basic idea of how embedding techniques are combined with recommender systems.
Blogs in the Embedding in Recommender series

Recommender System in Real-world Scenarios

In real-world scenarios, one of the main challenges is the huge number of items and the complexity of the business. As a result, most of the time, the solution is a multi-stage recommender that combines both machine learning models and business rules.

The two-stage recommender is the most classic setup (the plot below also includes re-ranking).

image

Below are some industry practices:

Two-stage recommender from YouTube: candidate generation → ranking
Four-stage recommender from Nvidia: retrieval → filtering → scoring → ordering
Two-stage recommender from JD: candidate retrieval → ranking
Two-stage recommender from Facebook: retrieval → ranking

Objectives & Requirements of Each Stage

| Stage | Objectives | Requirements | Methods |
| --- | --- | --- | --- |
| Candidate generation / match / retrieval / recall | Remove irrelevant items and reduce the number of items | 1. Fast inference → simpler models, fewer features 2. No need to be very accurate 3. Retrieve from different sources | Multi-strategy candidate generation: retrieve items with different methods such as rule-based, content-based, and collaborative-filtering-based models |
| Ranking / scoring | Rank items as accurately as possible | More accuracy → complex models, more features | Mainly dominated by machine learning models |
| Re-ranking | Meet specific business constraints or improve item diversity | — | Most of the time → simple business rules |
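To make the staged flow concrete, here is a minimal Python sketch of a multi-stage recommender. Every stage function below is a toy stand-in, not a real model:

```python
# Minimal sketch of a multi-stage recommender; every stage is a toy stand-in.

def rule_based(user, items):
    return {i for i in items if i % 2 == 0}        # e.g. "currently trending"

def content_based(user, items):
    return {i for i in items if i < 5}             # e.g. similar to past clicks

def collaborative(user, items):
    return {i for i in items if i % 3 == 0}        # e.g. liked by similar users

def ranking_model(user, item):
    return -abs(item - user)                       # toy score, higher is better

def rerank(scored):
    return scored                                  # hook for business rules / diversity

def recommend(user, all_items, n=10):
    # Stage 1: candidate generation -- cheap retrieval from multiple sources
    candidates = set()
    for retrieve in (rule_based, content_based, collaborative):
        candidates |= retrieve(user, all_items)
    # Stage 2: ranking -- a heavier model scores the reduced candidate set
    scored = [(item, ranking_model(user, item)) for item in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Stage 3: re-ranking -- apply business constraints, then cut to top n
    return [item for item, _ in rerank(scored)[:n]]
```

Note how the cheap, high-recall stage shrinks the item set before the expensive model runs — that is the whole point of the multi-stage design.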

Two Ways to Combine Embedding with Recommender

There are two ways to combine embeddings with a recommender, and both can be applied in the candidate generation and ranking stages:

  1. Pre-trained embedding
  2. Embedding layer in the model

Pre-trained Embedding

As the name indicates, the embedding is trained separately before model training, and these embeddings are concatenated with other features as the input of the model.

In part 1 we introduced several methods to obtain embeddings: Item2Vec, RandomWalk, Node2Vec, and GraphSAGE. All of these embeddings can be used as pre-trained embeddings to help downstream tasks.

Here are some examples:

Candidate generation and ranking models from YouTube, where video embeddings, search token embeddings, and language embeddings are pre-trained
Factorization Neural Network: pre-trained embeddings from a Factorization Machine

Actually, the methods for obtaining pre-trained embeddings are not limited to those discussed in part 1. For example, in the work from Facebook, XGBoost is used to obtain pre-trained embeddings, which are actually the leaf nodes of each tree. This information is then fed into a linear classifier to predict CTR.

Model structure from Facebook
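A minimal sketch of this tree-based encoding idea, using scikit-learn's gradient boosting in place of XGBoost (the dataset and hyperparameters are made up): each sample is encoded by the leaf it lands in for every tree, and the one-hot-encoded leaf indices feed a linear classifier.

```python
# Sketch: trees as a feature encoder, linear model on top (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
gbdt.fit(X, y)

# apply() returns, for each sample, the index of the leaf reached in every tree
leaves = gbdt.apply(X)[:, :, 0]                 # shape: (n_samples, n_trees)
encoder = OneHotEncoder()
leaf_features = encoder.fit_transform(leaves)   # sparse "pre-trained embedding"

lr = LogisticRegression(max_iter=1000)
lr.fit(leaf_features, y)                        # linear classifier on leaf features
```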

Embedding Layer of Model

In this case, the main function of the embedding layer is to transform features from high-dimensional sparse representations to low-dimensional dense representations.

Not only user_id and item_id — we can actually learn embeddings for all categorical features. Even for continuous features, in order to improve generalization and reduce over-fitting, it's common practice to bin them into categorical features and then apply an embedding layer.

Below are some examples:

Since there are extensive categorical features in recommenders, embedding layers are among the most common components in recommender models
Matrix Factorization is actually a shallow neural network with embedding layers
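To illustrate the last point, Matrix Factorization can be written as a shallow network built from just embedding layers and a dot product. A PyTorch sketch (sizes are illustrative):

```python
import torch
import torch.nn as nn

class MatrixFactorization(nn.Module):
    """MF as a shallow network: one embedding layer per entity + dot product."""
    def __init__(self, n_users, n_items, dim=32):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)  # sparse id -> dense vector
        self.item_emb = nn.Embedding(n_items, dim)

    def forward(self, user_ids, item_ids):
        # predicted affinity = dot product of user and item embeddings
        return (self.user_emb(user_ids) * self.item_emb(item_ids)).sum(dim=1)

model = MatrixFactorization(n_users=1000, n_items=5000)
scores = model(torch.tensor([0, 1]), torch.tensor([10, 20]))  # shape: (2,)
```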

Pros and Cons

| | Pros | Cons |
| --- | --- | --- |
| Pre-trained embedding | 1. Accelerates training of downstream tasks 2. Improves the performance of downstream tasks | Since the embedding is not tailored for the downstream task, it may lead to suboptimal performance |
| Embedding layer | Learned specifically for the task → potentially better performance | 1. Due to the huge number of embedding parameters, it takes time to train 2. If training data is insufficient → potentially poor performance |

Generally speaking, for high-cardinality features such as user_id and item_id, it's always a good idea to pre-train them on extensive sequence or graph data. In addition, it's also a good option to initialize the embedding layer with pre-trained embeddings, then keep updating them while training the model.
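In PyTorch, for example, warm-starting an embedding layer from pre-trained vectors is a one-liner. The sketch below uses random vectors as a stand-in for real Item2Vec/Node2Vec output:

```python
import torch
import torch.nn as nn

pretrained = torch.randn(10_000, 64)  # stand-in for item vectors from Item2Vec etc.

# freeze=False lets the embeddings keep updating during model training
item_emb = nn.Embedding.from_pretrained(pretrained, freeze=False)

batch_item_ids = torch.tensor([1, 42, 999])
vectors = item_emb(batch_item_ids)    # shape: (3, 64)
```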

Embedding-based Candidate Generation

One important application of embeddings is embedding-based candidate generation. It's crucial to understand it before diving into the real-world practices in the following blogs.

First, Forget About Recommenders: What Are Vectors and Vector Spaces?

I believe most of us learned about vectors and vector spaces in a linear algebra course, and they have extensive applications in math and physics.

Basically, vectors are mathematical entities that have both magnitude and direction, and they can often be pictured as arrows pointing in a certain direction in a vector space.

Taking 2D vectors as an example:

image

There are several ways to measure the similarity between vectors, such as cosine similarity, dot product, and Euclidean distance. We'll leave the details and comparison of these similarity measures to future blogs; here we only discuss the basic idea.

Take cosine similarity as an example: it mainly considers the angle between vectors and discards their lengths. For two vectors A and B, the cosine similarity between them is:

$$\cos(A,B)=\frac{A\cdot B}{\parallel A \parallel \cdot \parallel B \parallel}$$

So for the 2D vectors above:

  1. cos(v1, v2) = 1, which means v1 and v2 point in the same direction
  2. cos(v1, v3) = 0.8
  3. cos(v1, v4) = -1, which means v1 and v4 point in opposite directions
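These numbers can be reproduced with a few lines of NumPy. The concrete vectors below (v1 = (1, 0), v2 = (2, 0), v3 = (4, 3), v4 = (-2, 0)) are illustrative choices that yield exactly those similarities:

```python
import numpy as np

def cosine(a, b):
    # angle-only similarity: dot product normalized by both lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v1, v2 = np.array([1, 0]), np.array([2, 0])
v3, v4 = np.array([4, 3]), np.array([-2, 0])

cosine(v1, v2)  # 1.0  -- same direction, length is ignored
cosine(v1, v3)  # 0.8
cosine(v1, v4)  # -1.0 -- opposite direction
```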

From Vector and Vector Space to Candidate Generation

Embedding-based candidate generation builds on this knowledge of vector spaces and vector similarity, and it can be classified into two main categories:

  1. Content-based filtering, which is also called Item2Item.
  2. Collaborative filtering, which includes User2Item and User2User2Item.

Content-based Filtering (Item2Item)

Basic Idea

The basic idea is to use similarity between items to recommend items similar to what the user likes. For example, if user A watches two cute cat videos, then the system can recommend cute animal videos to that user.

How to Get this Embedding ?

Basically, all of the methods introduced in part 1 apply: Item2Vec, RandomWalk, Node2Vec, and so on.

We can borrow an example from the recommender course. Assume there are two dimensions:

  1. X axis: whether the movie is for children (negative values) or adults (positive values)
  2. Y axis: the degree to which each movie is a blockbuster or an arthouse movie

Then all movies can be represented as the 2-dimensional embeddings below (let's ignore the users for now)

image

Detailed Steps of Content-based Filtering

  1. Find an item the user liked, for example the Harry Potter movie
  2. Calculate the similarity (cosine similarity, dot product, or whatever) between the embedding of this item and the embeddings of the rest of the items
  3. Sort items by similarity
  4. Take the top N items and recommend them
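The four steps above can be sketched in a few lines of NumPy; the 2-D movie embeddings here are toy values in the spirit of the children/adult and blockbuster/arthouse axes:

```python
import numpy as np

def top_n_similar(item_embeddings, liked_idx, n=2):
    """Indices of the n items most cosine-similar to the item the user liked."""
    normed = item_embeddings / np.linalg.norm(item_embeddings, axis=1, keepdims=True)
    sims = normed @ normed[liked_idx]   # cosine similarity to every item
    sims[liked_idx] = -np.inf           # exclude the liked item itself
    return np.argsort(sims)[::-1][:n]   # sort by similarity, take top n

# toy 2-D movie embeddings
movies = np.array([[ 1.0,  0.9],   # 0: the liked movie
                   [ 0.9,  1.0],   # 1: quite similar
                   [-1.0, -0.8],   # 2: very different
                   [ 1.0,  0.8]])  # 3: very similar

recs = top_n_similar(movies, liked_idx=0)   # indices of the 2 most similar movies
```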

Collaborative Filtering (User2Item and User2User2Item)

Basic Idea

The basic idea is to use similarity between users and items simultaneously to provide recommendations. For example, if user A is similar to user B, and user B likes video 1, then the system can recommend video 1 to user A (even if user A hasn't seen any videos similar to video 1).

How to Get this Embedding ?

At the beginning, you might find it hard to understand the "similarity" between a user and an item, and why there is similarity between these two different entities.

Well, it should really be the relevance between the user and the item, or the similarity between the user embedding and the item embedding. So if a user and an item are relevant (for example, the user has clicked/booked/liked the item), we try to make their embeddings as close as possible in the vector space.

For the example from the previous section, based on users' preferences, users can also be mapped to 2-dimensional embeddings. The embedding of user1 is close (similar) to that of the movie Shrek, so we can recommend Shrek to user1.

image

There are two main approaches to obtain these embeddings, and the key is to map users and items into the same vector space so that similarity is measurable.

The first one is the classic collaborative filtering method: Matrix Factorization. Its downside is that context information can't be used, which constrains its performance.

Matrix Factorization

The second one, which is also the most extensively adopted, is the two-tower neural network. It can include more features and has the advantage of fast inference speed.

Two-tower neural network.
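A minimal PyTorch sketch of the two-tower idea (feature dimensions and layer sizes are made up): each tower maps its own raw features into a shared vector space, and relevance is the dot product there.

```python
import torch
import torch.nn as nn

class TwoTower(nn.Module):
    """Two-tower sketch: map user features and item features into one vector space."""
    def __init__(self, user_feat_dim, item_feat_dim, dim=32):
        super().__init__()
        self.user_tower = nn.Sequential(
            nn.Linear(user_feat_dim, 64), nn.ReLU(), nn.Linear(64, dim))
        self.item_tower = nn.Sequential(
            nn.Linear(item_feat_dim, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, user_feats, item_feats):
        u = self.user_tower(user_feats)   # user + context features -> user embedding
        v = self.item_tower(item_feats)   # item features -> item embedding
        return (u * v).sum(dim=1)         # relevance = dot product in shared space

model = TwoTower(user_feat_dim=16, item_feat_dim=24)
score = model(torch.randn(4, 16), torch.randn(4, 24))  # shape: (4,)
```

Because the two towers only interact through the final dot product, item embeddings can be pre-computed offline, which is what makes inference fast.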

Detailed Steps of Collaborative Filtering

The steps are similar to those of content-based filtering. The main differences are in the first and second steps.

  1. Get the user information (and context information, for the two-tower neural network)
  2. Retrieve the user and item embeddings
    1. For matrix factorization, user and item embeddings are stored in a cache beforehand
    2. For the two-tower neural network, item embeddings are pre-calculated and cached; for the user embedding, we need to feed the user and context information into the user tower
  3. Calculate the similarity between the user embedding and the embeddings of all items
  4. Sort items by similarity
  5. Take the top N items and recommend them
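The serving-time steps above can be sketched in NumPy. The cached item matrix and the user embedding below are random stand-ins (in the two-tower case the user embedding would come from the user tower), and real systems typically replace this brute-force scan with approximate nearest-neighbor search:

```python
import numpy as np

rng = np.random.default_rng(0)
item_embeddings = rng.normal(size=(10_000, 32))   # step 2: pre-computed and cached
user_embedding = rng.normal(size=32)              # step 2: from cache or the user tower

# step 3: similarity (dot product here) between the user and every item
scores = item_embeddings @ user_embedding

# steps 4-5: sort by similarity and take the top N
top_n = np.argsort(scores)[::-1][:10]
```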
