Embedding in Recommender - Part 1: An Overview


Tags
Data Science
Date Published
April 7, 2024
💡 This is the first blog of the series Embedding in Recommender. It gives a high-level overview of embedding and the basic ideas behind it; more details and industry practices will follow in future blogs.
Blogs in the Embedding in Recommender series

Preliminary

Feature Representation and Machine Learning Models

In machine learning, data usually need to be processed into features and then fed into the model. During training, the model tries to learn the relationship between the features and the target.

Features can be represented in different forms. For example, on an e-commerce platform, a user_id can be used to represent a user. We can also use demographic information: a 31-year-old male user, for instance, can be represented as the vector [0, 31].

Different feature representations influence how efficiently a machine learning model can learn, so finding the feature representation that maximizes model learning is very important.

What is Embedding?

Embedding is a feature representation technique that maps entities (e.g. words in a language, users and items) to lower-dimensional continuous vectors. These vectors should reflect the properties of the original entities: for example, if two entities are similar in reality, their embedding vectors should be close to each other.

Since this technique was first introduced in natural language processing (NLP), let’s take word embedding as an example to explain the concept in detail:

Take three words: Google, Microsoft and China. Compared to China, Google should be more similar to Microsoft since they are both companies, so the distance between Embedding(Google) and Embedding(Microsoft) should be smaller than the distance between Embedding(Google) and Embedding(China).

Another example: the difference between Embedding(King) and Embedding(Queen) should be close to the difference between Embedding(Male) and Embedding(Female).
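
To make this concrete, here is a toy sketch with made-up 3-dimensional vectors (real word embeddings typically have hundreds of dimensions); cosine similarity is a common way to measure how close two embeddings are:

```python
import numpy as np

# Made-up 3-dimensional embeddings, purely for illustration.
emb = {
    "google":    np.array([0.9, 0.8, 0.1]),
    "microsoft": np.array([0.85, 0.75, 0.2]),
    "china":     np.array([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The two companies end up closer to each other than to the country.
print(cosine(emb["google"], emb["microsoft"]))  # high (~0.99)
print(cosine(emb["google"], emb["china"]))      # low (~0.30)
```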

Why is the Embedding Technique so Important for Recommenders?

There are mainly two reasons. In the recommender scenario:

  1. There are many categorical features and entities (such as user, item, user_country, and several levels of item categories). Machine learning models, especially deep learning models, can’t learn efficiently from these sparse, high-dimensional features, so embedding is needed to transform them into dense, lower-dimensional features.
  2. There are many different interactions between these entities (for example, a user can visit/save/buy an item, and this item might belong to a category that is another user’s favorite category). These interactions create different kinds of data with rich information. The embedding technique can learn this information and encode it into the embedding vectors, which helps the machine learning model.

Embedding Technique

Based on the technique and the data used, embedding techniques can be classified into three main categories:

  1. Embedding based on collaborative filtering where structured data are used
  2. Item2Vec where sequence data such as user-click-session data is used
  3. Node2Vec and GraphSAGE where graph data is used

Collaborative Filtering Embedding

Collaborative filtering is a classic recommender algorithm: it learns the similarity between users and items as well as user behavior, then predicts which items a user is likely to prefer.

Matrix Factorization (MF)

Basic Idea

In most scenarios, users’ feedback (or rating, or preference) on items is recorded. For example, after viewing an item, if the user booked it, the feedback (or rating) is 1; if not, the feedback is 0. Then:

  1. Based on this data, we can create a user-item rating matrix representing every user’s rating of every item
  2. With the help of numerical methods, this matrix can be decomposed into a user embedding matrix and an item embedding matrix. Then every user and every item has an embedding
  3. When we want to know whether user1 prefers item1 or item2, we can calculate the inner product between the user embedding and each item embedding. The higher the result, the stronger the user’s preference (see the sketch below)
(figure: the user-item rating matrix decomposed into a user embedding matrix and an item embedding matrix)
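
To illustrate the basic idea, here is a minimal numpy sketch that decomposes a made-up rating matrix with truncated SVD (one of the numerical methods listed in the next subsection) and scores user preference with an inner product:

```python
import numpy as np

# Toy user-item rating matrix: 4 users x 5 items (1 = booked, 0 = not booked).
R = np.array([
    [1, 0, 0, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 1, 1, 0, 1],
], dtype=float)

# Decompose R and keep k latent dimensions.
k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
user_emb = U[:, :k] * np.sqrt(s[:k])       # one k-dim embedding per user
item_emb = Vt[:k, :].T * np.sqrt(s[:k])    # one k-dim embedding per item

# Preference of user 0 for item 3 vs. item 2: the higher inner product wins.
print(user_emb[0] @ item_emb[3], user_emb[0] @ item_emb[2])
```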

Numerical Solutions

There are different numerical methods to decompose the user-item rating matrix

  1. Alternating Least Squares (ALS)
  2. Singular Value Decomposition (SVD)
  3. Gradient descent
  4. Pairwise learning with Bayesian Personalized Ranking (BPR)

Factorization Machine (FM)

Basic Idea

In short, instead of only embedding user_id and item_id, the factorization machine extends MF to also learn embeddings for other features.

In MF, only user_id, item_id and the user’s feedback are used, while many other features are left out. In contrast, linear regression can use arbitrary features. The factorization machine extends the idea of linear regression to include higher-order feature interactions, and at the same time borrows the idea of MF to map each feature to a vector.
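
A minimal sketch of the second-order FM scoring function, where each feature $i$ gets a linear weight $w_i$ and a latent vector $v_i$; all names and dimensions here are made up for illustration:

```python
import numpy as np

def fm_score(x, w0, w, V):
    """Second-order factorization machine score.
    x: (n,) feature vector, w0: global bias, w: (n,) linear weights,
    V: (n, k) latent vectors, i.e. one k-dim embedding per feature."""
    linear = w0 + w @ x
    # Pairwise interactions sum_{i<j} <V[i], V[j]> x_i x_j,
    # computed with the O(n*k) trick instead of a naive double loop.
    xv = x @ V                    # (k,)
    x2v2 = (x ** 2) @ (V ** 2)    # (k,)
    return linear + 0.5 * np.sum(xv ** 2 - x2v2)

# Tiny example: 4 features with 3-dimensional latent vectors.
rng = np.random.default_rng(0)
print(fm_score(np.array([1.0, 0.0, 1.0, 0.5]),
               w0=0.1, w=rng.normal(size=4), V=rng.normal(size=(4, 3))))
```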

More tech details about FM

FM + Multi Layer Perception (MLP)

There are embedding layers in an MLP; however, it’s hard to train these embeddings from scratch due to the huge number of parameters. If we use the embeddings from FM to initialize the embedding layers of the neural network, training becomes easier.

One of these works is the Factorization-machine supported Neural Network (FNN).

FNN architecture
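
A rough PyTorch sketch of the initialization idea (not the exact FNN architecture); the feature count, embedding size and MLP layers are made up:

```python
import torch
import torch.nn as nn

n_features, k = 10_000, 16

# Pretrained FM latent vectors (random here as a stand-in), shape (n_features, k).
fm_vectors = torch.randn(n_features, k)

# FNN-style trick: initialize the embedding layer with the FM vectors
# instead of a random initialization, then fine-tune with the MLP.
embedding = nn.Embedding(n_features, k)
embedding.weight.data.copy_(fm_vectors)

mlp = nn.Sequential(nn.Linear(k, 64), nn.ReLU(), nn.Linear(64, 1))
score = mlp(embedding(torch.tensor([42])))   # score for feature id 42
```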

Embedding Based on Sequence Data

Sequence Data in Recommender

As shown in this blog, including more features is one of the evolutionary trends for recommenders, since more useful data helps us make better decisions.

In the recommender scenario, there is always some kind of sequence data generated by users. For example:

  1. While shopping on an e-commerce platform, a user might click and check multiple items, which creates item-sequence data
  2. Over the last year, a user watched multiple TV series on Netflix, which creates TV-series sequence data

The assumption is that items or TV series that appear within the same sequence share some similarity, or at least that these sequences reflect the user’s preference. If we can get embeddings that capture this information, they will be very useful.

Item2Vec is the technique we want, and it is based on Word2Vec from natural language processing (NLP). DeepWalk and Node2Vec in the next section are also based on Word2Vec, so understanding Word2Vec is crucial. We’ll first dig into the details of Word2Vec, then talk about Item2Vec.

Word2Vec

Let’s take a source text The quick brown fox jumps over the lazy dog as an example to explain Word2Vec

Basic Ideas

The idea of Word2Vec is that if two words co-occur, there is some kind of similarity between them. To define co-occurrence, we need to define a window over the text sequence. Assume window_length = 2; below is an example of a window, where brown is the center word and The, quick, fox and jumps are the context words.

(figure: a sliding window of length 2 around the center word brown, covering the context words The, quick, fox and jumps)

So the words The, quick, brown, fox and jumps co-occur in the same window, which implies some similarity among these five words. But what kind of similarity or relationship is it?

There are two assumptions, which correspond to the two models in Word2Vec:

  1. Skip-gram: in a text sequence, a word (the center word) can be used to generate the words that surround it (the context words) → brown can be used to predict The, quick, fox and jumps
  2. Continuous Bag of Words (CBOW): in a text sequence, the words surrounding the center word can be used to generate the center word → The, quick, fox and jumps can be used to predict brown
CBOW and Skip-gram

Skip-gram has been shown to perform better, so we’ll focus on skip-gram here.
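
As a quick illustration, here is a minimal sketch that generates the (center, context) training pairs for skip-gram with window_length = 2 from the example sentence (the tokenization is deliberately naive):

```python
sentence = "The quick brown fox jumps over the lazy dog".split()
window_length = 2

# For every center word, pair it with each word within window_length positions.
pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window_length), min(len(sentence), i + window_length + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

print(pairs[:4])  # [('The', 'quick'), ('The', 'brown'), ('quick', 'The'), ('quick', 'brown')]
```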

Skip-gram with Negative Sampling

In skip-gram training, given a center word, the model needs to predict the context words. The problem is effectively a multi-class classification over the whole vocabulary, whose size typically ranges from $10^5$ to $10^7$, so training becomes very slow.

To accelerate the training, approximate training is needed. There are two approximation methods:

  1. Hierarchical softmax
  2. Negative sampling.

Since negative sampling has been shown to perform better, we’ll focus on it here.

The idea of negative sampling is simple: it changes the problem from multi-class classification to binary classification: predicting whether a given word is within the context window of the center word. For each positive pair, we randomly select a few words as negative samples. This makes training much more efficient.

Example of getting training samples for skip-gram with negative sampling
Mathematical representation of skip-gram with negative sampling
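
To make this concrete, a minimal sketch that turns one positive (center, context) pair into binary training samples; for simplicity the negatives are drawn uniformly, whereas real implementations usually sample from a smoothed unigram distribution:

```python
import random

vocab = "the quick brown fox jumps over lazy dog".split()
center, context = "brown", "fox"     # one positive pair taken from the window
num_negatives = 2

# Label the true context word 1 and randomly drawn words 0.
samples = [(center, context, 1)]
negatives = random.sample([w for w in vocab if w not in (center, context)], num_negatives)
samples += [(center, w, 0) for w in negatives]

print(samples)  # e.g. [('brown', 'fox', 1), ('brown', 'dog', 0), ('brown', 'the', 0)]
```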

Practical Tricks While Creating Training Samples

In text data, some words appear with very high frequency, such as the, a and in in English. It turns out the model learns little from them when they appear as positive samples, but learns more when they appear as negative samples. So for these high-frequency words, we need to:

Subsample them when they are positive samples
Oversample them when they are negative samples

These two tricks are applicable not only to Word2Vec but also to Item2Vec, where, for example in e-commerce, there are always very popular items.
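
For reference, a minimal sketch of the subsampling heuristic from the original Word2Vec paper: a frequent word is discarded from the positive samples with a probability that grows with its frequency (the threshold t is a hyperparameter, commonly around 1e-5):

```python
import math
import random

def keep_as_positive(word_freq, t=1e-5):
    """word_freq: the word's relative frequency in the corpus (count / total tokens)."""
    discard_prob = max(0.0, 1.0 - math.sqrt(t / word_freq))
    return random.random() > discard_prob

# A very frequent word like "the" (freq ~5%) is kept only rarely;
# a rare word (freq ~0.001%) is almost always kept.
print(keep_as_positive(0.05), keep_as_positive(0.00001))
```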

From Word2Vec to Item2Vec

When applying Word2Vec to sequence data in recommenders, we get Item2Vec. In terms of implementation, Item2Vec is almost the same as Word2Vec.

The only difference is that in Item2Vec, the spatial (or timestamp) information is ignored. Although in both methods the center word (or item) is used to predict the context words (or items), the definition of context differs:

  1. In Word2Vec, context words are selected from the words inside the context window
  2. In Item2Vec, context items are selected from all the items inside the same sequence (see the sketch below)
If you’re interested, here is the mathematical representation
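
A minimal sketch of this difference, using a made-up user session: in Item2Vec every pair of distinct items in the same session becomes a (center, context) training pair, regardless of position:

```python
from itertools import permutations

session = ["item_12", "item_7", "item_93", "item_5"]   # one user's click session (made-up ids)

# Item2Vec: every ordered pair of distinct items in the session is a (center, context) pair.
pairs = list(permutations(session, 2))
print(pairs[:3])  # [('item_12', 'item_7'), ('item_12', 'item_93'), ('item_12', 'item_5')]
```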

In addition, different business scenarios require different adaptations of Word2Vec. In the next blog, I’ll show some interesting adaptations from real-world cases.

Embedding Based on Graph Data

Graph Data in Recommender

It’s common for data to be represented as a graph, and the same holds in the recommender context. Below are several kinds of graph data from real-world scenarios:

Homogeneous graph: all the nodes are of the same type
Bipartite graph: unique structure with two node sets, where edges connect nodes from different sets
Heterogeneous graph: there are diverse node categories

For graph data, what we want are node (vertex) embeddings. Ideally, these embeddings should capture the following information:

  1. Graph topological information about the local community and the structural role.
  2. Node property information.

From Word2Vec to DeepWalk

Basic Idea

DeepWalk was invented to learn vertex embeddings from homogeneous graphs. It’s an important work that introduced Word2Vec to graph data. There are mainly two steps:

  1. Extract information from the graph → generate sequence data through random walks on the graph
  2. Capture and learn this information → apply Word2Vec to these sequences

There isn’t much new in step 2, so we’ll focus on step 1: the random walk.

Random Walk

The basic idea of a random walk is to randomly select a vertex in the graph as the starting point, and then follow predefined walking rules to generate a sequence of vertices.

Even before DeepWalk, random walks had been shown to be an efficient tool for extracting local community structure from a graph.

Random walk mathematical representation
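
A minimal sketch of a uniform (DeepWalk-style) random walk on a toy graph stored as an adjacency list; the graph and walk length are made up:

```python
import random

# Toy undirected graph as an adjacency list.
graph = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B"],
    "D": ["B"],
}

def random_walk(graph, start, walk_length):
    """Uniform random walk: at each step, jump to a random neighbor."""
    walk = [start]
    for _ in range(walk_length - 1):
        walk.append(random.choice(graph[walk[-1]]))
    return walk

# These walks are the "sentences" that get fed into Word2Vec in step 2.
print(random_walk(graph, "A", walk_length=5))  # e.g. ['A', 'C', 'B', 'D', 'B']
```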

Why Can Word2Vec be Applied Here?

But there is another question: even though we can get sequence data with random walks, why can we apply Word2Vec to them, and why does it work?

The answer is actually one of the main contributions of this work: it was observed that the frequency with which vertices appear in short random walks follows a power-law distribution. Word frequency in natural language follows a similar distribution, so techniques from language modelling (such as Word2Vec) can be re-purposed to capture information from short random walk sequences.

The image below shows this similarity.

From DeepWalk to Node2Vec

As mentioned above, in the first step of DeepWalk, graph information (such as similarity between vertices) is extracted with random walks. However, it turns out that a uniform random walk is not efficient at capturing the diversity of connectivity patterns in the graph.

There are two types of important information in graph:

  1. Homophily: nodes share similarity with their neighbor nodes
  2. Structural equivalence: nodes share similarity with nodes that have the same structural role in the graph

Node2Vec tries to capture these two types of information by adopting a biased random walk (sketched below), which combines:

  1. Breadth-First Sampling (BFS) to find local neighbor nodes.
    1. In the plot below, nodes $u$ and $s_1$ are neighbors, and node $s_1$ can be reached through BFS
  2. Depth-First Sampling (DFS) to find nodes that share similar topological properties.
    1. In the plot below, nodes $u$ and $s_6$ have a similar structural role: they are both centers of their subgraphs. Node $s_6$ can be reached through DFS
(figure: BFS and DFS exploration starting from node $u$)
Biased random walk mathematical representation
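
A minimal sketch of the biased transition rule with return parameter p and in-out parameter q, reusing the same style of toy adjacency-list graph as above; each candidate next node is weighted by its distance to the previous node:

```python
import random

graph = {"A": ["B", "C"], "B": ["A", "C", "D"], "C": ["A", "B"], "D": ["B"]}

def biased_step(prev, curr, p=1.0, q=0.5):
    """One Node2Vec step: weight each candidate next node by its distance to `prev`."""
    weights = []
    for nxt in graph[curr]:
        if nxt == prev:              # distance 0: going straight back
            weights.append(1.0 / p)
        elif nxt in graph[prev]:     # distance 1: stays close to prev (BFS-like)
            weights.append(1.0)
        else:                        # distance 2: moves away from prev (DFS-like)
            weights.append(1.0 / q)
    return random.choices(graph[curr], weights=weights, k=1)[0]

def node2vec_walk(start, walk_length, p=1.0, q=0.5):
    walk = [start, random.choice(graph[start])]
    while len(walk) < walk_length:
        walk.append(biased_step(walk[-2], walk[-1], p, q))
    return walk

print(node2vec_walk("A", walk_length=6))
```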

GraphSAGE (SAmple and aggreGatE)

In real-world recommenders, there are tens of millions of nodes and edges, and new nodes and edges are constantly being added to the graph. These two properties pose challenges for DeepWalk and Node2Vec:

  1. In both methods, every node has its own embedding vector, so the number of parameters grows linearly with the number of nodes. This does not scale to tens of millions of nodes
  2. Both methods are only applicable to a fixed graph, which means the model needs to be re-trained whenever new nodes are added.

Then GraphSAGE was introduced: instead of learning an individual embedding for each node, it learns a function that generates embeddings by sampling and aggregating features from a node’s local neighborhood. In addition, thanks to the properties of graph convolutional networks, GraphSAGE can also incorporate other node features.
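
To give a rough feel for “sample and aggregate” before that post, here is a minimal sketch of a single mean-aggregation step on made-up node features (the trainable weight matrices, non-linearity and multi-layer stacking are omitted):

```python
import random
import numpy as np

graph = {"A": ["B", "C"], "B": ["A", "C", "D"], "C": ["A", "B"], "D": ["B"]}
features = {n: np.random.randn(4) for n in graph}   # made-up 4-dim node features

def sample_and_aggregate(node, num_samples=2):
    """One GraphSAGE-style step: sample neighbors, mean-aggregate their features,
    and concatenate with the node's own features."""
    neighbors = random.sample(graph[node], min(num_samples, len(graph[node])))
    aggregated = np.mean([features[n] for n in neighbors], axis=0)
    return np.concatenate([features[node], aggregated])

print(sample_and_aggregate("A").shape)  # (8,)
```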

Unfortunately, due to the space limitations of this blog, the details of GraphSAGE will be discussed in future blog posts.

GraphSAGE sample and aggregate approach

Reference