- Preliminary
- Feature Representation and Machine Learning Models
- What is Embedding?
- Why is Embedding Technique so Important for Recommender?
- Embedding Technique
- Collaborative Filtering Embedding
- Matrix Factorization (MF)
- Factorization Machine (FM)
- Embedding Based on Sequence Data
- Sequence Data in Recommender
- Word2Vec
- From Word2Vec to Item2Vec
- Embedding Based on Graph Data
- Graph Data in Recommender
- From Word2Vec to DeepWalk
- From DeepWalk to Node2Vec
- GraphSAGE (SAmple and aggreGatE)
- Reference
Preliminary
Feature Representation and Machine Learning Models
For machine learning, most of the time, data need to be processed into features and then fed into the model. During training, the model tries to understand the relationship between the features and the target.
Features can be represented in different forms. For example, in an e-commerce platform, a user can be represented by `user_id`. We can also use demographic information to represent a user: for example, a 31-year-old male user can be represented as the vector `[0, 31]`.
Different feature representations will influence the learning efficiency of machine learning models, so finding the feature representation that maximizes model learning is very important.
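As a minimal illustration of these two representations of the same user (the id, gender encoding and age below are made up):

```python
# Two ways to represent the same user as a feature.
user = {"user_id": 1042, "gender": "male", "age": 31}

# 1) Represent the user only by id (a sparse categorical feature)
id_feature = user["user_id"]

# 2) Represent the user by demographic information as a dense vector
gender_code = 0 if user["gender"] == "male" else 1
demographic_feature = [gender_code, user["age"]]   # -> [0, 31]

print(id_feature, demographic_feature)
```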
What is Embedding?
Embedding is a feature representation technique that maps entities (e.g., words in a language, users and items) to lower-dimensional continuous vectors. These vectors should reflect the properties of the original entities: for example, if two entities are similar in reality, their embedding vectors should be close to each other.
Since this technique was first introduced in natural language processing (NLP), let’s take word embedding as an example to explain this concept in detail:
Consider three words: `Google`, `Microsoft` and `China`. Compared to `China`, `Google` should be more similar to `Microsoft` since they are both companies, so the distance between `Embedding(Google)` and `Embedding(Microsoft)` should be smaller than that between `Embedding(Google)` and `Embedding(China)`.
Another example: the difference between `Embedding(King)` and `Embedding(Queen)` should be close to the difference between `Embedding(Male)` and `Embedding(Female)`.
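A minimal sketch of this intuition, using made-up 3-dimensional vectors (real word embeddings are learned and have hundreds of dimensions):

```python
import numpy as np

# Toy embedding vectors, invented for illustration only.
embedding = {
    "Google":    np.array([0.90, 0.80, 0.10]),
    "Microsoft": np.array([0.85, 0.75, 0.15]),
    "China":     np.array([0.10, 0.20, 0.90]),
}

def distance(a, b):
    # Euclidean distance between two embedding vectors
    return np.linalg.norm(embedding[a] - embedding[b])

print(distance("Google", "Microsoft"))   # small: both are companies
print(distance("Google", "China"))       # larger: less similar entities
```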
Why is Embedding Technique so Important for Recommender?
There are mainly two reasons in the recommendation scenario:
- There are so many categorical features and entities (such as user, item, user_country, several levels of item categories) that machine learning models, especially deep learning models, can’t learn efficiently from these sparse high-dimensional features. Embedding is needed to transform them into dense, lower-dimensional features (see the sketch after this list)
- There are so many different interactions between these entities (for example, a user can visit/save/buy an item, and this item might belong to a category that is another user’s favorite category). Different kinds of data with rich information are created during these interactions. The embedding technique can learn this information and encode it into the embedding vectors, which helps the machine learning model
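A minimal sketch of the first point, assuming PyTorch; the vocabulary size and embedding dimension are made up:

```python
import torch
import torch.nn as nn

# A sparse categorical feature such as user_id is one-hot over millions of
# values; an embedding layer maps it to a small dense vector instead.
num_users, embedding_dim = 1_000_000, 16
user_embedding = nn.Embedding(num_users, embedding_dim)

user_ids = torch.tensor([42, 31337])          # two example user ids
dense_vectors = user_embedding(user_ids)      # shape (2, 16) instead of a (2, 1_000_000) one-hot
print(dense_vectors.shape)
```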
Embedding Technique
Based on the technique and the data used, embedding techniques can be classified into three main categories:
- Embedding based on collaborative filtering where structured data are used
- Item2Vec where sequence data such as user-click-session data is used
- Node2Vec and GraphSAGE where graph data is used
Collaborative Filtering Embedding
Collaborative filtering is a classic recommender algorithm: it learns the similarity between users and items from user behavior data, and then predicts items that users will potentially prefer.
Matrix Factorization (MF)
Basic Idea
In most scenarios, users’ feedback (ratings or preferences) on items is recorded. For example, after checking an item, if the user booked it, then the feedback (or rating) is 1; if there was no booking, the feedback is 0. Then:
- Based on this data, we can create a `user-item rating matrix` representing all users’ ratings of all items
- With the help of numerical methods, this matrix can be decomposed into a `user embedding matrix` and an `item embedding matrix`. Then every user and every item has an embedding
- When we want to know whether user1 prefers item1 or item2, we can calculate the inner product between the user embedding and each item embedding. The higher the result, the stronger the user’s preference
Numerical Solutions
There are different numerical methods to decompose the user-item rating matrix (a minimal sketch using gradient descent follows the list):
- Alternating Least Squares (ALS)
- Singular Value Decomposition (SVD)
- Gradient descent
- Pairwise learning with Bayesian Personalized Ranking (BPR)
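A minimal sketch of matrix factorization trained with gradient descent; the rating matrix, embedding dimension and hyper-parameters are made up for illustration:

```python
import numpy as np

# Toy user-item rating matrix (1 = booked, 0 = not booked), 3 users x 4 items.
R = np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
], dtype=float)
n_users, n_items, k = R.shape[0], R.shape[1], 2

rng = np.random.default_rng(42)
U = rng.normal(scale=0.1, size=(n_users, k))   # user embedding matrix
V = rng.normal(scale=0.1, size=(n_items, k))   # item embedding matrix

lr, reg = 0.05, 0.01
for _ in range(500):
    for u in range(n_users):
        for i in range(n_items):
            err = R[u, i] - U[u] @ V[i]        # prediction error on one rating
            u_old = U[u].copy()
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * u_old - reg * V[i])

# Preference score of user 0 for item 1: inner product of the two embeddings
print(U[0] @ V[1])
```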
Factorization Machine (FM)
Basic Idea
Overall, instead of only embedding `user_id` and `item_id`, factorization machine extends MF to also learn embeddings for other features.
In MF, only `user_id`, `item_id` and the user’s feedback are used; however, there are many more features that are left unused. In contrast, linear regression can use more features. Factorization machine extends the idea of linear regression to include higher-order feature interactions and, at the same time, borrows the idea of MF to map each feature to a latent vector.
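A minimal sketch of the second-order FM prediction, `y(x) = w0 + Σ_i w_i x_i + Σ_{i<j} <v_i, v_j> x_i x_j`, using the efficient reformulation from the FM paper; all shapes and values below are made up:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """x: feature vector (n,), w0: bias, w: linear weights (n,), V: latent vectors (n, k)."""
    linear = w0 + w @ x
    # Pairwise interactions in O(k*n):
    # 0.5 * sum_f ((sum_i V[i,f]*x[i])^2 - sum_i V[i,f]^2 * x[i]^2)
    interaction = 0.5 * np.sum((V.T @ x) ** 2 - (V.T ** 2) @ (x ** 2))
    return linear + interaction

n, k = 6, 3
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=n).astype(float)   # e.g. binary / one-hot features
w0, w, V = 0.1, rng.normal(size=n), rng.normal(size=(n, k))
print(fm_predict(x, w0, w, V))
```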
FM + Multi Layer Perception (MLP)
There are embedding layers in MLPs; however, it is hard to train these embeddings due to the huge number of parameters. If we use the embeddings from FM to initialize the embedding layer of the neural network, training becomes easier.
One of these works is the Factorization-machine supported Neural Network (FNN).
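A minimal sketch of this initialization idea, assuming PyTorch; `fm_latent` stands in for latent vectors learned by a trained FM:

```python
import torch
import torch.nn as nn

n_features, k = 1000, 16
fm_latent = torch.randn(n_features, k)          # placeholder for FM latent vectors

embedding = nn.Embedding(n_features, k)
with torch.no_grad():
    embedding.weight.copy_(fm_latent)           # initialize from FM instead of random weights

# The embedding layer is then fine-tuned together with the rest of the MLP.
mlp = nn.Sequential(nn.Linear(k, 64), nn.ReLU(), nn.Linear(64, 1))
```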
Embedding Based on Sequence Data
Sequence Data in Recommender
As shown in this blog, including more features is one of the evolutionary trends of recommenders, since more useful data helps us make better decisions.
In the scenario of recommenders, there are always some kinds of sequence data generated by users. For example:
- While shopping on an e-commerce platform, a user might click and check multiple items, which creates item-sequence data
- Over the last year, a user watched multiple TV series on Netflix, which creates TV-series-sequence data
The assumption is that items or TV series that appear within the same sequence share some similarity, or at least that such sequences reflect the user’s preference. If we can get embeddings that capture this information, they will be very useful.
Item2Vec is the technique we want, and it is actually based on Word2Vec from natural language processing (NLP). DeepWalk and Node2Vec in the next section are also based on Word2Vec, so understanding Word2Vec becomes crucial. We’ll first dig into the details of Word2Vec, then talk about Item2Vec.
Word2Vec
Let’s take the source text `The quick brown fox jumps over the lazy dog` as an example to explain Word2Vec.
Basic Ideas
The idea of Word2Vec is that if two words co-exist, there is some kind of similarity between them. To define `co-exist`, we need to define a `window` over the text sequence. Assume `window_length = 2`; below is an example of a window, where `brown` is the `center word` and `The`, `quick`, `fox` and `jumps` are the `context words`.
So the words `The`, `quick`, `brown`, `fox` and `jumps` co-exist in the same window, and there is similarity between these five words. Then what kind of similarity or relationship is there?
There are two assumptions, which correspond to the two models in Word2Vec:
- Skip-gram: in a text sequence, a word (the `center word`) can be used to generate the words that surround it (the `context words`) → `brown` can be used to predict `The`, `quick`, `fox` and `jumps`
- Continuous Bag of Words (CBOW): in a text sequence, the words surrounding the center word can be used to generate the center word → `The`, `quick`, `fox` and `jumps` can be used to predict `brown`

Skip-gram has been shown to perform better, so we’ll focus on skip-gram here (the sketch below shows how its training pairs are generated).
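A minimal sketch of generating skip-gram (center word, context word) training pairs from the example sentence with `window_length = 2`:

```python
sentence = "The quick brown fox jumps over the lazy dog".split()
window_length = 2

pairs = []
for i, center in enumerate(sentence):
    # Context words: up to window_length words on each side of the center word
    for j in range(max(0, i - window_length), min(len(sentence), i + window_length + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

# For the center word "brown": ('brown', 'The'), ('brown', 'quick'),
# ('brown', 'fox'), ('brown', 'jumps')
print([p for p in pairs if p[0] == "brown"])
```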
Skip-gram with Negative Sampling
In skip-gram, during training, the model needs to predict the context words given a center word. So the problem is actually a classification problem over the whole vocabulary, i.e., the number of classes equals the vocabulary size, which can be huge. In this case, training becomes very slow.
To accelerate training, approximate training is needed. There are two approximation methods:
- Hierarchical softmax
- Negative sampling

Since negative sampling has been shown to perform better, we’ll focus on negative sampling here.
The idea behind negative sampling is simple: it turns the problem from classification over the whole vocabulary into a binary classification problem: predicting whether a word is within the context window of the center word. Randomly selected words serve as negative samples. In this case, training becomes much more efficient.
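A minimal sketch of training skip-gram with negative sampling, assuming the gensim library; the toy corpus and hyper-parameters are made up:

```python
from gensim.models import Word2Vec

corpus = [
    "the quick brown fox jumps over the lazy dog".split(),
    "the quick red fox jumps over the sleepy cat".split(),
]

model = Word2Vec(
    sentences=corpus,
    vector_size=32,   # embedding dimension
    window=2,         # context window length
    sg=1,             # 1 = skip-gram (0 = CBOW)
    negative=5,       # number of negative samples per positive pair
    min_count=1,
    epochs=50,
)

print(model.wv["fox"][:5])              # embedding vector of "fox"
print(model.wv.most_similar("fox"))     # nearest words in the embedding space
```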
Practical Tricks While Creating Training Samples
In text data, some words appear with very high frequency, such as `the`, `a` and `in` in English. It turns out that the model learns little from them when they appear as positive samples, but learns more when they appear as negative samples. So for these high-frequency words, we’ll need to down-sample them when generating positive training pairs (the subsampling trick), and draw negative samples in proportion to word frequency (raised to the 3/4 power in the original Word2Vec implementation), so that frequent words are more likely to be selected as negatives.
These two tricks are applicable not only to Word2Vec but also to Item2Vec, where, for example in e-commerce, there are always very popular items.
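A minimal sketch of the subsampling trick from the Word2Vec paper, where a word `w` with frequency `f(w)` is discarded from the training sequences with probability `1 - sqrt(t / f(w))` for a small threshold `t`:

```python
import math

def discard_prob(word_freq: float, t: float = 1e-5) -> float:
    # Probability of dropping a word occurrence during training-sample creation
    return max(0.0, 1.0 - math.sqrt(t / word_freq))

print(discard_prob(0.05))    # a very frequent word like "the": dropped ~98.6% of the time
print(discard_prob(1e-5))    # a rare word: never dropped
```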
From Word2Vec to Item2Vec
When applying Word2Vec to the sequence data in recommenders, we get Item2Vec. In terms of implementation, Item2Vec is almost the same as Word2Vec.
The only difference is that Item2Vec ignores the spatial (or timestamp) information: in both methods the center word (or item) is used to predict the context words (or items), but the definition of the context is different (see the sketch below)
- In Word2Vec, context words are selected from the words inside the context window
- In Item2Vec, context items are selected from all items inside the same sequence
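A minimal Item2Vec-style sketch, assuming the gensim library: each user’s interaction sequence is treated as a “sentence” of item ids, and the window is set large enough to cover a whole sequence, so every item in a sequence acts as context for every other item. The sequences are made up:

```python
from gensim.models import Word2Vec

item_sequences = [
    ["item_12", "item_7", "item_98", "item_7"],
    ["item_98", "item_3", "item_12"],
    ["item_7", "item_3", "item_55"],
]

max_len = max(len(seq) for seq in item_sequences)
item2vec = Word2Vec(
    sentences=item_sequences,
    vector_size=32,
    window=max_len,   # effectively "same sequence" instead of a sliding window
    sg=1,
    negative=5,
    min_count=1,
    epochs=100,
)

print(item2vec.wv.most_similar("item_7"))
```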
In addition, for different business scenarios, different adaptations of Word2Vec are needed. In the next blog, I’ll show these interesting adaptations from some real-world cases.
Embedding Based on Graph Data
Graph Data in Recommender
It’s common for data to be represented as a graph, and it’s the same in the context of recommenders. Below are several kinds of graph data from real-world scenarios:
For graph data, what we want is node (vertex) embeddings. Ideally, these embeddings should capture the following information:
- Graph topological information about the local community and the structural role of a node
- Node property information
From Word2Vec to DeepWalk
Basic Idea
DeepWalk was invented to learn vertex embeddings from homogeneous graphs. It’s an important work that introduced Word2Vec to graph data, and it consists of mainly two steps:
- Extract the information from the graph → get sequence data through random walk in the graph data
- Capture and learn this information → apply Word2Vec to these sequence data
For step 2, there isn’t much new, so we’ll focus on step 1: random walk.
Random Walk
The basic idea of random walk is to randomly select a vertex in the graph as the starting point, and then generate a sequence of vertices by following predefined walking rules.
Actually, even before DeepWalk, random walk had been shown to be an efficient tool for extracting local community structure from graphs.
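A minimal sketch of uniform random walks on a toy graph, assuming the networkx library; the graph, walk length and number of walks per node are made up:

```python
import random
import networkx as nx

G = nx.karate_club_graph()   # small built-in example graph

def random_walk(graph, start, walk_length):
    walk = [start]
    while len(walk) < walk_length:
        neighbors = list(graph.neighbors(walk[-1]))
        if not neighbors:            # dead end: stop the walk early
            break
        walk.append(random.choice(neighbors))
    return walk

# Generate several walks per node and cast node ids to strings, so the walks
# can be fed to Word2Vec exactly like sentences (step 2 of DeepWalk).
walks = [[str(n) for n in random_walk(G, node, walk_length=10)]
         for node in G.nodes() for _ in range(5)]
print(walks[0])
```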
Why Can Word2Vec be Applied Here?
But there is another question: even though we can get sequence data with random walk, why can we apply Word2Vec to it, and why does it work?
The answer is actually one of the main contributions of the DeepWalk paper: it was observed that the frequency with which vertices appear in short random walks follows a power-law distribution, and word frequency in natural language follows a similar distribution. Therefore, techniques from language modelling (such as Word2Vec) can be re-purposed to capture information from short random walk sequences.
From DeepWalk to Node2Vec
As mentioned above, in the first step of DeepWalk, graph information (such as similarity between vertices) is extracted with random walk. However, it turns out that a uniform random walk is not efficient at capturing the diversity of connectivity patterns in the graph.
There are two types of important information in a graph:
- Homophily: nodes are similar to their neighboring nodes
- Structural equivalence: nodes are similar to nodes that play the same structural role in the graph
Node2Vec tries to extract these two types of information by adopting a biased random walk (a sketch follows the list), which combines:
- Breadth-First Sampling (BFS) to find local neighbor nodes
  - In the plot below, the neighbors of the start node can be reached through BFS
- Depth-First Sampling (DFS) to find nodes that share similar topological properties
  - In the plot below, nodes with similar structural roles (e.g., both being the center of a sub-graph) can be reached through DFS
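A minimal sketch of the biased walk from the Node2Vec paper, assuming networkx: given the previously visited node `t` and the current node `v`, the unnormalized weight of moving to a candidate `x` is `1/p` if `x == t` (return), `1` if `x` is a neighbor of `t` (BFS-like), and `1/q` otherwise (DFS-like). The graph and the values of `p` and `q` are made up:

```python
import random
import networkx as nx

def biased_walk(graph, start, walk_length, p=1.0, q=0.5):
    walk = [start]
    while len(walk) < walk_length:
        cur = walk[-1]
        neighbors = list(graph.neighbors(cur))
        if not neighbors:
            break
        if len(walk) == 1:                     # first step: uniform choice
            walk.append(random.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for x in neighbors:
            if x == prev:                      # going back to the previous node
                weights.append(1.0 / p)
            elif graph.has_edge(x, prev):      # staying close to prev (BFS-like)
                weights.append(1.0)
            else:                              # moving away from prev (DFS-like)
                weights.append(1.0 / q)
        walk.append(random.choices(neighbors, weights=weights, k=1)[0])
    return walk

G = nx.karate_club_graph()
print(biased_walk(G, start=0, walk_length=10))
```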
GraphSAGE (SAmple and aggreGatE)
In real-world recommenders, there are tens of millions of nodes and edges, and new nodes and edges are constantly added to the graph. These two properties impose challenges on DeepWalk and Node2Vec:
- For these two methods, since every node has its own embedding vector, the number of parameters is linear in the number of nodes, which is not scalable to tens of millions of nodes
- These two methods are only applicable to a fixed graph, which means the model needs to be re-trained when new nodes are added
GraphSAGE was then introduced: instead of learning an individual embedding for each node, it learns a function that generates embeddings by sampling and aggregating features from a node’s local neighborhood. In addition, thanks to its graph-convolutional style of aggregation, GraphSAGE can also make use of other node features.
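A minimal sketch of the sample-and-aggregate idea with a mean aggregator, using a made-up adjacency list, random node features and an untrained weight matrix; a real implementation stacks several such layers and trains the weights end-to-end:

```python
import numpy as np

rng = np.random.default_rng(0)

adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}   # toy graph as an adjacency list
H = rng.normal(size=(4, 8))                           # input node features (4 nodes, dim 8)
W = rng.normal(size=(16, 8))                          # weights: concat(self, neighbors) -> dim 8

def graphsage_layer(H, adj, W, num_samples=2):
    out = np.zeros((len(adj), W.shape[1]))
    for v, neighbors in adj.items():
        sampled = rng.choice(neighbors, size=num_samples, replace=True)  # sample neighbors
        neigh_mean = H[sampled].mean(axis=0)           # aggregate sampled neighbor features
        concat = np.concatenate([H[v], neigh_mean])    # combine with the node's own features
        out[v] = np.maximum(0, concat @ W)             # linear transform + ReLU
    # L2-normalize each embedding, as in the GraphSAGE paper
    return out / (np.linalg.norm(out, axis=1, keepdims=True) + 1e-12)

print(graphsage_layer(H, adj, W).shape)                # (4, 8): one embedding per node
```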
Unfortunately, due to space limitations of this blog, the details of GraphSAGE will be discussed in future blog posts.