- Recap & Introduction
- Graph Embedding
- Embedding-based Candidate Generation
- EGES from Alibaba (2017)
- Business & Challenges
- Existing Baseline
- Details of New Model
- Graph Construction & Random Walk Generation
- Base Graph Embedding (BGE)
- Graph Embedding with Side Information (GES)
- Enhanced Graph Embedding with Side Information (EGES)
- Performance
- PinSage from Pinterest (2018)
- Business & Challenges
- GraphSAGE
- From GCN to GraphSAGE
- Detailed Explanation
- PinSage
- Features (Side Information) of Nodes
- Random-walk Based Neighbor Sampling
- Importance Based Neighborhood Aggregator
- Max-margin Ranking Loss
- Sampling Negative Samples
- Curriculum Training
- Performance
- Final Thoughts
- Reference
Recap & Introduction
Graph Embedding
In part 1, we learnt about two categories of graph embedding techniques:
- Embedding techniques such as DeepWalk and Node2Vec, which are based on random walks
- Embedding techniques such as GraphSAGE, which are based on the Graph Convolutional Network (GCN)
Today, we’ll talk about two industry practices, one from each category:
- Enhanced Graph Embedding with Side Information (EGES), which is based on DeepWalk, from the e-commerce platform Taobao (owned by Alibaba)
- PinSage, which is based on GCN, from the visual catalog platform Pinterest
Embedding-based Candidate Generation
In part 2, we learnt that embedding-based candidate retrieval is one of the most important applications of embedding techniques in recommenders. One such application is content-based filtering (Item2Item), whose idea is to find similar items based on the inner product between the embeddings of the query item and the candidate items.
For both industry practices, the graph embedding technique is applied in the Item2Item scenario:
- On the Taobao platform, a two-stage recommender is adopted: matching and ranking. Graph embedding is utilized in the matching stage.
- The recommender system at Pinterest involves separate stages of candidate retrieval, ranking and re-ranking. PinSage can be used in the candidate generation stage.
EGES from Alibaba (2017)
Business & Challenges
It’s mentioned in the paper that there are three challenges for the recommender at Taobao:
- Scalability: many existing recommenders fail on Taobao’s large-scale dataset
- Sparsity: it’s difficult to train a recommender because users tend to interact with only a small number of items
- Cold start: millions of new items are uploaded every hour with no related user behaviors, and it’s challenging to predict users’ preferences for these items
Existing Baseline
Before graph embedding, the baseline model in production is a Collaborative Filtering (CF) based embedding, which calculates the similarity between two items according to item co-occurrence and user voting weight. In addition, the similarity measurement is well tuned with a large number of hand-crafted heuristic strategies and is tailored to Taobao’s business, which means it represents a challenging baseline.
However, there are two main downsides for this baseline:
- It only considers the co-occurrence of items, but ignores the sequential information, which can reflect users’ preferences more precisely
- CF-based embeddings can’t handle the challenges of sparsity and cold start very well
Details of New Model
To create the new embedding, the same steps as in random-walk based graph embedding are followed:
- Construct the graph data
- Apply random walks to the graph: select a vertex randomly as the starting point, then generate a sequence of vertices based on a predefined walking rule
- Apply Item2Vec (mostly skip-gram with negative sampling) to these sequences, as sketched below
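The last step is plain skip-gram training over the generated walks. Below is a minimal sketch using gensim; the item IDs, walk data and hyper-parameters are illustrative, not the ones used in the EGES paper.

```python
# Minimal sketch: train skip-gram with negative sampling on random-walk
# sequences using gensim. Walk data and hyper-parameters are illustrative.
from gensim.models import Word2Vec

# each walk is a list of item IDs produced by the random-walk step
walks = [
    ["item_D", "item_A", "item_B"],
    ["item_B", "item_E", "item_D"],
]

model = Word2Vec(
    sentences=walks,
    vector_size=64,   # embedding dimension
    window=5,         # context window over the walk sequence
    sg=1,             # skip-gram
    negative=5,       # negative sampling
    min_count=1,
)

item_embedding = model.wv["item_A"]  # learned item embedding
```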
Graph Construction & Random Walk Generation
The image above shows the detailed steps:
- First, the original data is session-based user behavior data, for example users’ click sequences.
- Invalid data and abnormal behaviors should be removed, such as unintentional clicks, over-active users, etc.
- To reflect the similarity between different items based on all users’ behaviors, a homogeneous graph with directed weighted edges (check here for the graph data in recommenders) is generated:
  - The vertices of the graph are items, and two items are connected if they occur consecutively, e.g., in figure (b), item D and item A are connected because a user accessed item D and item A consecutively in figure (a)
  - For the edge connecting item A and item B in figure (b), the direction represents the transition direction in the users’ behavior history; the weight of the edge equals the frequency of the transition from item A to item B across the whole behavior history
- In figure (c), after applying random walks, new sequential data is generated (see the sketch below)
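To make this concrete, here is a minimal Python sketch of building the directed weighted item graph from (hypothetical) session data and generating weighted random walks over it. It is a toy illustration, not the production Taobao pipeline.

```python
# Minimal sketch of graph construction and weighted random walks
# (hypothetical session data; not the production Taobao pipeline).
import random
from collections import defaultdict

sessions = [["D", "A", "B"], ["B", "E", "D", "E", "F"]]  # users' click sequences

# Build a directed weighted graph: edge weight = frequency of the
# transition "item_i -> item_j" across all sessions.
graph = defaultdict(lambda: defaultdict(int))
for session in sessions:
    for src, dst in zip(session, session[1:]):
        graph[src][dst] += 1

def random_walk(start, walk_length):
    """Walk by sampling the next item proportionally to the edge weights."""
    walk = [start]
    for _ in range(walk_length - 1):
        neighbors = graph[walk[-1]]
        if not neighbors:
            break
        items, weights = zip(*neighbors.items())
        walk.append(random.choices(items, weights=weights, k=1)[0])
    return walk

# generate several walks per starting node; these become the "sentences"
walks = [random_walk(node, walk_length=10) for node in list(graph) for _ in range(5)]
```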
Base Graph Embedding (BGE)
Compared to the original DeepWalk, there is no huge change in BGE, except that the transition probability of the random walk is adjusted to account for the directed weighted graph.
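Concretely, with $M$ denoting the weighted adjacency matrix and $N_+(v_i)$ the set of outgoing neighbors of $v_i$, the transition probability used in the paper is

$$P(v_j \mid v_i) = \begin{cases} \dfrac{M_{ij}}{\sum_{j' \in N_+(v_i)} M_{ij'}}, & v_j \in N_+(v_i) \\ 0, & \text{otherwise} \end{cases}$$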
The image below shows the model architecture; please check here for more details about DeepWalk.
Graph Embedding with Side Information (GES)
With BGE, higher-order item similarities can be learnt from users’ behavior sequences, which are ignored by the CF-based baseline. However, it’s still challenging to learn accurate embeddings for cold-start items, i.e., those with no user interactions.
We’ve talked about this problem in part 3, where side information is used to create embeddings at Expedia. A similar solution is adopted here. In the context of an e-commerce recommender, side information refers to the category, shop, price, etc.
The model architecture is shown below, where $H_v$ is the final aggregated item embedding.
Compared to the practice from Expedia, the only difference is the method used to combine the different embeddings into the final aggregated item embedding. Assume that for item $v$ with $n$ types of side information, we have $n+1$ embeddings $W_v^0, W_v^1, \ldots, W_v^n$, where $W_v^0$ denotes the embedding of the item itself:
- In the Hotel2Vec work from Expedia, these embeddings are concatenated, then another neural network layer is applied to get the final item embedding ($W$ denotes the model weights)
- In this work, an average-pooling operation is applied to these embeddings to get the final item embedding, so the formula of the final embedding is

$$H_v = \frac{1}{n+1} \sum_{s=0}^{n} W_v^s$$
Enhanced Graph Embedding with Side Information (EGES)
Despite the performance gains of GES, the assumption behind the average-pooling operation, namely that each embedding contributes equally to the final embedding, doesn’t reflect reality. Here is an example from the original paper:
For example, a user who has bought an iPhone tends to view MacBook or iPad because of the brand “Apple,” while a user may buy clothes of different brands in the same shop on Taobao for convenience and lower prices.
To address this issue, EGES was proposed, whose key idea is that different types of side information contribute differently when creating the final embedding. Below is the model structure:
Compared to GES, the only difference is that weights are assigned to the embeddings when creating the final embedding (see the numerical sketch after this list). So the formula of the final embedding is

$$H_v = \frac{\sum_{j=0}^{n} e^{a_v^j} W_v^j}{\sum_{j=0}^{n} e^{a_v^j}}$$

where $a_v^j$ denotes the weight of the $j$-th embedding of item $v$:
- $e^{a_v^j}$ is used instead of $a_v^j$ to make sure the contribution of each embedding is greater than 0
- $\sum_{j=0}^{n} e^{a_v^j}$ is used to normalize the weights
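As a quick illustration of this aggregation, here is a minimal numpy sketch for a single item; the shapes are placeholders and, in the real model, the weights $a_v^j$ are learned jointly with the embeddings.

```python
# Minimal numpy sketch of the EGES weighted aggregation for one item
# (illustrative shapes; the real model learns a_v jointly with the embeddings).
import numpy as np

n_side = 3          # number of side-information types (category, shop, price, ...)
dim = 8             # embedding dimension

# W_v[0] is the item embedding itself, W_v[1..n] are side-information embeddings
W_v = np.random.randn(n_side + 1, dim)
a_v = np.random.randn(n_side + 1)   # learnable weights a_v^j for this item

weights = np.exp(a_v)               # e^{a_v^j} keeps every contribution positive
H_v = (weights[:, None] * W_v).sum(axis=0) / weights.sum()  # softmax-weighted average
```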
Performance
As mentioned before, the application scenario is Item2Item candidate generation in the matching stage, which is followed by a deep-learning-powered ranking stage. An online test was conducted to measure the performance, and the CTR of the recommendation engine is the key metric. Below is the performance:
PinSage from Pinterest (2018)
Business & Challenges
Pinterest is a visual catalog with several billion pins, which are visual bookmarks containing a description, a link, and an image or a video. Users at Pinterest view pins and curate them into collections called boards. This way, a single pin can be saved by thousands of users into tens of thousands of different boards.
Pinterest can be viewed as a bipartite graph of 7 billion pins and boards, with over 100 billion edges.
Related pin recommendation is one of the recommendation scenarios at Pinterest, and embedding-based Item2Item recommendation is a good solution. However, collaborative filtering based embeddings and random-walk based embeddings are not applicable in Pinterest’s scenario:
- These embeddings are trained on the whole graph, and when new data is added, the embeddings have to be re-trained. At Pinterest, however, new pins are uploaded every second, which makes it infeasible to keep re-training these embeddings all the time
- With these methods, every item ends up with its own stored embedding. The billions of pins on Pinterest therefore translate to significant storage overhead for these embeddings.
Based on the GraphSAGE algorithm, PinSage was developed to address the aforementioned challenges. In the following sections, we’ll first explain GraphSAGE and then delve into its adaptations in PinSage.
GraphSAGE
From GCN to GraphSAGE
Graph Convolutional Network (GCN) is a type of graph neural network which borrows the core idea of the Convolutional Neural Network (CNN) and adapts it to work on graph data:
- CNN is mostly used on image data to learn local patterns with the sliding-window convolution operation, and GCN adopts a similar convolution operation to aggregate feature information from a node’s local graph neighborhood.
- Just like CNNs learn filters that activate on certain patterns in images, GCNs learn a function/filter that extracts patterns from a node’s neighborhood in the graph.
GCN has been proved to be an effective way to leverage both content information and graph structure. However, to apply it in industries with huge graph data, there is a main setback: GCN has to be trained on the entire graph, and it aggregates information from all neighbors of a node. This makes it infeasible to apply GCN to a graph with billions of nodes whose structure is constantly evolving.
GraphSAGE was then invented to solve the above problem. The key idea is to learn a function that aggregates feature information from a node’s local neighborhood, then use this function and the local neighborhood information to create the embedding.
Compared to a normal GCN, the main differences are:
- Instead of aggregating information from all neighbors, GraphSAGE uses a sampling strategy to sample a fixed number of neighbors for each node and aggregates their information
- GraphSAGE is designed to be an inductive framework which can generalize to unseen nodes in large or dynamic graphs. Instead of learning a distinct embedding for each node, an aggregation function is learnt which maps node features to embeddings.
Detailed Explanation
The name GraphSAGE (Graph SAmple and aggreGatE) indicates there are two main components:
- Sample: a method to sample a fixed number of neighbors for a node
- Aggregate: a method to aggregate feature information from the neighbors
To understand GraphSAGE, let’s try to answer three key questions in this section:
- How to sample the neighbors of a node?
- Assuming GraphSAGE has been trained and the model parameters are now fixed, how to get the node embedding?
- How to train GraphSAGE?
How to sample the neighbors of a node?
For node A in the image below, its neighbors are
- 1st-hop neighbors: B, C, D
- 2nd-hop neighbors: E, F
For a large graph, a node may have too many neighbors, which slows down the computation. So a sampling strategy is applied to select a fixed number of neighbors for each node; GraphSAGE uses random sampling, as sketched below.
If we set the number of sampled neighbors per hop to 1, then for node A, the sampled neighbors might be
- 1st-hop neighbor: C
- 2nd-hop neighbor: F
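Below is a minimal sketch of this fixed-size random sampling, on a toy graph matching the example above (sampling with replacement is one common implementation choice, not necessarily the one used in the paper).

```python
# Minimal sketch of GraphSAGE-style fixed-size neighbor sampling (toy graph).
import random

adj = {"A": ["B", "C", "D"], "B": ["A"], "C": ["A", "F"], "D": ["A", "E"],
       "E": ["D"], "F": ["C"]}

def sample_neighbors(node, num_samples):
    """Uniformly sample a fixed number of neighbors (with replacement)."""
    neighbors = adj[node]
    return [random.choice(neighbors) for _ in range(num_samples)]

# two-hop sampling with 1 neighbor per hop for node A
hop1 = sample_neighbors("A", num_samples=1)                            # e.g. ["C"]
hop2 = [n for u in hop1 for n in sample_neighbors(u, num_samples=1)]   # e.g. ["F"]
```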
How to get the node embedding (forward propagation) ?
The intuition is that:
- There are several convolutional layers (or depths), and each layer aggregates information from the local neighbors
- As this process iterates, nodes incrementally gain more and more information from further reaches of the graph
Assume the number of layers is $K$ and the input feature of node $v$ is $x_v$. The formulas below show how to generate the embedding of node A, namely $z_A$.
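For reference, a compact way to write this forward pass, following the standard GraphSAGE formulation (with the mean aggregator shown as one concrete choice):

$$h_v^0 = x_v$$
$$h_{N(v)}^k = \text{AGGREGATE}_k\big(\{h_u^{k-1}, \forall u \in N(v)\}\big), \quad k = 1, \ldots, K$$
$$h_v^k = \sigma\big(W^k \cdot \text{CONCAT}(h_v^{k-1},\, h_{N(v)}^k)\big)$$
$$z_v = h_v^K$$

With the mean aggregator, $\text{AGGREGATE}_k$ is simply $\frac{1}{|N(v)|}\sum_{u \in N(v)} h_u^{k-1}$, so the embedding of node A is $z_A = h_A^K$.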
There are different types of aggregators such as mean aggregator, LSTM aggregator and pooling aggregator.
How to train GraphSAGE (backward propagation)?
To train the weight matrices $W^k$ and the aggregator parameters, an unsupervised approach is adopted, which is similar to that of DeepWalk.
- If two nodes co-occur within a fixed-length random walk, they form a positive pair; otherwise they form a negative pair
- The loss is the negative-sampling loss shown below. It encourages nearby nodes to have similar representations, while enforcing that the representations of disparate nodes are highly distinct
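For reference, the unsupervised loss from the GraphSAGE paper takes the following form, where $v$ is a node that co-occurs with $u$ on a random walk, $\sigma$ is the sigmoid, $P_n$ is the negative-sampling distribution, and $Q$ is the number of negative samples:

$$J_{\mathcal{G}}(z_u) = -\log\big(\sigma(z_u^\top z_v)\big) - Q \cdot \mathbb{E}_{v_n \sim P_n(v)}\log\big(\sigma(-z_u^\top z_{v_n})\big)$$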
PinSage
Although PinSage is based on GraphSAGE, it introduced several changes to handle the massive scale of Pinterest’s recommendation graph effectively. We’ll mainly focus on these changes in the following sections.
Features (Side Information) of Nodes
In terms of the input features of the nodes, which will also be the layer-0 embeddings $h_v^0$, PinSage uses the following (a rough sketch follows the list):
- Visual information: a visual embedding generated with deep learning
- Textual annotations (title, description): an annotation embedding generated with deep learning
- Graph structural information: the log degree of the node/pin in the graph
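A rough sketch of how such a node feature vector could be assembled; the embedding dimensions here are placeholders, not Pinterest’s actual ones, and the feature engineering is more involved in practice.

```python
# Hypothetical node-feature construction for a pin: concatenate the visual
# embedding, the annotation embedding, and the log degree of the node.
import numpy as np

visual_emb = np.random.randn(512)      # e.g. output of an image model
annotation_emb = np.random.randn(128)  # e.g. embedding of title/description
degree = 37                            # number of neighbors of this pin

node_feature = np.concatenate([visual_emb, annotation_emb, [np.log(degree)]])
```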
Random-walk Based Neighbor Sampling
Due to the large scale of the graph data at Pinterest, every pin has plenty of neighbors, and the random sampling strategy from GraphSAGE might lose important information. To solve this issue, importance-based neighbor sampling is a better solution.
For a node $v$, instead of selecting neighbors randomly from its immediate neighbors, the following steps are applied (see the sketch after the note below):
- Simulate several random walks starting from node $v$
- Calculate the $L_1$-normalized visit count of the nodes visited by the random walks
- The top $T$ nodes with the highest normalized visit counts are defined as the neighborhood of node $v$
(This random-walk method is actually from a previous work called Pixie from Pinterest. For more details, please refer to Pixie: A System for Recommending 3+ Billion Items to 200+ Million Users in Real-Time.)
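Below is a minimal Python sketch of this idea on a toy graph; the walk parameters and graph are illustrative, not Pinterest’s, and the production system relies on the Pixie random-walk engine.

```python
# Minimal sketch of random-walk based neighbor sampling: simulate short random
# walks from a node, L1-normalize the visit counts, and keep the top-T visited
# nodes as the neighborhood (toy graph, illustrative parameters).
import random
from collections import Counter

adj = {"A": ["B", "C", "D"], "B": ["A", "C"], "C": ["A", "B", "F"],
       "D": ["A", "E"], "E": ["D"], "F": ["C"]}

def importance_neighborhood(node, num_walks=100, walk_length=3, top_t=2):
    counts = Counter()
    for _ in range(num_walks):
        cur = node
        for _ in range(walk_length):
            cur = random.choice(adj[cur])
            counts[cur] += 1
    total = sum(counts.values())
    normalized = {n: c / total for n, c in counts.items() if n != node}
    # top-T nodes by normalized visit count, keeping the counts as weights
    top = sorted(normalized.items(), key=lambda kv: kv[1], reverse=True)[:top_t]
    return dict(top)

neighbors_with_weights = importance_neighborhood("A")
```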
Importance Based Neighborhood Aggregator
Since importance is considered while sampling the neighbors, it’s better to also consider importance while aggregating the neighborhood information. The normalized visit counts from the random-walk based neighbor sampling step are used as weights for the neighbors, which turns the aggregation into an importance-based pooling method (sketched below).
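A minimal sketch of this importance pooling, reusing the normalized visit counts from the sampling step above as aggregation weights (illustrative values; the real aggregator also applies learned transformations to the neighbor embeddings).

```python
# Minimal sketch of importance pooling: aggregate neighbor embeddings
# weighted by their normalized visit counts (continuing the toy example above).
import numpy as np

neighbor_embeddings = {"B": np.random.randn(8), "C": np.random.randn(8)}
weights = {"B": 0.6, "C": 0.4}   # normalized visit counts from the sampling step

aggregated = sum(weights[n] * emb for n, emb in neighbor_embeddings.items())
```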
Max-margin Ranking Loss
The model is trained in a supervised fashion, and a max-margin ranking loss is applied.
Assume $z_q$ and $z_i$ represent the embeddings of the query item and of a positive sample, and $P_n(q)$ denotes the distribution of negative samples for the query item $q$.
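The loss then takes the following form (following the PinSage paper), where $\Delta$ is the margin hyper-parameter: the inner product of the query with a negative item should be lower than that with the positive item by at least the margin.

$$J(z_q, z_i) = \mathbb{E}_{n_k \sim P_n(q)} \max\big\{0,\; z_q \cdot z_{n_k} - z_q \cdot z_i + \Delta\big\}$$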
Sampling Negative Samples
As mentioned above, negative sampling is used in the loss function. For PinSage, there are two changes compared to the traditional sampling strategy.
First, normally a fixed number of negative samples would be sampled for each node. To improve efficiency when training with a large batch size, 500 negative samples are sampled and shared by all training examples in each minibatch. Empirically, no performance difference was observed compared to the traditional way.
Second, there are 2 billion items at Pinterest. Sampling 500 items out of 2 billion means that these negative samples are, most of the time, completely unrelated to the query item. It also means that it’s too easy for the model to learn this difference, and the model can’t handle negative samples that are similar to the query item. To solve this problem, hard negative samples are introduced: these negatives should be somewhat related to the query item, but they are still not positive samples. They are generated by ranking items in the graph according to their Personalized PageRank scores with respect to the query item; items ranked at 2000-5000 are randomly sampled as hard negative items (see the sketch below).
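A minimal sketch of the hard-negative selection; the Personalized PageRank ranking is assumed to be computed elsewhere, and the number of hard negatives per query is illustrative.

```python
# Minimal sketch of hard-negative selection: given items ranked by their
# Personalized PageRank score w.r.t. the query item, randomly sample hard
# negatives from the 2000-5000 rank range.
import random

ppr_ranked_items = [f"pin_{i}" for i in range(10_000)]  # placeholder ranking

def sample_hard_negatives(ranked_items, num_hard=6):
    candidate_pool = ranked_items[2000:5000]
    return random.sample(candidate_pool, num_hard)

hard_negatives = sample_hard_negatives(ppr_ranked_items)
```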
(The sampling strategy for negative samples is actually an important topic for candidate generation; we’ll talk more about it in future blogs.)
Curriculum Training
Including hard negative samples helps the model learn better, but it also makes the model learn more slowly: it doubles the number of epochs needed for the model to converge. To solve this issue, a curriculum training strategy was adopted, whose basic idea is to learn from easy negatives first, then gradually learn from hard negative samples.
- In the first epoch of training, no hard negative samples are used, so that the model can quickly find an area in the parameter space where the loss is relatively low
- In subsequent epochs, hard negative samples are added for each query item, which helps the model gradually learn to distinguish highly related pins from only slightly related ones
- Put differently: at epoch $n$ of training, $n-1$ hard negative items are added to the set of negative items for each item
Performance
An online test was conducted, and here is the performance:
- It was found that PinSage consistently recommends pins that are more likely to be re-pinned by the user than those from the alternative methods
- Depending on the particular setting, 10-30% improvements in repin rate over the Annotation and Visual embedding based recommendations have been observed
Final Thoughts
Although the above two works are from different industries and are based on different techniques, we do observe some common methodologies, for example:
- Including more side information to improve the embeddings and alleviate the cold-start problem
- Importance-based embedding aggregation