Embedding in Recommender - Part 4: Industry Practices in Graph Embedding

Tags
Data Science
Date Published
May 18, 2024
Blogs from the Embedding in Recommender series

Recap & Introduction

Graph Embedding

In part 1, we learnt about two categories of graph embedding techniques:

  1. Embedding techniques such as DeepWalk and Node2Vec, which are based on random walks
  2. Embedding techniques such as GraphSAGE, which are based on Graph Convolutional Networks (GCN).

Today, we’ll talk about one industry practice from each category:

  1. Enhanced Graph Embedding with Side Information (EGES), based on DeepWalk, from the e-commerce platform Taobao (owned by Alibaba)
  2. PinSage, based on GCN, from the visual catalog platform Pinterest

Embedding-based Candidate Generation

In part 2, we learnt that embedding-based candidate retrieval is one of the important applications of embedding techniques in recommenders. One such approach is content-based filtering (Item2Item), whose idea is to find candidate items based on the inner product between the embeddings of the query item and the candidate items.
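
To make this concrete, here is a minimal sketch of Item2Item retrieval by inner product; the item names and embeddings are made up for illustration, and a production system would typically use an approximate nearest-neighbor index instead of this brute-force scoring.

```python
import numpy as np

# Hypothetical catalog: each item has a learnt embedding (random here for illustration).
item_ids = ["item_A", "item_B", "item_C", "item_D", "item_E"]
embeddings = np.random.rand(len(item_ids), 64)   # shape: (num_items, dim)

def item2item_candidates(query_idx: int, top_k: int = 3):
    """Return the top_k most similar items to the query item by inner product."""
    scores = embeddings @ embeddings[query_idx]   # inner product with every item
    scores[query_idx] = -np.inf                   # exclude the query item itself
    best = np.argsort(-scores)[:top_k]            # indices with the highest scores
    return [(item_ids[i], float(scores[i])) for i in best]

print(item2item_candidates(0))
```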

For both industry practices, the graph embedding techniques are applied in the Item2Item scenario:

  1. The Taobao platform adopts a two-stage recommender: matching and ranking. Graph embedding is utilized in the matching stage.
  2. The recommender system at Pinterest involves separate stages of candidate retrieval, ranking and re-ranking. PinSage can be used in the candidate generation stage.

EGES from Alibaba (2017)

Business & Challenges

Taobao is an e-commerce platform in China which, in 2017, had one billion users and two billion items.

The paper mentions three challenges for the recommender at Taobao:

  1. Scalability: many existing recommenders fail on the large-scale dataset at Taobao
  2. Sparsity: it’s difficult to train a recommender because users tend to interact with only a small number of items
  3. Cold start: millions of new items are uploaded every hour with no related user behaviors, and it’s challenging to predict users’ preferences for these items

Existing Baseline

Before graph embedding, the baseline model in production was a Collaborative Filtering (CF) based embedding, which calculates the similarity between two items according to item co-occurrence and user voting weight. In addition, the similarity measurement is well tuned with a large number of hand-crafted heuristics and is tailored to Taobao’s business, which makes it a challenging baseline.

However, this baseline has two main downsides:

  1. It only considers the co-occurrence of items and ignores sequential information, which can reflect users’ preferences more precisely
  2. CF-based embedding can’t handle the challenges of sparsity and cold start very well

Details of New Model

To create the new embedding, the same steps as in random-walk based graph embedding are followed:

  1. Construct a graph from the data
  2. Apply random walks to the graph: select a vertex randomly as a starting point, then generate a sequence of vertices according to a predefined walking rule
  3. Apply Item2Vec (mostly skip-gram with negative sampling) to these sequences

Graph Construction & Random Walk Generation

[Image: from users' behavior sequences (a) to the weighted item graph (b) to random-walk generated sequences (c)]

The image above shows the detailed steps:

  1. The original data are session-based user behavior data, for example users’ click sequences.
    1. Invalid data and abnormal behaviors, such as unintentional clicks and over-active users, should be removed
  2. To reflect the similarity between different items based on all users’ behaviors, a homogeneous graph with directed weighted edges (check here for the graph data in recommender) is generated:
    1. The vertices of the graph are items, and two items are connected if they occur consecutively, e.g., in figure (b), item D and item A are connected because user U1 accessed item D and item A consecutively in figure (a)
    2. For the edge connecting item A and item B in figure (b), the direction represents the transition direction in users’ behavior history, and the weight of the edge equals the frequency of transitions from item A to item B across all users’ behavior histories
  3. In figure (c), new sequences are generated by applying random walks to the graph
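
As an illustration, here is a minimal sketch of how such a weighted directed item graph could be built from click sessions and how weighted random walks could be generated from it; the session data and walk parameters are made up, and the real pipeline at Taobao is far more involved.

```python
import random
from collections import defaultdict

# Hypothetical session-based click sequences (one list per user session).
sessions = [["D", "A", "B"], ["B", "E"], ["D", "E", "F"], ["A", "B", "C"]]

# 1. Build a directed weighted graph: edge weight = transition frequency.
weights = defaultdict(lambda: defaultdict(int))
for session in sessions:
    for src, dst in zip(session, session[1:]):
        weights[src][dst] += 1

# 2. Generate random-walk sequences, choosing the next node in proportion to the
#    outgoing edge weights (the transition probability for a directed weighted graph).
def random_walk(start, length):
    walk = [start]
    while len(walk) < length:
        neighbors = weights[walk[-1]]
        if not neighbors:                    # dead end: stop the walk early
            break
        nodes, w = zip(*neighbors.items())
        walk.append(random.choices(nodes, weights=w, k=1)[0])
    return walk

walks = [random_walk(node, length=5) for node in list(weights) for _ in range(3)]
# 3. These walks would then be fed to skip-gram with negative sampling (Item2Vec).
print(walks[:3])
```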

Base Graph Embedding (BGE)

Compared to the original DeepWalk, there is no big change in BGE except that the transition probability of the random walk is adjusted for the directed weighted graph.

The image below shows the model architecture; please check here for more details about DeepWalk.

[Image: model architecture of Base Graph Embedding (skip-gram over random-walk sequences)]

Graph Embedding with Side Information (GES)

With BGE, higher-order item similarity can be learnt from users’ behavior sequences, which is ignored by the CF-based baseline. However, it’s still challenging to learn accurate embeddings for cold-start items, i.e., those with no user interactions.

We’ve talked about this problem in part 3, where side information is used to create embeddings at Expedia. A similar solution is adopted here. In the context of an e-commerce recommender, side information refers to the category, shop, price, etc.

The model architecture is shown below, and $H_v$ is the final aggregated item embedding.

[Image: model architecture of GES]

Compared to the practice from Expedia, the only difference is the method used to combine the different embeddings into the final aggregated item embedding. Assume that for item $v$ with $n$ types of side information, we have $n+1$ embeddings $W_v^0, W_v^1, \dots, W_v^n$:

  1. In the Hotel2Vec work from Expedia, these embeddings are concatenated, then another neural network layer is applied to get the final item embedding ($W_e$ is the model weights):

$$H_v = \mathrm{ReLU}\big(\mathrm{concat}(\{W_v^s, s = 1, 2, \dots, n\})^T W_e\big)$$

  2. In this work, an average-pooling operation is applied to these embeddings to get the final item embedding:

$$H_v = \frac{1}{n+1}\sum_{s=0}^{n} W_v^s$$
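
Here is a tiny sketch of the two aggregation choices; the dimensions are made up, `W_v` stands for the $n+1$ per-item embeddings and `W_e` for the projection weights, both random placeholders (and for simplicity all $n+1$ embeddings are concatenated in the first variant).

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 3, 8                        # n types of side information, embedding dimension
W_v = rng.normal(size=(n + 1, dim))  # W_v[0] is the item embedding, W_v[1:] the side info

# Hotel2Vec-style: concatenate, then project through a dense layer with ReLU.
W_e = rng.normal(size=((n + 1) * dim, dim))   # projection weights (illustrative shape)
H_concat = np.maximum(0.0, W_v.reshape(-1) @ W_e)

# GES-style: simple average pooling over the n + 1 embeddings.
H_avg = W_v.mean(axis=0)

print(H_concat.shape, H_avg.shape)   # both (dim,)
```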

Enhanced Graph Embedding with Side Information (EGES)

Despite the performance gain of GES, the assumption behind average pooling, that each embedding contributes equally to the final embedding, doesn’t reflect reality. There is an example from the original paper:

For example, a user who has bought an iPhone tends to view MacBook or iPad because of the brand “Apple,” while a user may buy clothes of different brands in the same shop on Taobao for convenience and lower prices.

To address this issue, EGES was proposed. Its key idea is that different types of side information contribute differently when creating the final embedding. Below is the model structure:

[Image: model architecture of EGES]

Compared to GES, the only difference is that a weight is assigned to each embedding when creating the final embedding. So the formula of the final embedding is

$$H_v = \frac{\sum_{s=0}^{n} e^{a_v^s} W_v^s}{\sum_{s=0}^{n} e^{a_v^s}}$$

where

  1. $e^{a_v^s}$ is used instead of $a_v^s$ to make sure the contribution of each embedding is greater than 0
  2. $\sum_{s=0}^{n} e^{a_v^s}$ is used to normalize the weights
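
In other words, the weights are a softmax over learnable per-item coefficients $a_v^s$. A minimal sketch with random placeholders for the embeddings and coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 3, 8
W_v = rng.normal(size=(n + 1, dim))   # item embedding + n side-information embeddings
a_v = rng.normal(size=n + 1)          # learnable weight per embedding for this item

# EGES aggregation: softmax over a_v, then a weighted sum of the embeddings.
weights = np.exp(a_v) / np.exp(a_v).sum()
H_v = weights @ W_v                   # shape (dim,)

print(weights.round(3), H_v.shape)
```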

Performance

As mentioned before, the application scenario is Item2Item candidate generation in the matching stage, which is followed by a deep-learning powered ranking stage. An online test was conducted to measure the performance, with the CTR of the recommendation engine as the key metric. Below is the performance:

[Image: online CTR comparison of the embedding methods]

PinSage from Pinterest (2018)

Business & Challenges

Pinterest is a visual catalog with several billion pins, which are visual bookmarks containing a description, a link, and an image or a video. Users at Pinterest view pins and curate them into collections called boards. This way, a single pin can be saved by thousands of users into tens of thousands of different boards.

Pinterest can be viewed as a bipartite graph of 7 billion pins and boards, with over 100 billion edges.

[Image: the pin-board bipartite graph at Pinterest]

Related-pin recommendation is one of the recommendation scenarios at Pinterest, and embedding-based Item2Item recommendation is a good solution. However, collaborative-filtering based embedding and random-walk based embedding are not applicable in Pinterest’s scenario:

  1. These embeddings are trained on the whole graph, and when new data are added the embeddings have to be re-trained. At Pinterest, new pins are uploaded every second, which makes it infeasible to re-train these embeddings all the time
  2. With these methods, every item ends up with its own embedding, so the billions of pins on Pinterest translate into significant storage overhead for these embeddings.

Based on the GraphSAGE algorithm, PinSage was developed to address the aforementioned challenges. In the following sections, we’ll first explain GraphSAGE and then delve into its adaptations in PinSage.

GraphSAGE

From GCN to GraphSAGE

Graph Convolutional Network (GCN) is a type of graph neural network, which borrows the core idea of Convolutional Neural Network (CNN) and adapts it to work on graph data:

  1. CNNs are mostly used on image data to learn local patterns with the sliding-window convolution operation, and GCN adopts a similar convolution operation to aggregate feature information from a node’s local graph neighborhood.
  2. Just like CNNs learn filters that activate on certain patterns in images, GCNs learn a function/filter that extracts patterns from a node's neighborhood in the graph

GCN has been proven to be an effective way to leverage both content information and graph structure. However, applying it to industries with huge graph data has one main drawback: GCN is trained on the entire graph and aggregates information from all neighbors of a node. This makes it infeasible to apply GCN to a graph with billions of nodes whose structure is constantly evolving.

Then GraphSAGE was invented to solve the above problem. The key idea is to learn a function that aggregates feature information from a node’s local neighborhood, then use this function together with the local neighborhood information to create the embedding.

Compared to a normal GCN, the main differences are:

  1. Instead of aggregating information from all neighbors, GraphSAGE uses a sampling strategy to sample a fixed number of neighbors for each node and aggregates their information
  2. GraphSAGE is designed as an inductive framework which can generalize to unseen nodes in large or dynamic graphs. Instead of learning a distinct embedding for each node, an aggregation function is learnt which maps features to embeddings.

Detailed Explanation

The name GraphSAGE (Graph SAmple and aggreGatE) indicates that there are two main components:

  1. Sample: a method to sample a fixed number of neighbors for a node
  2. Aggregate: a method to aggregate feature information from the neighbors
[Image: overview of the sample-and-aggregate approach in GraphSAGE]

To understand GraphSAGE, let’s try to answer three key questions in this section:

  1. How to sample the neighbors of a node?
  2. Assuming GraphSAGE has been trained and the model parameters are fixed, how to get the node embedding?
  3. How to train GraphSAGE?

How to sample the neighbors of a node?

For node A in the image below, its neighbors are

[Image: example graph around node A]
  • 1st-hop neighbor: B, C, D
  • 2nd-hop neighbor: E, F

For a large graph, a node may have too many neighbors, which slows down computation. So a sampling strategy is applied to select a fixed number of neighbors for each node; GraphSAGE uses random sampling.

If we set the number of sampled neighbors to 1, then for node A the sampled neighbors might be

  • 1st-hop neighbor: C
  • 2nd-hop neighbor: F
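
A minimal sketch of this fixed-size random sampling; the adjacency list and sample size are made up, and sampling with replacement for low-degree nodes is just one common choice.

```python
import random

# Hypothetical adjacency list of an undirected graph.
graph = {"A": ["B", "C", "D"], "B": ["A", "E"], "C": ["A", "F"], "D": ["A"],
         "E": ["B"], "F": ["C"]}

def sample_neighbors(node, num_samples):
    """Sample up to num_samples neighbors uniformly at random
    (with replacement if the node has fewer neighbors than requested)."""
    neighbors = graph[node]
    if len(neighbors) >= num_samples:
        return random.sample(neighbors, num_samples)
    return random.choices(neighbors, k=num_samples)

# 1st-hop sample for A, then a 2nd-hop sample for each sampled 1st-hop neighbor.
first_hop = sample_neighbors("A", 1)
second_hop = [sample_neighbors(n, 1) for n in first_hop]
print(first_hop, second_hop)
```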

How to get the node embedding (forward propagation)?

The intuition is that:

  1. There are several convolutional layers (or depths), and each layer aggregates information from the local neighbors
  2. As this process iterates, nodes incrementally gain more and more information from further reaches of the graph

Assume the number of layers is $K$ and the input feature of a node is $x$. The formula below shows how to generate the embedding for node A, namely $z_A$:

[Image: GraphSAGE forward propagation formulas]

There are different types of aggregators, such as the mean aggregator, the LSTM aggregator and the pooling aggregator.
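
For reference, the forward pass in the original GraphSAGE paper can be sketched as follows, where $N(v)$ is the sampled neighborhood of node $v$, $\sigma$ a non-linearity and $W^k$ the layer-$k$ weights (the paper additionally $L_2$-normalizes $h_v^k$ after each layer):

$$
\begin{aligned}
h_v^0 &= x_v \\
h_{N(v)}^k &= \mathrm{AGGREGATE}_k\big(\{h_u^{k-1} : u \in N(v)\}\big) \\
h_v^k &= \sigma\big(W^k \cdot \mathrm{CONCAT}(h_v^{k-1}, h_{N(v)}^k)\big) \\
z_v &= h_v^K
\end{aligned}
$$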

How to train GraphSAGE (backward propagation)?

To train the weights $W^k$ and $B^k$, an unsupervised approach is adopted, which is similar to that of DeepWalk.

  1. If two nodes co-occur within a fixed-length random walk, they are a positive pair; otherwise they are a negative pair
  2. The loss is a negative-sampling loss. The loss function encourages nearby nodes to have similar representations, while enforcing that the representations of disparate nodes are highly distinct
[Image: unsupervised negative-sampling loss of GraphSAGE]
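
For reference, the unsupervised loss in the GraphSAGE paper has roughly this form, where node $v$ co-occurs with node $u$ on a random walk, $P_n$ is the negative-sampling distribution and $Q$ the number of negative samples:

$$
J(z_u) = -\log\big(\sigma(z_u^\top z_v)\big) - Q \cdot \mathbb{E}_{v_n \sim P_n(v)} \log\big(\sigma(-z_u^\top z_{v_n})\big)
$$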

PinSage

Although based on GraphSAGE, PinSage introduced several changes to handle the massive scale of Pinterest’s recommendation graph effectively. We’ll mainly focus on these changes in the following section.

Features (Side Information) of Nodes

In terms of the input feature $x$ of a node, which will also be the embedding at layer 0, $h^0$, PinSage uses the following features:

  1. Visual information: visual embedding generated with deep learning
  2. Textual annotations (title, description): annotation embedding generated with deep learning
  3. Graph structural information: log degree of the node/pin in the graph
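
A sketch of how such an input feature vector might be assembled per pin; the dimensions and the `visual_emb` / `annotation_emb` placeholders are made up, and in PinSage these embeddings come from separate deep models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholders for the outputs of the upstream deep models.
visual_emb = rng.normal(size=4096)      # e.g. an image-model embedding
annotation_emb = rng.normal(size=256)   # e.g. a text embedding of title/description
degree = 37                             # number of edges of this pin in the graph

# Layer-0 node feature: concatenation of the embeddings plus the log degree.
x = np.concatenate([visual_emb, annotation_emb, [np.log(degree)]])
print(x.shape)   # (4096 + 256 + 1,)
```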

Random-walk Based Neighbor Sampling

Due to the large scale of the graph at Pinterest, every pin has plenty of neighbors, and the random sampling strategy from GraphSAGE might lose important information. To solve this issue, importance-based neighbor sampling is a better solution.

For a node $u$, instead of selecting $T$ items randomly from its immediate neighbors, the following steps are used (see the sketch after this list):

  1. Simulate several random walks starting from node $u$
  2. Calculate the $L_1$-normalized visit count of the nodes visited by these random walks
  3. The top $T$ nodes with the highest normalized visit counts are defined as the neighbors of node $u$

(This random-walk method is actually a previous work from Pinterest called Pixie. For more details, please refer to Pixie: A System for Recommending 3+ Billion Items to 200+ Million Users in Real-Time.)
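
A minimal sketch of this visit-count based neighbor selection on a toy adjacency list; the real Pixie walks are biased and run on the full pin-board graph with early stopping.

```python
import random
from collections import Counter

# Toy undirected adjacency list standing in for the pin-board graph.
graph = {"u": ["a", "b", "c"], "a": ["u", "b"], "b": ["u", "a", "c"],
         "c": ["u", "b", "d"], "d": ["c"]}

def important_neighbors(start, top_t, num_walks=200, walk_len=5):
    """Run short random walks from `start`, L1-normalize the visit counts,
    and return the top_t most visited nodes with their weights."""
    counts = Counter()
    for _ in range(num_walks):
        node = start
        for _ in range(walk_len):
            node = random.choice(graph[node])
            if node != start:
                counts[node] += 1
    total = sum(counts.values())
    normalized = {n: c / total for n, c in counts.items()}
    top = sorted(normalized, key=normalized.get, reverse=True)[:top_t]
    return [(n, normalized[n]) for n in top]

print(important_neighbors("u", top_t=2))
```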

Importance Based Neighborhood Aggregator

Since importance is considered while sampling neighbors, it’s better to also consider importance while aggregating neighborhood information. The normalized visit counts from the random-walk based neighbor sampling step are used as weights for each neighbor, turning the aggregation into an importance-based pooling method.

[Image: importance-based neighborhood aggregation]
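
In other words, the neighborhood representation becomes a weighted average of the neighbor embeddings, with the normalized visit counts $\alpha_u$ as weights; roughly:

$$
h_{N(v)} = \sum_{u \in N(v)} \alpha_u \, h_u, \qquad \sum_{u \in N(v)} \alpha_u = 1
$$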

Max-margin Ranking Loss

The model is trained in a supervised fashion, and a max-margin ranking loss is applied.

Assume $z_q$ and $z_i$ represent the embeddings of the query item and a positive sample, and $P_n(q)$ denotes the distribution of negative samples for query item $q$:

[Image: max-margin ranking loss]
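
For reference, the max-margin ranking loss in the PinSage paper has roughly this form, where $\Delta$ denotes the margin hyper-parameter:

$$
J(z_q, z_i) = \mathbb{E}_{n_k \sim P_n(q)} \max\big(0,\; z_q \cdot z_{n_k} - z_q \cdot z_i + \Delta\big)
$$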

Sampling Negative Samples

As mentioned above, negative sampling is used in the loss function. For PinSage, there are two changes compared to the traditional sampling strategy.

First, normally a fixed number of negative samples would be sampled for each node. To improve efficiency when training with large batch sizes, 500 negative samples are sampled and shared by all training examples in each minibatch. Empirically, no performance difference is observed compared to the traditional approach.

Second, there are 2 billion items at Pinterest. Sampling 500 items from 2 billion means that these negative samples are, most of the time, completely different from the query item. It’s then too easy for the model to learn this difference, and the model can’t cope well when negative samples are similar to the query item. To solve this problem, hard negative samples are introduced: negatives that are somewhat related to the query item but are still not positive samples. They are generated by ranking items in the graph according to their Personalized PageRank scores with respect to the query item; items ranked 2000-5000 are randomly sampled as hard negative items.
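
A sketch of how such hard negatives could be drawn; `ranked_items` is assumed to already be sorted by Personalized PageRank score with respect to the query (that computation is outside this snippet).

```python
import random

def sample_hard_negatives(ranked_items, num_hard):
    """Pick hard negatives: items related to the query (high PPR rank)
    but outside its very top ranks. `ranked_items` is sorted by PPR score."""
    pool = ranked_items[2000:5000]   # items ranked 2000-5000 w.r.t. the query
    return random.sample(pool, num_hard)

# Hypothetical usage: the ranking would come from a PPR computation on the graph.
ranked_items = [f"pin_{i}" for i in range(10_000)]
print(sample_hard_negatives(ranked_items, num_hard=6))
```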

(Actually, the sampling strategy for negative samples is an important topic for candidate generation; we’ll talk more about it in future blogs.)

Curriculum Training

Including hard negative samples helps the model learn better, but it also makes the model learn more slowly: it doubles the number of epochs needed for the model to converge. To solve this issue, a curriculum training strategy was adopted, whose basic idea is to train without hard negatives first and then gradually add hard negative samples (see the sketch after this list).

  1. In the first epoch of training, no hard negative samples are used, so the model quickly finds an area of the parameter space where the loss is relatively low
  2. In subsequent epochs, hard negative samples are added for each query item, which helps the model gradually learn how to distinguish highly related pins from only slightly related ones
  3. Put differently: at epoch $n$ of training, $n-1$ hard negative items are added to the set of negative items for each item.
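
A sketch of the per-epoch negative set under this curriculum; the shared random negatives and hard-negative pool are toy placeholders.

```python
import random

def negatives_for_epoch(epoch, shared_random_negatives, hard_negative_pool):
    """Curriculum schedule: at epoch n, add n - 1 hard negatives per query item
    on top of the shared random negatives (epoch numbering starts at 1)."""
    num_hard = max(0, epoch - 1)
    hard = random.sample(hard_negative_pool, num_hard)
    return shared_random_negatives + hard

shared = [f"rand_{i}" for i in range(5)]        # stands in for the 500 shared negatives
hard_pool = [f"hard_{i}" for i in range(100)]   # stands in for PPR ranks 2000-5000
for epoch in (1, 2, 3):
    print(epoch, len(negatives_for_epoch(epoch, shared, hard_pool)))
```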

Performance

An online test was conducted; here is the performance:

  1. It was found that PinSage consistently recommends pins that are more likely to be re-pinned by the user than the alternative methods.
  2. Depending on the particular setting, 10-30% improvements in repin rate over the annotation- and visual-embedding based recommendations have been observed.

Final Thoughts

Although the above two works come from different industries and are based on different techniques, we do observe some common methodologies, for example:

  1. Including more side information to improve the embedding and address the cold-start problem.
  2. Importance-based embedding aggregation.

Reference