- Introduction
- What is Position Bias?
- Position is not the Only One
- Assumption & Approach
- Problem Definition
- Simplify the Problem Definition
- Two High Level Approaches
- Digging into the Second Approach
- Method 1: Counterfactual Learning-to-rank with Inverse Propensity Scoring (IPS)
- Method 2: Features Representing Debiased Conversion Performance
- Method 3: Learn and Remove Position Bias with Deep Learning
- Basic Idea
- Why Don’t Traditional Machine Learning Models Work?
- How Does Deep Learning Solve It?
- Industry Practices
- Regularization for Bias Tower in Deep Learning
- Problems & Solution
- Then, What Are the Optimal Parameters for Regularization?
- Reference
Introduction
What is Position Bias?
When checking data from a recommendation system, we can easily find that items in better positions tend to have higher conversion rates, which is mainly caused by two factors:
- Items in better positions are more relevant to users’ preferences
- The influence of position itself: users believe items in better positions are more relevant, and items in better positions have a higher chance of being examined
The collected data reflects the combined effect of these two factors, but for analysis and model development, what we really care about is the first one. The second becomes a bias (position bias) that harms our decision-making. So it’s critical to remove this bias for analysis and model training.
Position is not the Only One
Position is not the only factor that causes position bias. Ultimately, the magnitude of position bias is determined by the user interface and layout. Besides the position feature itself, the operating system, the device, and how position is perceived also influence it.
Assumption & Approach
Problem Definition
From my previous blog about the Evolution of Recommendation Models, the recommendation problem is defined as: given context $c$, user $u$, and item $i$, we need to estimate $P(\text{conversion} \mid c, u, i)$.
Now, due to the position bias introduced by position $p$, the problem becomes estimating $P(\text{conversion} \mid c, u, i, p)$.
The problem is: the data reflects $P(\text{conversion} \mid c, u, i, p)$; however, what we want is $P(\text{conversion} \mid c, u, i)$.
Simplify the Problem Definition
$P(\text{conversion} \mid c, u, i, p)$ can be decomposed into two factors:
- $P(\text{relevance} \mid c, u, i)$: the probability that the item is relevant to the user
- $P(\text{examined} \mid p)$: the probability that the item is examined by the user given that it’s at position $p$; this probability depends only on position-related features
So we have $P(\text{conversion} \mid c, u, i, p) = P(\text{relevance} \mid c, u, i) \cdot P(\text{examined} \mid p)$.
What we want is $P(\text{relevance} \mid c, u, i)$.
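To make the goal explicit, here is the factorization written out as a display equation (notation follows the definitions above):

$$
P(\text{conversion} \mid c, u, i, p) = \underbrace{P(\text{relevance} \mid c, u, i)}_{\text{what we want}} \cdot \underbrace{P(\text{examined} \mid p)}_{\text{position bias}}
$$

Note that for any fixed position $p$, the second factor is a constant, so ranking by the left-hand side is then equivalent to ranking by relevance; Method 3 below exploits exactly this.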
Two High Level Approaches
There are two high level approaches:
- Change the data collection step: sort results randomly, so that overall every item has the same probability of being examined
- Change the data interpretation step: find a way to interpret the data that minimizes the effect of the bias
Both approaches are used in industry, but approach 1 hurts ranking performance and user experience, and it’s also more expensive. So approach 2 is more popular, and we will mainly focus on it in the rest of this blog.
Digging into the Second Approach
We’ll look at three methods; currently the third one is the most widely used in industry.
Method 1: Counterfactual Learning-to-rank with Inverse Propensity Scoring (IPS)
In the domain of causal inference, there is a term called the propensity score: the conditional probability of receiving a treatment. It’s commonly used in observational studies where random assignment to treatment groups is not possible.
In the recommendation setting, items are actually treatments. We want to know the outcome when users receive these treatments. Based on the problem definition above, $P(\text{examined} \mid p)$ is actually the propensity score.
Then here is the solution from causal inference (a sketch follows the steps):
- Step 1: Estimate the propensity score: train a propensity model to predict $P(\text{examined} = 1 \mid p)$
- Step 2: Inverse Propensity Scoring: weight each training sample with the inverse of its predicted propensity
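Here is a minimal Python sketch of these two steps. Everything in it is illustrative: the per-position examination rates stand in for a real propensity model (in practice they come from result-randomization experiments or a dedicated model), and the features and labels are random placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# --- Hypothetical data: each row is an impression ---
rng = np.random.default_rng(0)
positions = rng.integers(0, 10, size=10_000)    # rank shown (0 = top)
features = rng.normal(size=(10_000, 5))         # relevance features
clicked = rng.integers(0, 2, size=10_000)       # observed labels

# Step 1: estimate the propensity P(examined = 1 | p).
# Here: a plug-in estimate of examination rate per position,
# assumed to come from randomized traffic.
examined_rate_by_pos = np.array([0.95, 0.80, 0.65, 0.55, 0.45,
                                 0.38, 0.32, 0.27, 0.23, 0.20])
propensity = examined_rate_by_pos[positions]

# Step 2: Inverse Propensity Scoring — weight each sample by 1/propensity,
# so clicks at rarely-examined positions count more, undoing the bias.
ips_weights = 1.0 / np.clip(propensity, 0.05, 1.0)  # clip to bound variance

model = LogisticRegression()
model.fit(features, clicked, sample_weight=ips_weights)
```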
Method 2: Features Representing Debiased Conversion Performance
For example, Clicks Over Expected Clicks (COEC), a feature that represents debiased conversion performance; see the sketch below. Such features are mainly used in traditional machine learning models.
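As a rough sketch of how COEC can be computed from an impression log (the toy data and column names here are made up):

```python
import pandas as pd

# Hypothetical impression log: one row per impression of an item.
log = pd.DataFrame({
    "item":     ["a", "a", "b", "b", "b", "c"],
    "position": [ 1,   3,   1,   2,   3,   2 ],
    "clicked":  [ 1,   0,   1,   1,   0,   0 ],
})

# Positional baseline: average CTR at each position across all items.
pos_ctr = log.groupby("position")["clicked"].mean()

# Expected clicks per impression = baseline CTR of the shown position.
log["expected_click"] = log["position"].map(pos_ctr)

# COEC per item = actual clicks / expected clicks. A value > 1 means the
# item converts better than a typical item shown at the same positions,
# i.e., a position-debiased performance signal.
agg = log.groupby("item").agg(clicks=("clicked", "sum"),
                              expected=("expected_click", "sum"))
agg["coec"] = agg["clicks"] / agg["expected"]
print(agg["coec"])
```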
Method 3: Learn and Remove Position Bias with Deep Learning
Basic Idea
Let’s go back to the problem definition: $P(\text{conversion} \mid c, u, i, p) = P(\text{relevance} \mid c, u, i) \cdot P(\text{examined} \mid p)$.
The idea: if we feed the position feature into the model, it can learn $P(\text{relevance} \mid c, u, i)$ and $P(\text{examined} \mid p)$ independently. Then we can remove position bias by
- Only using $P(\text{relevance} \mid c, u, i)$ to predict relevance
- Or setting position to a fixed value, so that $P(\text{examined} \mid p)$ becomes a constant; the prediction then depends only on $P(\text{relevance} \mid c, u, i)$
Why Don’t Traditional Machine Learning Models Work?
Long story short: it’s hard to make sure $P(\text{relevance} \mid c, u, i)$ is independent of position features. Take tree-based models as an example: there are so many interactions between all features (including the position feature) within the model that $P(\text{relevance})$ becomes dependent on position features.
(Technically, this goal can be achieved with tree-based models, but it’s not that easy; please check the Personalized Click Prediction in Sponsored Search paper from Yahoo, where the position feature is treated as leaf nodes of the last tree of a GBDT model. Logistic regression can also achieve this goal, since every variable contributes independently to the final result. But LR is actually a one-layer deep learning model, right? 😀)
How Does Deep Learning Solve It?
The architecture consists of two independent towers (see the sketch below):
- The relevance tower learns $P(\text{relevance} \mid c, u, i)$ from user, item, and context features
- The bias tower learns $P(\text{examined} \mid p)$ from position-related features
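A minimal PyTorch sketch of this architecture, assuming conversion is a binary label; the layer sizes, feature split, and additive-logit combination are illustrative choices, not a specific company’s implementation:

```python
import torch
import torch.nn as nn

class TwoTowerDebiasModel(nn.Module):
    """Relevance tower + bias tower; logits are added so each tower
    contributes an independent, additive term."""

    def __init__(self, n_relevance_feats: int, n_position_feats: int):
        super().__init__()
        # Relevance tower: user / item / context features only.
        self.relevance_tower = nn.Sequential(
            nn.Linear(n_relevance_feats, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )
        # Bias tower: position-related features only.
        self.bias_tower = nn.Sequential(
            nn.Linear(n_position_feats, 16), nn.ReLU(),
            nn.Linear(16, 1),
        )

    def forward(self, relevance_feats, position_feats):
        # Adding logits keeps the towers independent, mirroring
        # P(conversion) = P(relevance) * P(examined) in log-odds space.
        return self.relevance_tower(relevance_feats) + self.bias_tower(position_feats)

    def predict_relevance(self, relevance_feats):
        # Serving time: drop the bias tower entirely (equivalent to
        # fixing position so P(examined | p) is a constant).
        return torch.sigmoid(self.relevance_tower(relevance_feats))

# Training uses both towers; serving uses only the relevance tower.
model = TwoTowerDebiasModel(n_relevance_feats=32, n_position_feats=4)
loss_fn = nn.BCEWithLogitsLoss()
x_rel, x_pos = torch.randn(8, 32), torch.randn(8, 4)
labels = torch.randint(0, 2, (8, 1)).float()
loss = loss_fn(model(x_rel, x_pos), labels)
```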
Industry Practices
Below are some deep learning practices from industry.
Regularization for Bias Tower in Deep Learning
Problems & Solution
There are two problems if we train without regularization:
- The model can rely too much on the bias tower
- The bias tower can learn relevance, which ideally should be learned by the relevance tower
In both cases, the bias tower becomes more powerful and the relevance tower weaker. However, the bias tower is completely ignored when making predictions, so a weak relevance tower is not what we want.
So regularization techniques are needed (a sketch follows the list):
- Dropout at the end of the bias tower
- L1 regularization on the weights of the bias tower
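A minimal sketch of both techniques in PyTorch; the dropout rate and L1 strength below are placeholder values to be tuned (see the next section):

```python
import torch
import torch.nn as nn

class RegularizedBiasTower(nn.Module):
    def __init__(self, n_position_feats: int, drop_rate: float = 0.15):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_position_feats, 16), nn.ReLU(),
            nn.Linear(16, 1),
        )
        # Dropout at the end of the tower: with probability `drop_rate`
        # the position signal disappears for a training example, which
        # stops the model from relying too much on the bias tower.
        self.dropout = nn.Dropout(p=drop_rate)

    def forward(self, position_feats):
        return self.dropout(self.net(position_feats))

def l1_penalty(bias_tower: nn.Module, lam: float = 1e-4) -> torch.Tensor:
    # L1 regularization on the bias tower's weights only, added to the
    # training loss; it pushes the tower toward a small, simple function
    # so it cannot absorb relevance signal.
    return lam * sum(p.abs().sum() for p in bias_tower.parameters())

# Usage during training:
#   loss = bce_loss + l1_penalty(model.bias_tower)
```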
Then, What Are the Optimal Parameters for Regularization?
We want to achieve two opposing goals:
- Maximizing the learning of the bias tower: the more it learns, the more bias can be removed
- Maximizing the learning of the relevance tower: the more it learns, the more accurate the relevance prediction will be
There is a detailed example from Airbnb, where NDCG is the evaluation metric (a code sketch follows the steps):
- Set the position feature to a fixed value of 0 and calculate the metric. Since the bias tower is ignored, we get the relevance-only metric $\text{NDCG}_{\text{rel}}$
- Keep the original position feature and calculate the metric. We get $\text{NDCG}_{\text{full}}$
- By subtracting the two metrics, we get $\Delta\text{NDCG} = \text{NDCG}_{\text{full}} - \text{NDCG}_{\text{rel}}$, which measures how much the model relies on position bias
- Plot the relationship curve between $\Delta\text{NDCG}$ and $\text{NDCG}_{\text{rel}}$ as the dropout rate varies
- Find a point where $\Delta\text{NDCG}$ is sufficiently high without causing too large a drop in $\text{NDCG}_{\text{rel}}$, and use the dropout rate at that point
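A toy sketch of this evaluation, using plain NDCG on a single query; the scores, labels, and the $\text{NDCG}_{\text{rel}}$ / $\text{NDCG}_{\text{full}}$ naming are illustrative:

```python
import torch

def ndcg_at_k(scores: torch.Tensor, labels: torch.Tensor, k: int = 10) -> float:
    """Plain NDCG@k for one query's candidate list."""
    order = torch.argsort(scores, descending=True)[:k]
    gains = labels[order].float()
    discounts = 1.0 / torch.log2(torch.arange(2, len(order) + 2).float())
    dcg = (gains * discounts).sum()
    ideal = torch.sort(labels.float(), descending=True).values[:k]
    idcg = (ideal * discounts[: len(ideal)]).sum()
    return (dcg / idcg).item() if idcg > 0 else 0.0

# Toy example for one query with five candidates:
labels = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0])
scores_rel  = torch.tensor([0.9, 0.2, 0.1, 0.7, 0.3])  # position fixed to 0
scores_full = torch.tensor([0.8, 0.5, 0.1, 0.6, 0.3])  # original position kept

ndcg_rel  = ndcg_at_k(scores_rel, labels)   # relevance-only NDCG
ndcg_full = ndcg_at_k(scores_full, labels)  # NDCG with bias tower active
delta_ndcg = ndcg_full - ndcg_rel           # how much the model leans on bias

# Sweep the dropout rate, record (delta_ndcg, ndcg_rel) pairs, plot the
# curve, and pick the rate where delta_ndcg is high enough without a
# large drop in ndcg_rel.
```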