Remove Position Bias in Recommendation System

Tags
Data Science
Date Published
February 25, 2024

Introduction

What is Position Bias?

When examining data from a recommendation system, we can easily see that items in better positions tend to have higher conversion rates. This is mainly caused by two factors:

  1. Items in better positions are more relevant to users’ preferences
  2. The influence of position itself: users believe items in better positions are more relevant, and items in better positions are more likely to be examined

The collected data reflects the combined effect of these two factors, but for analysis and model development, what we really care about is the first one. The second becomes a bias (position bias) that harms our decision-making. So it’s critical to remove this bias before analysis and model training.

CVR-position relationship from Uber Eats traffic where a random recommendation engine was applied

Position is not the Only One

The position feature is not the only thing that causes position bias. Ultimately, the magnitude of position bias is determined by the user interface and layout. Besides the position feature, the operating system, the device, and how position is perceived also influence it.

Example from Uber Eats on random traffic

Assumption & Approach

Problem Definition

From my previous blog about the Evolution of Recommendation Models, the recommendation problem is defined as: given context c, user u, and item i, we need to estimate

P(relevance=1) = P(conversion=1) = f(u, i, c)

Now, due to the position bias introduced by position p, the problem becomes:

P(conversion=1) = f(p, P(relevance=1)) = f(u, i, c, p)

The problem is: the data reflects P(conversion=1); however, what we want is P(relevance=1).

Simplify the Problem Definition

P(conversion=1) can be decomposed into two factors:

  1. The probability that the item is relevant to the user
  2. The probability that the item is examined by the user given it’s at position p; this probability depends only on position-related features

So, we have

P(conversion=1) = P(relevance=1 | i, c, u) * P(examined=1 | p)

What we want is P(relevance=1 | i, c, u).
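To make this decomposition concrete, here is a tiny numeric sketch. All numbers (the per-position examination probabilities and the item’s relevance) are made up for illustration: the same item, with fixed relevance, shows very different observed conversion rates depending on where it is placed.

```python
import numpy as np

# Hypothetical examination probabilities by position (position 0 is the top
# slot); these numbers are invented for illustration, not from real traffic.
p_examined = np.array([0.9, 0.6, 0.4, 0.25, 0.15])

# Suppose one item has true relevance P(relevance=1 | i, c, u) = 0.2,
# regardless of where it is shown.
p_relevance = 0.2

# Observed conversion rate at each position, under the decomposition
# P(conversion=1) = P(relevance=1 | i, c, u) * P(examined=1 | p)
p_conversion = p_relevance * p_examined
print(p_conversion)  # the same item "looks" 6x better at position 0 than at position 4
```

Equally relevant items can thus differ six-fold in logged conversion rate purely because of where they were shown, which is exactly the effect we want to factor out.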

Two High-Level Approaches

There are two high-level approaches:

  1. Change the data collection step: sort results randomly, so that overall every item has the same probability of being examined
  2. Change the data interpretation step: find a way to interpret the data that minimizes the bias effect

Both approaches are used in industry, but approach 1 hurts ranking performance and user experience, and it’s also more expensive. So approach 2 is more popular, and we will mainly focus on it in the rest of the blog.

Digging into the Second Approach

We’ll look at three methods; currently the third is the most widely used in industry.

Method 1: Counterfactual Learning-to-Rank with Inverse Propensity Scoring (IPS)

In the domain of causal inference, there is a term called propensity score, which means the conditional probability of receiving a treatment. It’s commonly used in observational studies where random assignment to treatment groups is not possible.

In the recommendation setting, every item is actually a treatment. We want to know the outcome when users receive these treatments. Based on the problem definition below, P(examined=1 | p) is actually the propensity score:

P(conversion=1) = P(relevance=1 | i, c, u) * P(examined=1 | p)

Then here is the solution from causal inference:

  1. Step 1: Estimate the propensity score: train a propensity model to predict P(examined=1 | p)
  2. Step 2: Inverse Propensity Scoring: weight each training sample with the inverse of its predicted propensity
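The two steps above can be sketched as follows. This is a minimal illustration, not a production implementation: the per-position propensities and the clipping threshold are made-up values standing in for the output of a trained propensity model.

```python
import numpy as np

# Step 1 (assumed done): a propensity model has produced P(examined=1 | p)
# for each position; here we just use fixed, invented per-position estimates.
propensity_by_position = {0: 0.9, 1: 0.6, 2: 0.4, 3: 0.25}

# Logged training samples: (position shown at, converted or not)
samples = [(0, 1), (1, 0), (2, 1), (3, 0)]

# Step 2: weight each sample by the inverse of its propensity, so that
# conversions at rarely-examined positions count more in the training loss.
weights = np.array([1.0 / propensity_by_position[pos] for pos, _ in samples])

# In practice propensities are often clipped (capped weights) to avoid huge
# variance from near-zero examination probabilities; 10.0 is an arbitrary cap.
weights = np.clip(weights, None, 10.0)
print(weights)
```

These weights would then be passed as per-sample weights to whatever loss the ranking model optimizes.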

Method 2: Features Representing Debiased Conversion Performance

For example, Click Over Expected Click (COEC), which represents debiased conversion performance. These methods are mainly used with traditional machine learning models.

Click Over Expected Click (COEC) feature
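A minimal sketch of how a COEC feature might be computed: an item’s actual clicks divided by the clicks it would be "expected" to get if it converted at the position-average rate everywhere it was shown. The position CTRs, impressions, and clicks below are invented numbers for illustration.

```python
import numpy as np

# Global CTR per position, estimated over all traffic (invented values).
global_ctr = np.array([0.10, 0.05, 0.02])

# One item's impressions and clicks broken down by position.
impressions = np.array([100, 200, 300])
clicks = np.array([12, 11, 7])

# Expected clicks: what the item would get if it clicked exactly at the
# position-average rate everywhere it was shown.
expected_clicks = (impressions * global_ctr).sum()  # 100*0.10 + 200*0.05 + 300*0.02 = 26
coec = clicks.sum() / expected_clicks               # 30 / 26

# COEC > 1 means the item outperforms the positional average, so the
# feature already nets out the position effect.
print(round(coec, 3))
```

The resulting ratio can be fed as a feature into a traditional model (e.g. GBDT) in place of a raw historical CTR.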

Method 3: Learn and Remove Position Bias with Deep Learning

Basic Idea

Let’s go back to the problem definition:

P(conversion=1) = P(relevance=1 | i, c, u) * P(examined=1 | p)

The idea is to feed the position feature into the model in a way that lets it learn P(relevance=1 | i, c, u) and P(examined=1 | p) independently. Then we can remove position bias by:

  1. Only using P(relevance=1 | i, c, u) to predict relevance
  2. Or setting position to a fixed number, so that P(examined=1 | p) becomes a constant; then P(conversion=1) depends only on P(relevance=1 | i, c, u)

Why Doesn’t a Traditional Machine Learning Model Work?

Long story short: it’s hard to ensure P(relevance=1 | i, c, u) is independent of the position features. Take a tree-based model as an example: there are so many interactions between all features (including the position feature) within the model that P(relevance) becomes dependent on position features.

(Technically, this goal can be achieved with a tree-based model, but it’s not that easy; please check the Personalized Click Prediction in Sponsored Search paper from Yahoo, where the position feature is treated as leaf nodes of the last tree of a GBDT model. Logistic regression can also achieve this goal, since every variable contributes independently to the final result. But LR is actually a one-layer deep learning model, right? 😀)

How Does Deep Learning Solve It?

Architecture of two independent towers:

  1. The relevance tower learns P(relevance=1 | i, c, u)
  2. The bias tower learns P(examined=1 | p)
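Here is a toy forward pass of the two-tower idea in plain NumPy, with random, untrained weights. One common implementation choice, assumed in this sketch, is to combine the towers additively in logit space, which corresponds to the multiplicative decomposition of the probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def relevance_tower(x, W1, W2):
    # Small MLP over user/item/context features -> relevance logit
    return np.tanh(x @ W1) @ W2

def bias_tower(position, w_pos):
    # Logit of P(examined=1 | p), a function of position features only
    return w_pos * position

x = rng.normal(size=(4, 8))             # 4 samples, 8 relevance features
positions = np.array([0., 1., 2., 3.])  # position feature per sample
W1, W2 = rng.normal(size=(8, 16)), rng.normal(size=(16,))
w_pos = -0.7                            # lower positions -> lower examination logit

# Training-time prediction: both towers contribute; the model would be
# trained on conversion labels with backprop (omitted here).
p_train = sigmoid(relevance_tower(x, W1, W2) + bias_tower(positions, w_pos))

# Serving-time prediction: fix position to a constant (here 0), so the bias
# tower adds the same logit to every item and the ranking depends only on
# the relevance tower.
p_serve = sigmoid(relevance_tower(x, W1, W2) + bias_tower(np.zeros(4), w_pos))
print(p_serve)
```

Because the bias tower’s contribution is constant at serving time, the item ordering produced by `p_serve` is exactly the ordering of the relevance logits.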

Industry Practices

Below are deep learning practices from industry:

Huawei, 2019
Airbnb, 2020
Uber Eats, 2023
DoorDash, 2023
Google, 2023

Regularization for the Bias Tower in Deep Learning

Problems & Solution

Without regularization, there are two problems:

  1. The model can rely too much on the bias tower
  2. The bias tower can learn relevance, which ideally should be learned by the relevance tower

In both cases, the bias tower becomes more powerful and the relevance tower weaker. However, the bias tower is completely ignored at prediction time, so a weak relevance tower is not what we want.

So regularization techniques are needed:

  1. Dropout at the end of the bias tower
  2. L1 regularization on the weights of the bias tower

Then, What Are the Optimal Parameters for Regularization?

We want to achieve two opposing goals:

  1. Maximizing the learning of the bias tower: the more it learns, the more bias can be removed
  2. Maximizing the learning of the relevance tower: the more it learns, the more accurate the relevance prediction will be

There is a detailed example from Airbnb, where NDCG is the evaluation metric:

  1. Set the position feature to a fixed value 0 and calculate the metric. Since the position tower is ignored, we get NDCG_relevance
  2. Keep the original position feature and calculate the metric. We get NDCG_{relevance+bias}
  3. By subtracting the two metrics, we get NDCG_bias
  4. Plot the relationship curve between NDCG_relevance and NDCG_bias, as in the plot below
  5. Find a point where NDCG_bias is sufficiently high without causing too much drop in NDCG_relevance, then use the dropout rate of that point
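The metric computation in the steps above can be sketched as follows. The relevance labels and the two score vectors are invented purely to illustrate how NDCG_relevance and NDCG_bias would be derived; in the real procedure, both score vectors come from the same trained model evaluated with the position feature fixed vs. kept.

```python
import numpy as np

def dcg(relevances):
    # relevances given in ranked order
    return np.sum((2.0 ** relevances - 1) / np.log2(np.arange(2, len(relevances) + 2)))

def ndcg(true_rel, scores):
    ranked = true_rel[np.argsort(-scores)]   # labels sorted by descending score
    ideal = np.sort(true_rel)[::-1]          # best possible ordering
    return dcg(ranked) / dcg(ideal)

# Invented relevance labels and model scores for 4 items.
true_rel = np.array([3.0, 2.0, 0.0, 1.0])
scores_fixed_position = np.array([2.5, 1.1, 0.4, 1.3])  # position feature set to 0
scores_with_position = np.array([2.1, 1.7, 0.3, 0.9])   # original position kept

ndcg_relevance = ndcg(true_rel, scores_fixed_position)  # step 1
ndcg_rel_bias = ndcg(true_rel, scores_with_position)    # step 2
ndcg_bias = ndcg_rel_bias - ndcg_relevance              # step 3
print(ndcg_relevance, ndcg_bias)
```

Repeating this for a range of dropout rates and plotting (NDCG_relevance, NDCG_bias) pairs yields the trade-off curve used to pick the operating point in step 5.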

Reference