Amazon KDD Cup 2023 - Build Multilingual Recommendation System


Tags
Data Science
Date Published
June 10, 2023

Introduction

Knowledge Discovery and Data Mining (KDD) is a yearly conference. Before the conference, there is a data science competition, which ends before the conference starts. During the conference, there is a workshop where the winners share their solutions.

In my spare time, I spent some time on this year's competition, and the rank is not that bad: 17th out of 450 teams. In this doc, I'll share some of my solutions and learnings.

About this year’s competition

Overview

This year the DS competition was held by Amazon: Amazon KDD Cup 2023 - Build Multilingual Recommendation System. Amazon provided a shopping session dataset, and participants needed to finish three tasks based on this data. There is a leaderboard for each task, and the final rank is decided by combining these three leaderboards. Below are the 3 tasks:

  1. Task 1: predicting the next engaged product for sessions from English, German, and Japanese
  2. Task 2: predicting the next engaged product for sessions from French, Italian, and Spanish, where transfer learning techniques are encouraged
  3. Task 3: predicting the title for the next engaged product

Data Info

There are two categories of data

Shopping Session Data
Product Information Data

Result Expected & Evaluation Metric

| Task | Result Expected | Evaluation Metric | Example |
| --- | --- | --- | --- |
| Task 1 | For each session, the participant should predict 100 product IDs that are most likely to be engaged, based on the historical engagements in the session | Mean Reciprocal Rank (MRR) (the basic idea is similar to NDCG) | Task 1, 2 example |
| Task 2 | Same as Task 1 (for sessions from French, Italian, and Spanish) | Same as Task 1 | Task 1, 2 example |
| Task 3 | Predict the title of the next product that a customer will engage with, based on their session data | Bilingual Evaluation Understudy (BLEU) | Task 3 example |
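
For reference, MRR is simply the average of the reciprocal rank of the ground-truth item (this is the standard definition, not copied from the competition rules; the competition computes it over the top 100 predictions):

$$
\mathrm{MRR@100} = \frac{1}{|S|} \sum_{s \in S} \frac{1}{\mathrm{rank}_s}
$$

where $S$ is the set of sessions and $\mathrm{rank}_s$ is the position of the ground-truth next item within the 100 predictions for session $s$; the term is treated as 0 when the item is not in the list.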

Solution in Detail

The code is here and the final performance is as follows:

| Solution | Task 1 - MRR@100 | Task 2 - MRR@100 | Task 3 - BLEU |
| --- | --- | --- | --- |
| 1st place solution | 0.41188 | 0.46845 | 0.27152 |
| My solution | 0.35822 | 0.40144 | 0.26553 |

Overview

The solution is a two-stage recommendation pipeline

  1. Stage 1: multi-strategy candidate generation
    1. Goal: retrieve as much of the ground truth as possible
    2. Evaluation metrics: recall@20, recall@100, recall@all, MRR@100
  2. Stage 2: ranking
    1. Goal: rank the retrieved items as accurately as possible
    2. Evaluation metric: MRR@100
(image: pipeline overview)
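
Conceptually the two stages fit together as in the minimal sketch below (the function names and the `ranker.score` interface are placeholders for illustration, not the actual code from the repo):

```python
def recommend(session, candidate_generators, ranker, top_k=100):
    # Stage 1: every strategy proposes candidate items for the session,
    # and the union of all proposals becomes the candidate pool.
    candidates = set()
    for generate in candidate_generators:
        candidates |= set(generate(session))

    # Stage 2: the ranker scores each (session, candidate) pair and the
    # top 100 items are submitted for MRR@100 evaluation.
    ranked = sorted(candidates, key=lambda item: ranker.score(session, item), reverse=True)
    return ranked[:top_k]
```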

Items in the dashed squares were not included in the final solution because

  1. Computation was too slow for strategies 6 and 7
  2. Performance was not good for strategy 8
  3. The DNN ranker was not finished before the end of the competition

Data Split

The following data are used for each stage:

  1. Candidate generation
    1. Model training
      1. train data part 1
      2. train data part 2, excluding ground truth of each session
      3. eval data, excluding the ground truth of each session
      4. test data (we don’t have ground truth)
    2. Performance evaluation
      1. eval data
  2. Ranking
    1. Model training
      1. train data part 2
    2. Performance evaluation
      1. eval data
(image: data split diagram)
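
A note on "excluding ground truth": the label of a session is its next engaged item, so for train part 2 and eval that item is held out and the models only see the session prefix. A minimal sketch of that holdout, assuming a simple one-list-per-session layout (column names are illustrative, not the exact dataset schema):

```python
import pandas as pd

sessions = pd.DataFrame({
    "session_id": [1, 2],
    "items": [["B0A", "B0B", "B0C"], ["B0D", "B0E"]],  # toy item IDs
})

# Hold out the last item of each session as the ground-truth "next item";
# the remaining prefix is what candidate generation and ranking may use.
sessions["ground_truth"] = sessions["items"].str[-1]
sessions["prefix"] = sessions["items"].str[:-1]
```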

Training & Features for Ranking

Downsampling

After multi-strategy candidate generation, due to the huge number of negative samples, downsampling is applied before training the ranker

  1. Remove all the sessions where recall@all = 0 in the candidate generation stage
  2. For each session_id, the more strategies a negative sample is retrieved by, the lower its probability of being removed
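
A rough sketch of this downsampling logic, assuming a candidate table with one row per (session_id, item) pair, a binary label, and a strategy_num column counting how many strategies retrieved the candidate (column names and the keep-probability formula are illustrative, not the exact ones from the repo):

```python
import numpy as np
import pandas as pd

def downsample(candidates: pd.DataFrame, keep_ratio: float = 0.2, seed: int = 42) -> pd.DataFrame:
    rng = np.random.default_rng(seed)

    # 1. Drop sessions whose ground truth was never retrieved (recall@all = 0):
    #    the ranker cannot place the correct item for them anyway.
    has_hit = candidates.groupby("session_id")["label"].transform("max") > 0
    candidates = candidates[has_hit]

    # 2. Keep every positive; keep a negative with a probability that grows with
    #    the number of strategies that retrieved it.
    keep_prob = keep_ratio * candidates["strategy_num"] / candidates["strategy_num"].max()
    keep = (candidates["label"] == 1) | (rng.random(len(candidates)) < keep_prob)
    return candidates[keep]
```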

Features & Importance

There are mainly 3 categories of features used in ranking

  1. Features from the candidate generation stage, e.g. count statistics, item similarity
    1. next_item_normalized_weight
    2. nicnic_weight
    3. strategy_num
    4. ….
  2. Price feature
    1. price_mean_norm
    2. price_max_norm
    3. ….
  3. Item embeddings from Word2Vec
    1. vec_1 to vec_32
(image: feature importance)
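
The vec_1 to vec_32 features come from the item embeddings. A minimal sketch of how such 32-dimensional embeddings can be trained with gensim, treating each session's item sequence as a sentence (toy item IDs and untuned hyperparameters, just to show the idea):

```python
from gensim.models import Word2Vec

# Each session's engaged items form one "sentence".
item_sequences = [
    ["B09AAA111", "B07BBB222", "B08CCC333"],
    ["B07BBB222", "B01DDD444"],
]

w2v = Word2Vec(sentences=item_sequences, vector_size=32, window=5,
               min_count=1, sg=1, epochs=5, workers=4)

item_vector = w2v.wv["B07BBB222"]  # 32-dim vector, used as the vec_1..vec_32 ranking features
```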

Performance in Detail

Task 2 is used as an example to show the performance of each stage. The model is evaluated on the eval data.

Stage 1: Candidate generation

| Strategy No. | Avg. recommended items per session | Recall@20 | Recall@100 | MRR@100 |
| --- | --- | --- | --- | --- |
| 1 | 83 | 0.2385 | 0.4022 | 0.1042 |
| 2 | 21 | 0.4536 | 0.4742 | 0.2605 |
| 3 | 48 | 0.5176 | 0.567 | 0.2709 |
| 4 | 104 | 0.5621 | 0.6302 | 0.3210 |
| 5 | 34 | 0.5327 | 0.5836 | 0.2411 |
| Combining the above 5 strategies | 150 | Recall@all: 0.7179 | | |
Stage 2: Ranking

| Model | Avg. recommended items per session | MRR@100 | Recall@20 | Recall@100 |
| --- | --- | --- | --- | --- |
| LGBMRanker | 94 | 0.376 | 0.6317 | 0.7114 |
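
For context, the ranker is LightGBM's LGBMRanker trained with per-session groups. A tiny self-contained sketch of the training call (synthetic data and untuned hyperparameters, just to show the shape of the API):

```python
import numpy as np
from lightgbm import LGBMRanker

# Toy data: 2 sessions with 3 candidates each; in the real pipeline the features
# are the candidate-generation stats, price features, and item embeddings above.
X = np.random.rand(6, 5)
y = np.array([1, 0, 0, 0, 1, 0])   # 1 = ground-truth next item for that session
group = [3, 3]                      # number of candidates per session, in row order

ranker = LGBMRanker(objective="lambdarank", n_estimators=50, learning_rate=0.05)
ranker.fit(X, y, group=group)

scores = ranker.predict(X)          # sort candidates within each session by score, keep the top 100
```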

About Task 3

Usually, the competition host expects better solutions from the participants. For task 3, I don't think the Amazon Search team got what they expected, because if you simply submitted the last item title of each session, you got a BLEU score of 0.26. (And that's what I did 🤣 ) Since the BLEU score of the 1st place is only 0.27, I doubt they did anything much fancier.

Actually, initially I tried to re-use the solution of tasks 1 and 2: predict the next item, then get the title of that item from the product data. That got a BLEU score of 0.14. Then I thought of the last-item-title trick mentioned above and tried it. The result seemed not bad, and I stopped working on this task after that 🤣 🤣
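
That 0.26 baseline is literally a lookup, as in the sketch below (the prev_items and title column names are how I'd lay the data out for illustration; check the dataset for the exact schema):

```python
import pandas as pd

def last_item_title_baseline(sessions: pd.DataFrame, products: pd.DataFrame) -> pd.Series:
    # For each session, submit the title of the most recently engaged item.
    title_by_id = products.set_index("id")["title"]
    last_items = sessions["prev_items"].str[-1]   # last item ID in each session's item list
    return last_items.map(title_by_id)
```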

Methods worth trying (but I didn't have time to)

  1. Candidate generation
    1. Recall@all is the upper bound of the ranker's performance, and it is about 0.7 in the current version. More candidate generation strategies would be needed to increase this value
      1. Make the ALS item similarity and Word2Vec item-sequence similarity work
      2. More co-visitation (co-presence) candidate generation (see the sketch after this list)
  2. Ranking
    1. A more powerful ranker such as a DNN, which can
      1. include more item embedding features
      2. utilize text features such as the product title, which could be embedded with a pre-trained BERT
    2. More features based on product info. Currently the main product info used is the price
    3. Try more downsampling strategies
  3. Model stacking & result blending
    1. This is a common strategy in DS competitions, which I didn't have time to try
  4. Transfer learning for task 2
    1. In the current version, I basically copied the steps from task 1 to task 2. The only transfer learning I used was training a unified Word2Vec model on all the data from task 1 and task 2. More transfer learning might be helpful, such as training the ranker on task 1 data first, then fine-tuning it on task 2 data
  5. For task 3, I don't have many ideas right now 🤣
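
As a concrete example of the co-presence idea above, a bare-bones co-visitation candidate generator could look like this (a sketch of the general technique, not code from the repo):

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_covisitation(session_item_lists):
    # Count how often two items appear in the same session.
    co_counts = defaultdict(Counter)
    for items in session_item_lists:
        for a, b in combinations(set(items), 2):
            co_counts[a][b] += 1
            co_counts[b][a] += 1
    return co_counts

def covisit_candidates(session_items, co_counts, top_k=100):
    # Recommend the items that most frequently co-occur with the session's items.
    scores = Counter()
    for item in session_items:
        scores.update(co_counts.get(item, Counter()))
    return [item for item, _ in scores.most_common(top_k)]
```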
