- Introduction
- About this year’s competition
- Overview
- Data Info
- Expected Results & Evaluation Metrics
- Solution in Detail
- Overview
- Data Split
- Training & Features for Ranking
- Downsampling
- Features & Importance
- Performance in Detail
- About Task 3
- Methods worth trying (but I didn’t have time)
- Reference
Introduction
Knowledge Discovery and Data Mining (KDD) is an annual conference. Each year, a data science competition is held before the conference, and during the conference there is a workshop where the winners share their solutions.
In my spare time, I worked on this year’s competition, and my final rank was not bad: 17th out of 450 teams. In this doc, I’ll share my solution and some learnings.
About this year’s competition
Overview
This year the DS competition was hosted by Amazon: Amazon KDD Cup 2023 - Build Multilingual Recommendation System. Amazon provides a multilingual shopping-session dataset, and participants need to finish three tasks based on this data. There is a leaderboard for each task, and the final rank is decided by combining the three leaderboards. The three tasks are:
- Task 1: predicting the next engaged product for sessions from English, German, and Japanese
- Task 2: predicting the next engaged product for sessions from French, Italian, and Spanish, where transfer learning techniques are encouraged
- Task 3: predicting the title for the next engaged product
Data Info
There are two categories of data: session data (sequences of products that customers engaged with) and product attribute data (title, price, etc.).
Expected Results & Evaluation Metrics
| Task | Expected Result | Evaluation Metric |
| --- | --- | --- |
| Task 1 | For each session, predict the 100 product IDs most likely to be engaged with, based on the historical engagements in the session | Mean Reciprocal Rank (MRR); the basic idea is similar to NDCG |
| Task 2 | Same as Task 1, but for sessions from French, Italian, and Spanish | MRR |
| Task 3 | Predict the title of the next product that a customer will engage with, based on their session data | Bilingual Evaluation Understudy (BLEU) |
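To make the main metric concrete, here is a minimal sketch of how MRR@100 can be computed; this is my own illustration, not the official evaluation script.

```python
# Minimal sketch of MRR@100: for each session, find the rank of the true next
# item within the (up to) 100 predictions; the score is the mean of 1/rank,
# counting 0 when the true item is not in the list.
def mrr_at_100(predictions: dict, ground_truth: dict) -> float:
    """predictions: session_id -> list of up to 100 product IDs, best first
    ground_truth: session_id -> the actual next product ID"""
    total = 0.0
    for session_id, true_item in ground_truth.items():
        preds = predictions.get(session_id, [])[:100]
        if true_item in preds:
            total += 1.0 / (preds.index(true_item) + 1)
    return total / len(ground_truth)
```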
Solution in Detail
The code is here, and the final performance is as follows:
| | Task 1 (MRR@100) | Task 2 (MRR@100) | Task 3 (BLEU) |
| --- | --- | --- | --- |
| 1st place solution | 0.41188 | 0.46845 | 0.27152 |
| My solution | 0.35822 | 0.40144 | 0.26553 |
Overview
The solution is a two-stage recommendation pipeline (a rough code outline is sketched after the notes below):
- Stage 1: multi-strategy candidate generation
  - Goal: retrieve as many ground-truth items as possible
  - Evaluation metrics: recall@20, recall@100, recall@all, MRR@100
- Stage 2: ranking
  - Goal: rank the candidate items as accurately as possible
  - Evaluation metric: MRR@100
Items in the dashed squares of the pipeline diagram were not included in the final solution because:
- computation was too slow for strategies 6 and 7
- performance was not good for strategy 8
- the DNN ranker was not finished before the end of the competition
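Putting the two stages together, the overall flow looks roughly like the outline below. This is only a hedged sketch: the objects and methods (`generate`, `score`) are placeholders for illustration, not functions from my actual code.

```python
# Rough outline of the two-stage pipeline (placeholder objects/methods,
# not the actual implementation).
def recommend(session, strategies, ranker, top_k=100):
    # Stage 1: each candidate-generation strategy proposes items together
    # with its own features (count statistics, similarities, ...)
    candidates = {}
    for strategy in strategies:
        for item, features in strategy.generate(session).items():
            candidates.setdefault(item, {}).update(features)

    # Stage 2: the ranker scores every candidate; keep the top 100 for MRR@100
    scored = sorted(
        ((ranker.score(session, item, feats), item)
         for item, feats in candidates.items()),
        reverse=True,
    )
    return [item for _, item in scored[:top_k]]
```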
Data Split
The following data are used for each stage (a sketch of the split follows this list):
- Candidate generation
  - Model training
    - train data part 1
    - train data part 2, excluding the ground truth of each session
    - eval data, excluding the ground truth of each session
    - test data (we don’t have its ground truth)
  - Performance evaluation
    - eval data
- Ranking
  - Model training
    - train data part 2
  - Performance evaluation
    - eval data
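To illustrate the split, below is a hedged sketch; the column names and split ratios are assumptions, not exactly what my code uses. “Excluding ground truth” simply means dropping the labeled next item so those sessions look like the unlabeled test data.

```python
import pandas as pd

# Hedged sketch of the data split (column names and ratios are illustrative).
sessions = pd.read_csv("sessions_train.csv")   # e.g. prev_items, next_item, locale

train_part_1 = sessions.sample(frac=0.5, random_state=42)
rest = sessions.drop(train_part_1.index)
train_part_2 = rest.sample(frac=0.8, random_state=42)
eval_data = rest.drop(train_part_2.index)

# Sessions fed into candidate generation: train part 1 with its labels, plus
# the label-free prefixes of train part 2 and eval (the test set is unlabeled).
cg_sessions = pd.concat(
    [
        train_part_1,
        train_part_2.drop(columns=["next_item"]),
        eval_data.drop(columns=["next_item"]),
    ],
    ignore_index=True,
)
```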
Training & Features for Ranking
Downsampling
After multi-strategy candidate generation, there is a huge number of negative samples, so downsampling is applied before training the ranker (a hedged sketch follows this list):
- Remove all sessions where recall@all = 0 at the candidate-generation stage (i.e., no strategy retrieved the ground truth).
- For each session_id, the more strategies that retrieved a negative sample, the lower its probability of being removed.
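Below is a hedged sketch of this downsampling step, assuming one row per (session_id, candidate) pair with a binary label and a strategy_num column counting how many strategies retrieved the candidate; the exact sampling rule in my code may differ.

```python
import numpy as np
import pandas as pd

# Hedged sketch of negative downsampling (column names and the exact sampling
# rule are illustrative).
def downsample(candidates: pd.DataFrame, keep_frac: float = 0.3,
               seed: int = 42) -> pd.DataFrame:
    rng = np.random.default_rng(seed)

    # 1. Drop sessions where no strategy retrieved the ground truth (recall@all = 0)
    has_hit = candidates.groupby("session_id")["label"].transform("max") > 0
    candidates = candidates[has_hit]

    positives = candidates[candidates["label"] == 1]
    negatives = candidates[candidates["label"] == 0]

    # 2. Keep a negative with higher probability if more strategies retrieved it
    keep_prob = keep_frac + (1 - keep_frac) * (
        negatives["strategy_num"] / negatives["strategy_num"].max()
    )
    keep = rng.random(len(negatives)) < keep_prob
    return pd.concat([positives, negatives[keep]]).sort_values("session_id")
```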
Features & Importance
There are mainly three categories of features for ranking (a Word2Vec training sketch follows this list):
- Features from the candidate-generation stage, e.g., count statistics and item similarity
- next_item_normalized_weight
- nicnic_weight
- strategy_num
- ….
- Price features
- price_mean_norm
- price_max_norm
- ….
- Item embeddings from Word2Vec
- vec_1 to vec_32
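The item embeddings are obtained by treating each session’s item sequence as a “sentence” and training Word2Vec on these sequences. A minimal sketch with gensim follows; the hyperparameters and the toy product IDs are illustrative, and the 32-dimensional vectors map to the vec_1 … vec_32 features above.

```python
from gensim.models import Word2Vec

# Toy sessions: each "sentence" is the sequence of product IDs in one session
item_sequences = [
    ["B0TOY0001", "B0TOY0002", "B0TOY0003"],
    ["B0TOY0002", "B0TOY0001"],
]

# Train 32-dimensional skip-gram item embeddings (hyperparameters illustrative)
w2v = Word2Vec(sentences=item_sequences, vector_size=32,
               window=5, min_count=1, sg=1, workers=4, epochs=5)

embedding = w2v.wv["B0TOY0001"]                 # 32-dim vector -> vec_1 ... vec_32
similar = w2v.wv.most_similar("B0TOY0001", topn=10)  # usable for candidate similarity
```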
Performance in Detail
Task 2 is used as an example to show the performance of each stage. The models are evaluated on the eval data.
Stage 1: Candidate generation

| Strategy No. | Avg. recommended items per session | Recall@20 | Recall@100 | MRR@100 |
| --- | --- | --- | --- | --- |
| 1 | 83 | 0.2385 | 0.4022 | 0.1042 |
| 2 | 21 | 0.4536 | 0.4742 | 0.2605 |
| 3 | 48 | 0.5176 | 0.5670 | 0.2709 |
| 4 | 104 | 0.5621 | 0.6302 | 0.3210 |
| 5 | 34 | 0.5327 | 0.5836 | 0.2411 |
| All 5 strategies combined | 150 | Recall@all: 0.7179 | | |
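The recall@all number for the combined strategies (0.7179 above) is simply the fraction of eval sessions whose ground-truth next item appears anywhere in the merged candidate set; a minimal sketch of my own, not the exact evaluation code:

```python
# recall@all over the merged candidates of all strategies; recall@K is the
# same idea with the candidate list truncated to its top K items.
def recall_at_all(candidates: dict, ground_truth: dict) -> float:
    """candidates: session_id -> set of candidate product IDs (all strategies merged)
    ground_truth: session_id -> the actual next product ID"""
    hits = sum(1 for sid, item in ground_truth.items()
               if item in candidates.get(sid, set()))
    return hits / len(ground_truth)
```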
Stage 2: Ranking

| Model | Avg. recommended items per session | MRR@100 | Recall@20 | Recall@100 |
| --- | --- | --- | --- | --- |
| LGBMRanker | 94 | 0.376 | 0.6317 | 0.7114 |
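For reference, training the LGBMRanker looks roughly like the sketch below; the DataFrame layout (one row per session/candidate pair with a binary label), the column names, and the hyperparameters are assumptions for illustration, not the exact code in my repo.

```python
import pandas as pd
from lightgbm import LGBMRanker

def train_ranker(train_df: pd.DataFrame) -> LGBMRanker:
    # All columns except identifiers and the label are used as features
    feature_cols = [c for c in train_df.columns
                    if c not in ("session_id", "candidate_item", "label")]

    # lambdarank needs group sizes: number of candidates per session, in row
    # order (assumes rows for each session are contiguous)
    group_sizes = train_df.groupby("session_id", sort=False).size().to_list()

    ranker = LGBMRanker(objective="lambdarank",
                        n_estimators=500,
                        learning_rate=0.05)
    ranker.fit(train_df[feature_cols], train_df["label"], group=group_sizes)
    return ranker
```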
About Task 3
Usually, the competition host expects better solutions from the participants. For Task 3, I don’t think the Amazon Search team got what they expected, because if you simply submit the title of the last item in each session, you get a BLEU score of 0.26 (and that’s what I did 🤣). Since the 1st-place BLEU score is only 0.27, I doubt the top teams came up with anything much fancier.
Actually, I initially tried to reuse the solution of Tasks 1 and 2: predict the next item, then look up its title in the product data. That got a BLEU score of 0.14. Then I realized the last-item-title trick mentioned above and tried it. The result looked decent, so I stopped working on this task after that 🤣 🤣
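For completeness, the last-item-title baseline is roughly the following; the file and column names are assumptions about the data layout, not necessarily what I used.

```python
import pandas as pd

# Hedged sketch of the last-item-title baseline.
products = pd.read_csv("products_train.csv")            # e.g. id, locale, title, ...
test_sessions = pd.read_csv("sessions_test_task3.csv")  # e.g. prev_items, locale

# Take the last product ID of each session's item sequence
test_sessions["last_item"] = (
    test_sessions["prev_items"].str.split().str[-1].str.strip("[]'\"")
)

# Look up that item's title and submit it as the predicted next-item title
title_lookup = dict(zip(zip(products["locale"], products["id"]), products["title"]))
test_sessions["next_item_title"] = [
    title_lookup.get((loc, item), "")
    for loc, item in zip(test_sessions["locale"], test_sessions["last_item"])
]
```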
Methods worth trying (but I didn’t have time)
- Candidate generation
- Recall@all is the upper bound of the ranker’s performance, and it is about 0.72 in the current version (0.7179 in the table above). More candidate-generation strategies would be needed to increase it
- Make the ALS item-similarity and Word2Vec item-sequence-similarity strategies work
- More co-presence candidate generation
- Ranking
- A more powerful ranker such as a DNN, which could
- include more item embedding features
- utilize text features such as product titles, which could be embedded with a pre-trained BERT model
- More features based on product info; currently the only product info used is price
- Try more downsampling strategies
- Model stacking & result blending
- This is a common strategy in DS competitions, which I didn’t have time to try
- Transfer learning for task2
- In the current version, I basically copied the steps from Task 1 to Task 2. The only transfer learning I used was to train a unified Word2Vec model on all the data from Task 1 and Task 2. More transfer learning might help, such as training the ranker on Task 1 data first and then fine-tuning it on Task 2 data
- For Task 3, I don’t have many ideas right now 🤣