- Introduction
- About this year’s competition
- Overview
- Data Info
- Expected Results & Evaluation Metrics
- Solution in Detail
- Overview
- Data Split
- Training & Features for Ranking
- Downsampling
- Features & Importance
- Performance in Detail
- About Task 3
- Methods worth trying (but I didn’t have time)
- Reference
Introduction
Knowledge Discovery and Data Mining (KDD) is an annual conference. Each year, a data science competition is held before the conference, and during the conference there is a workshop where the winners share their solutions.
In my spare time, I worked on this year’s competition, and my final rank was not bad: 17th out of 450 teams. In this doc, I’ll share my solution and some learnings.
About this year’s competition
Overview
This year the DS competition was hosted by Amazon: Amazon KDD Cup 2023 - Build Multilingual Recommendation System. Amazon provides a multilingual shopping-session dataset, and participants need to finish three tasks based on this data. There is a leaderboard for each task, and the final rank is decided by combining the three leaderboards. The three tasks are:
- Task 1: predicting the next engaged product for sessions from English, German, and Japanese
- Task 2: predicting the next engaged product for sessions from French, Italian, and Spanish, where transfer learning techniques are encouraged
- Task 3: predicting the title for the next engaged product
Data Info
There are two categories of data: session data (sequences of products that customers engaged with) and product attribute data (title, price, etc.).
Expected Results & Evaluation Metrics
| Task | Expected Result | Evaluation Metric |
| --- | --- | --- |
| Task 1 | For each session, predict the 100 product IDs most likely to be engaged with, based on the historical engagements in the session | Mean Reciprocal Rank (MRR); the basic idea is similar to NDCG |
| Task 2 | Same as Task 1, but for sessions from French, Italian, and Spanish | MRR |
| Task 3 | Predict the title of the next product that a customer will engage with, based on their session data | Bilingual Evaluation Understudy (BLEU) |
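To make the main metric concrete, here is a minimal sketch of how MRR@100 can be computed; this is my own illustration, not the official evaluation script.

```python
# Minimal sketch of MRR@100: for each session, find the rank of the true next
# item within the (up to) 100 predictions; the score is the mean of 1/rank,
# counting 0 when the true item is not in the list.
def mrr_at_100(predictions: dict, ground_truth: dict) -> float:
    """predictions: session_id -> list of up to 100 product IDs, best first
    ground_truth: session_id -> the actual next product ID"""
    total = 0.0
    for session_id, true_item in ground_truth.items():
        preds = predictions.get(session_id, [])[:100]
        if true_item in preds:
            total += 1.0 / (preds.index(true_item) + 1)
    return total / len(ground_truth)
```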
Solution in Detail
The code is here, and the final performance is as follows:
| | Task 1 (MRR@100) | Task 2 (MRR@100) | Task 3 (BLEU) |
| --- | --- | --- | --- |
| 1st place solution | 0.41188 | 0.46845 | 0.27152 |
| My solution | 0.35822 | 0.40144 | 0.26553 |
Overview
The solution is a two-stage recommendation pipeline (a rough code outline is sketched after the notes below):
- Stage 1: multi-strategy candidate generation
  - Goal: retrieve as many ground-truth items as possible
  - Evaluation metrics: recall@20, recall@100, recall@all, MRR@100
- Stage 2: ranking
  - Goal: rank the candidate items as accurately as possible
  - Evaluation metric: MRR@100
Items in the dashed squares of the pipeline diagram were not included in the final solution because:
- computation was too slow for strategies 6 and 7
- performance was not good for strategy 8
- the DNN ranker was not finished before the end of the competition
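Putting the two stages together, the overall flow looks roughly like the outline below. This is only a hedged sketch: the objects and methods (`generate`, `score`) are placeholders for illustration, not functions from my actual code.

```python
# Rough outline of the two-stage pipeline (placeholder objects/methods,
# not the actual implementation).
def recommend(session, strategies, ranker, top_k=100):
    # Stage 1: each candidate-generation strategy proposes items together
    # with its own features (count statistics, similarities, ...)
    candidates = {}
    for strategy in strategies:
        for item, features in strategy.generate(session).items():
            candidates.setdefault(item, {}).update(features)

    # Stage 2: the ranker scores every candidate; keep the top 100 for MRR@100
    scored = sorted(
        ((ranker.score(session, item, feats), item)
         for item, feats in candidates.items()),
        reverse=True,
    )
    return [item for _, item in scored[:top_k]]
```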
Data Split
The following data are used for each stage (a sketch of the split follows this list):
- Candidate generation
  - Model training
    - train data part 1
    - train data part 2, excluding the ground truth of each session
    - eval data, excluding the ground truth of each session
    - test data (we don’t have its ground truth)
  - Performance evaluation
    - eval data
- Ranking
  - Model training
    - train data part 2
  - Performance evaluation
    - eval data
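To illustrate the split, below is a hedged sketch; the column names and split ratios are assumptions, not exactly what my code uses. “Excluding ground truth” simply means dropping the labeled next item so those sessions look like the unlabeled test data.

```python
import pandas as pd

# Hedged sketch of the data split (column names and ratios are illustrative).
sessions = pd.read_csv("sessions_train.csv")   # e.g. prev_items, next_item, locale

train_part_1 = sessions.sample(frac=0.5, random_state=42)
rest = sessions.drop(train_part_1.index)
train_part_2 = rest.sample(frac=0.8, random_state=42)
eval_data = rest.drop(train_part_2.index)

# Sessions fed into candidate generation: train part 1 with its labels, plus
# the label-free prefixes of train part 2 and eval (the test set is unlabeled).
cg_sessions = pd.concat(
    [
        train_part_1,
        train_part_2.drop(columns=["next_item"]),
        eval_data.drop(columns=["next_item"]),
    ],
    ignore_index=True,
)
```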
Training & Features for Ranking
Downsampling
After multi-strategy candidate generation, there is a huge number of negative samples, so downsampling is applied before training the ranker (a hedged sketch follows this list):
- Remove all sessions where recall@all = 0 at the candidate-generation stage (i.e., no strategy retrieved the ground truth).
- For each session_id, the more strategies that retrieved a negative sample, the lower its probability of being removed.
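Below is a hedged sketch of this downsampling step, assuming one row per (session_id, candidate) pair with a binary label and a strategy_num column counting how many strategies retrieved the candidate; the exact sampling rule in my code may differ.

```python
import numpy as np
import pandas as pd

# Hedged sketch of negative downsampling (column names and the exact sampling
# rule are illustrative).
def downsample(candidates: pd.DataFrame, keep_frac: float = 0.3,
               seed: int = 42) -> pd.DataFrame:
    rng = np.random.default_rng(seed)

    # 1. Drop sessions where no strategy retrieved the ground truth (recall@all = 0)
    has_hit = candidates.groupby("session_id")["label"].transform("max") > 0
    candidates = candidates[has_hit]

    positives = candidates[candidates["label"] == 1]
    negatives = candidates[candidates["label"] == 0]

    # 2. Keep a negative with higher probability if more strategies retrieved it
    keep_prob = keep_frac + (1 - keep_frac) * (
        negatives["strategy_num"] / negatives["strategy_num"].max()
    )
    keep = rng.random(len(negatives)) < keep_prob
    return pd.concat([positives, negatives[keep]]).sort_values("session_id")
```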
Features & Importance
There are mainly three categories of features for ranking (a Word2Vec training sketch follows this list):
- Features from the candidate-generation stage, e.g., count statistics and item similarity
- next_item_normalized_weight
- nicnic_weight
- strategy_num
- ….
- Price features
- price_mean_norm
- price_max_norm
- ….
- Item embeddings from Word2Vec
- vec_1 to vec_32
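The item embeddings are obtained by treating each session’s item sequence as a “sentence” and training Word2Vec on these sequences. A minimal sketch with gensim follows; the hyperparameters and the toy product IDs are illustrative, and the 32-dimensional vectors map to the vec_1 … vec_32 features above.

```python
from gensim.models import Word2Vec

# Toy sessions: each "sentence" is the sequence of product IDs in one session
item_sequences = [
    ["B0TOY0001", "B0TOY0002", "B0TOY0003"],
    ["B0TOY0002", "B0TOY0001"],
]

# Train 32-dimensional skip-gram item embeddings (hyperparameters illustrative)
w2v = Word2Vec(sentences=item_sequences, vector_size=32,
               window=5, min_count=1, sg=1, workers=4, epochs=5)

embedding = w2v.wv["B0TOY0001"]                 # 32-dim vector -> vec_1 ... vec_32
similar = w2v.wv.most_similar("B0TOY0001", topn=10)  # usable for candidate similarity
```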
Performance in Detail
Task 2 is used as an example to show the performance of each stage. The models are evaluated on the eval data.
Stage 1: Candidate generation

| Strategy No. | Avg. recommended items per session | Recall@20 | Recall@100 | MRR@100 |
| --- | --- | --- | --- | --- |
| 1 | 83 | 0.2385 | 0.4022 | 0.1042 |
| 2 | 21 | 0.4536 | 0.4742 | 0.2605 |
| 3 | 48 | 0.5176 | 0.5670 | 0.2709 |
| 4 | 104 | 0.5621 | 0.6302 | 0.3210 |
| 5 | 34 | 0.5327 | 0.5836 | 0.2411 |
| All 5 strategies combined | 150 | Recall@all: 0.7179 | | |
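The recall@all number for the combined strategies (0.7179 above) is simply the fraction of eval sessions whose ground-truth next item appears anywhere in the merged candidate set; a minimal sketch of my own, not the exact evaluation code:

```python
# recall@all over the merged candidates of all strategies; recall@K is the
# same idea with the candidate list truncated to its top K items.
def recall_at_all(candidates: dict, ground_truth: dict) -> float:
    """candidates: session_id -> set of candidate product IDs (all strategies merged)
    ground_truth: session_id -> the actual next product ID"""
    hits = sum(1 for sid, item in ground_truth.items()
               if item in candidates.get(sid, set()))
    return hits / len(ground_truth)
```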
Stage 2: Ranking

| Model | Avg. recommended items per session | MRR@100 | Recall@20 | Recall@100 |
| --- | --- | --- | --- | --- |
| LGBMRanker | 94 | 0.376 | 0.6317 | 0.7114 |
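For reference, training the LGBMRanker looks roughly like the sketch below; the DataFrame layout (one row per session/candidate pair with a binary label), the column names, and the hyperparameters are assumptions for illustration, not the exact code in my repo.

```python
import pandas as pd
from lightgbm import LGBMRanker

def train_ranker(train_df: pd.DataFrame) -> LGBMRanker:
    # All columns except identifiers and the label are used as features
    feature_cols = [c for c in train_df.columns
                    if c not in ("session_id", "candidate_item", "label")]

    # lambdarank needs group sizes: number of candidates per session, in row
    # order (assumes rows for each session are contiguous)
    group_sizes = train_df.groupby("session_id", sort=False).size().to_list()

    ranker = LGBMRanker(objective="lambdarank",
                        n_estimators=500,
                        learning_rate=0.05)
    ranker.fit(train_df[feature_cols], train_df["label"], group=group_sizes)
    return ranker
```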
About Task 3
Usually, the competition host expects better solutions from the participants. For Task 3, I don’t think the Amazon Search team got what they expected, because if you simply submit the title of the last item in each session, you get a BLEU score of 0.26 (and that’s what I did 🤣). Since the 1st-place BLEU score is only 0.27, I doubt the top teams came up with anything much fancier.
Actually, I initially tried to reuse the solution of Tasks 1 and 2: predict the next item, then look up its title in the product data. That got a BLEU score of 0.14. Then I realized the last-item-title trick mentioned above and tried it. The result looked decent, so I stopped working on this task after that 🤣 🤣
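For completeness, the last-item-title baseline is roughly the following; the file and column names are assumptions about the data layout, not necessarily what I used.

```python
import pandas as pd

# Hedged sketch of the last-item-title baseline.
products = pd.read_csv("products_train.csv")            # e.g. id, locale, title, ...
test_sessions = pd.read_csv("sessions_test_task3.csv")  # e.g. prev_items, locale

# Take the last product ID of each session's item sequence
test_sessions["last_item"] = (
    test_sessions["prev_items"].str.split().str[-1].str.strip("[]'\"")
)

# Look up that item's title and submit it as the predicted next-item title
title_lookup = dict(zip(zip(products["locale"], products["id"]), products["title"]))
test_sessions["next_item_title"] = [
    title_lookup.get((loc, item), "")
    for loc, item in zip(test_sessions["locale"], test_sessions["last_item"])
]
```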
Methods worth trying (but I didn’t have time)
- Candidate generation
- Recall@all is the upper bound of the ranker’s performance, and it is about 0.72 in the current version (0.7179 in the table above). More candidate-generation strategies would be needed to increase it
- Make the ALS item-similarity and Word2Vec item-sequence-similarity strategies work
- More co-presence candidate generation
- Ranking
- A more powerful ranker such as a DNN, which could
- include more item embedding features
- utilize text features such as product titles, which could be embedded with a pre-trained BERT model
- More features based on product info; currently the only product info used is price
- Try more downsampling strategies
- Model stacking & result blending
- This is a common strategy in DS competitions, which I didn’t have time to try
- Transfer learning for task2
- In the current version, I basically copied the steps from Task 1 to Task 2. The only transfer learning I used was to train a unified Word2Vec model on all the data from Task 1 and Task 2. More transfer learning might help, such as training the ranker on Task 1 data first and then fine-tuning it on Task 2 data
- For Task 3, I don’t have many ideas right now 🤣