Overcoming bias
in search and recommendations
metarank.ai | github.com/metarank/metarank | Grebennikov Roman | 2022
This is me
- Long ago: PhD in CS, quant trading, credit scoring
- Past: Search & personalization for ~7 years
- Now:
Unemployed, a.k.a. full-time open-source contributor
Metarank
a Swiss Army knife of re-ranking
Ranking around us
Sort by # of clicks

- Pros: easy to implement
- Cons: new items are never on top
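A minimal sketch of this baseline (item data is made up): items are sorted by their accumulated click count, so anything new starts at zero and stays buried.

```python
# Hypothetical catalog with accumulated click counters.
items = [
    {"id": "old-bestseller", "clicks": 1520},
    {"id": "new-arrival",    "clicks": 0},      # never shown yet
    {"id": "mid-tier",       "clicks": 340},
]

# "Sort by # of clicks": trivial to implement...
ranked = sorted(items, key=lambda item: item["clicks"], reverse=True)

# ...but the new item stays last until someone scrolls down and clicks it.
print([item["id"] for item in ranked])  # ['old-bestseller', 'mid-tier', 'new-arrival']
```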
Self-degrading ranking

- People are lazy: top items get more clicks
- Popular items become even more popular
Behavioral biases
- Position: top items clicked more often
- Presentation: grid/pagination affects click probability
- Popularity: snickers vs no-name chocolate
- Model: train the ML model on its own output
Position bias

- We click on first items because they're on top
- Click on #1 - is it relevant or just first?
Experiment: movie search
- top-10k TMDB movies
- Top-30k movie-related queries from Google
- Crowdsourced with toloka.ai
- 650k labels: 8% relevant
Shooting yourself in the foot
- Toloka has an unbiased "search relevance" task template
- We made it biased by imposing an explicit item ordering
The trick: shuffled results
- Top-24 results by BM25 score, but randomly ranked
- relevance should be independent of position *

* - after outlier removal
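A sketch of this collection setup (`bm25_top_k` is a hypothetical search helper, not from the talk): take the top-24 BM25 candidates and shuffle them before building the annotation task, so an item's displayed position carries no information about its score.

```python
import random

def build_annotation_task(query, bm25_top_k, k=24, seed=None):
    """Fetch top-k candidates by BM25 score and shuffle them.

    `bm25_top_k(query, k)` is assumed to return documents ordered by
    BM25 score. After shuffling, position is independent of the score,
    so crowd labels collected on this layout are not position-biased.
    """
    candidates = list(bm25_top_k(query, k))
    random.Random(seed).shuffle(candidates)
    return candidates
```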
Not only position

- presentation affects clicks
Presentation bias
- click-through rate drops on the second row
Bias & search function
- Movie search: navigational function => low bias
- Ecommerce: discovery function => high bias

Not so independent
In practice you observe relevance + bias together
Learning-to-Rank 101

- Implicit feedback: clicks on items
- Item metadata as ranking factors
- Loss function: pairwise, NDCG
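As a concrete (made-up) illustration of that setup, a LambdaMART-style ranker in LightGBM trained on click labels, with `group` marking the ranking sessions:

```python
import numpy as np
import lightgbm as lgb

# Toy data: 2 ranking sessions, 3 items each, 2 metadata features per item.
X = np.array([
    [0.9, 12.0], [0.4,  3.0], [0.7, 8.0],   # session 1
    [0.2,  1.0], [0.8, 20.0], [0.5, 5.0],   # session 2
])
y = np.array([1, 0, 0, 0, 1, 0])   # implicit feedback: clicked / not clicked
group = [3, 3]                     # items per session, in row order

# lambdarank = pairwise loss with NDCG-based weighting of the pairs.
ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=50, min_child_samples=1)
ranker.fit(X, y, group=group)

scores = ranker.predict(X[:3])     # re-rank session 1 by descending score
```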
Not all clicks were made equal

Clicks as relevance labels?

- if observed_label = bias * true_label
- then true_label = observed_label / bias
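Spelled out as the usual examination model (standard notation, not from the slides): a click requires the item to be examined at its position and to be relevant, so dividing the observed click signal by the position propensity recovers the relevance part.

```latex
% examination hypothesis: click = examined (bias) x relevant (true label)
P(\mathrm{click} \mid q, d, k)
  = \underbrace{P(\mathrm{examined} \mid k)}_{\text{bias / propensity}}
    \cdot \underbrace{P(\mathrm{relevant} \mid q, d)}_{\text{true label}}

% hence the de-biasing step:
P(\mathrm{relevant} \mid q, d)
  = \frac{P(\mathrm{click} \mid q, d, k)}{P(\mathrm{examined} \mid k)}
```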

IPW: Inverse Propensity Weighting

- How can we estimate the bias?
IPW in 10 seconds
- Shuffle ranking for small % of traffic
- Estimate the bias
- De-bias the remaining data
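A minimal sketch of those three steps, assuming a click log with a flag for the shuffled traffic slice (column names are invented):

```python
import pandas as pd

# Hypothetical impression log: one row per displayed item.
log = pd.DataFrame({
    "position": [1, 1, 2, 2, 3, 3, 1, 2, 3],
    "clicked":  [1, 1, 1, 0, 0, 1, 1, 0, 0],
    "shuffled": [True, True, True, True, True, True, False, False, False],
})

# 1) Estimate position bias on the shuffled slice only: with a random order,
#    expected relevance is the same at every position, so CTR differences
#    between positions are pure position bias.
ctr = log[log["shuffled"]].groupby("position")["clicked"].mean()
propensity = ctr / ctr.loc[1]            # normalize: position 1 -> 1.0

# 2) De-bias the regular traffic: a click at a low position is up-weighted.
train = log[~log["shuffled"]].copy()
train["weight"] = 1.0 / train["position"].map(propensity)
```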

Estimating the bias

Less costly IPW
- Top-N shuffle: only top-3 positions are affected
- Pairwise swaps: randomize only pairs of positions, over all traffic (see the sketch below)
- Multi-ranker: exploit ongoing A/B tests
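For the pairwise-swap variant, a sketch under the assumption that the log records, for each swap experiment with position k, where the swapped items were actually shown (schema and numbers are made up). Because the same pool of items appears at position 1 and at position k, relevance cancels out and the CTR ratio estimates the relative propensity p_k / p_1.

```python
import pandas as pd

# Hypothetical swap-intervention log: for a small share of requests,
# the items at positions 1 and k are shown in randomized order.
swaps = pd.DataFrame({
    "swap_position": [2, 2, 2, 3, 3, 3],  # which position was swapped with 1
    "shown_at":      [1, 2, 2, 1, 3, 3],  # where the logged item actually appeared
    "clicked":       [1, 1, 0, 1, 1, 0],
})

relative_propensity = {1: 1.0}
for k, grp in swaps.groupby("swap_position"):
    ctr_at_1 = grp.loc[grp["shown_at"] == 1, "clicked"].mean()
    ctr_at_k = grp.loc[grp["shown_at"] == k, "clicked"].mean()
    relative_propensity[k] = ctr_at_k / ctr_at_1   # p_k / p_1
```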
ML model to predict bias?
- Only biased ranking factors: popularity, cost, position
- Predicted click probability == bias

Applying the estimated bias:
- Weighting: prefer bottom-position clicks
- Sampling: drop some top-position clicks
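A sketch of that recipe (feature names and numbers are invented): fit a click model on the biased factors only, treat its predicted click probability as the propensity, then either weight clicks or subsample them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical impression log with only "biased" factors, no relevance features.
X_bias = np.array([          # [position, item_popularity, price]
    [1, 0.9, 10.0], [2, 0.8, 12.0], [3, 0.2, 30.0],
    [1, 0.1, 25.0], [2, 0.7,  8.0], [3, 0.9,  9.0],
])
clicked = np.array([1, 1, 0, 1, 0, 0])

# The model can explain clicks only through bias, so its predicted
# click probability serves as a propensity estimate.
bias_model = LogisticRegression().fit(X_bias, clicked)
propensity = np.clip(bias_model.predict_proba(X_bias)[:, 1], 0.05, 1.0)

# Weighting: a click the bias model considers unlikely counts more in the LTR loss.
click_weight = 1.0 / propensity

# Sampling: keep a click with probability inversely proportional to its propensity,
# which drops a share of the top-position (high-propensity) clicks.
rng = np.random.default_rng(42)
keep = (clicked == 0) | (rng.random(len(clicked)) < propensity.min() / propensity)
```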
A story of IPW

- "Too many unpopular items on top"
- IPW ranking = inverse of non-IPW ranking
- Training data is popularity-biased
- Low popularity = high reward for the ML model
Bias is context-dependent and hard to estimate precisely
Bias-aware ML
Can we learn the relevance AND bias influence at once?

Bias-aware ranking

- Training: Use biased ranking factors as-is
- Inference: Replace these with constants
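A minimal sketch of this bias-aware setup with LightGBM (features and numbers are made up): position goes in as an ordinary feature during training, and at inference every candidate gets the same constant position, so the learned position effect contributes equally to all candidates and drops out of the ordering.

```python
import numpy as np
import lightgbm as lgb

# Training features: [position, relevance_feature_1, relevance_feature_2]
X_train = np.array([
    [1, 0.9, 12.0], [2, 0.4,  3.0], [3, 0.7, 8.0],
    [1, 0.2,  1.0], [2, 0.8, 20.0], [3, 0.5, 5.0],
])
y_train = np.array([1, 0, 1, 1, 0, 0])
group = [3, 3]                      # items per ranking session

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=50, min_child_samples=1)
ranker.fit(X_train, y_train, group=group)   # training: position used as-is

# Inference: position is unknown (we are deciding it right now),
# so every candidate gets the same constant.
CONST_POSITION = 1
X_candidates = np.array([[0.3, 4.0], [0.9, 15.0], [0.6, 7.0]])
X_scored = np.column_stack([
    np.full(len(X_candidates), CONST_POSITION),
    X_candidates,
])
order = np.argsort(-ranker.predict(X_scored))   # new ranking, best first
```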
Ranklens dataset

- Also crowd-sourced with toloka.ai
- ~3k people labelled their favourite movies in ~100 categories
Metarank

- Take a stream of historical/realtime events
- Re-rank top-N candidates for better NDCG
Why PAL?
- Shuffling: complicated and costly
- Can learn context
- Can be adapted to popularity/presentation biases
Results
- Biased: NDCG=0.6002
- De-biased: NDCG=0.6060, +1%

Should you de-bias?
- Navigation vs discovery: it depends
- There are better algos than PAL
Extra
