# Principle: Recommenders Benchmark Prediction Generation
| Knowledge Sources | |
|---|---|
| Domains | Recommender Systems, Benchmarking, Prediction |
| Last Updated | 2026-02-10 00:00 GMT |
## Overview
Standardized prediction and top-K recommendation generation across different algorithms for benchmarking, accounting for the fact that some algorithms produce rating predictions, others produce ranked lists, and some produce both.
## Description
Recommendation algorithms differ fundamentally in their output capabilities. Some algorithms (ALS, SVD, EmbeddingDotBias) can produce rating predictions -- estimating the score a user would give to a specific item. Other algorithms (SAR, NCF, BPR, BiVAE, LightGCN) produce ranked recommendation lists -- ordered sets of items predicted to be most relevant. Some algorithms support both.
The Benchmark Prediction Generation principle separates these two output modes into distinct function families:
- predict_* functions: Generate rating predictions for known user-item pairs (from the test set). Used for computing rating metrics (RMSE, MAE, R2, Explained Variance).
- recommend_k_* functions: Generate top-K ranked recommendation lists for all users. Used for computing ranking metrics (MAP, nDCG@k, Precision@k, Recall@k).
Both function families:
- Accept the trained model and relevant data (test set, training set for seen-item removal).
- Wrap the prediction/recommendation call in a Timer context manager.
- Return a (results, Timer) tuple for consistent metric collection.
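The shared contract above can be sketched as follows. This is an illustrative sketch, not the benchmark's actual code: the `Timer` class is a minimal stand-in for whatever timing utility the benchmark uses, and `predict_svd` is a hypothetical `predict_*` function for a model exposing a `predict(user, item)` method.

```python
import time

class Timer:
    """Minimal stand-in for the benchmark's Timer context manager."""
    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        # Record elapsed wall-clock time on exit.
        self.interval = time.perf_counter() - self._start
        return False

def predict_svd(model, test_pairs):
    """Hypothetical predict_* function: rating predictions for known
    (user, item) pairs, timed and returned as a (results, Timer) tuple."""
    with Timer() as t:
        predictions = [(u, i, model.predict(u, i)) for u, i in test_pairs]
    return predictions, t
```

Because every `predict_*` and `recommend_k_*` function returns the same `(results, Timer)` shape, the benchmark loop can collect timings uniformly without knowing which algorithm produced them.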
Not every algorithm has both a predict_* and recommend_k_* function. The benchmark tracks which algorithms support which metric types through a configuration dictionary:
- Rating-capable: ALS, SVD, EmbeddingDotBias
- Ranking-capable: All eight algorithms
## Usage
Use this principle when benchmarking algorithms that produce different types of outputs. The separation into predict_* and recommend_k_* allows the benchmark loop to conditionally call the appropriate function based on each algorithm's capabilities.
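A capability-driven dispatch might look like the sketch below. The registry sets and the `benchmark_one` function are assumptions for illustration, not the benchmark's actual identifiers; they mirror the rating-capable and ranking-capable groupings listed above.

```python
# Illustrative capability registry mirroring the groupings in the Description.
RATING_CAPABLE = {"als", "svd", "embedding_dot_bias"}
RANKING_CAPABLE = RATING_CAPABLE | {"sar", "ncf", "bpr", "bivae", "lightgcn"}

def benchmark_one(algo, model, test, train,
                  predict_fn=None, recommend_fn=None, top_k=10):
    """Call only the functions an algorithm supports, per the registry."""
    summary = {}
    if algo in RATING_CAPABLE and predict_fn is not None:
        _, timer = predict_fn(model, test)          # rating metrics path
        summary["rating_time"] = timer
    if algo in RANKING_CAPABLE and recommend_fn is not None:
        _, timer = recommend_fn(model, test, train, top_k)  # ranking path
        summary["ranking_time"] = timer
    return summary
```

A ranking-only algorithm such as SAR skips the rating branch entirely, so no rating metrics are reported for it.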
## Theoretical Basis
The two prediction modes correspond to different evaluation paradigms:
Rating Prediction:
```
predict_a(model, test) -> (predictions, timer)

for each (user, item) pair in test set:
    prediction = model.predict(user, item)

Output:     DataFrame with columns [userID, itemID, prediction]
Evaluation: RMSE, MAE, R2, Explained Variance
```
Top-K Recommendation:
```
recommend_k_a(model, test, train, top_k, remove_seen) -> (recs, timer)

for each user:
    candidates = all_items - seen_items   (if remove_seen=True)
    scores = model.score(user, candidates)
    recs = top_k(scores)

Output:     DataFrame with columns [userID, itemID, prediction]
Evaluation: MAP, nDCG@k, Precision@k, Recall@k
```
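The top-K pseudocode can be made concrete with a small pandas sketch. `score_fn` is a stand-in for model scoring, and the function name and column names are illustrative; real implementations would score candidates in batch rather than one pair at a time.

```python
import pandas as pd

def recommend_k_all(score_fn, users, all_items, train, top_k=10,
                    remove_seen=True):
    """Generic top-K generation following the pseudocode above."""
    # Map each user to the set of items they interacted with in training.
    seen = (train.groupby("userID")["itemID"].apply(set).to_dict()
            if remove_seen else {})
    rows = []
    for user in users:
        candidates = [i for i in all_items if i not in seen.get(user, set())]
        scored = sorted(((user, i, score_fn(user, i)) for i in candidates),
                        key=lambda row: row[2], reverse=True)
        rows.extend(scored[:top_k])  # keep the K highest-scored items
    return pd.DataFrame(rows, columns=["userID", "itemID", "prediction"])
```

The resulting DataFrame has the same `[userID, itemID, prediction]` schema as the rating path, so the ranking metrics can consume it directly.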
The remove_seen parameter (defaulting to True) ensures that items already in the training set are excluded from recommendations, preventing trivial recommendations of already-known items. The implementation of seen-item removal varies by algorithm (e.g., SQL outer join for ALS, pandas merge for NCF, built-in parameter for SAR).
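The pandas-merge style of seen-item removal is an anti-join: keep only the recommendation rows whose (user, item) pair does not appear in training. A minimal sketch, with an illustrative function name:

```python
import pandas as pd

def remove_seen(recs, train):
    """Drop (userID, itemID) pairs already present in train via an
    anti-join, one of the removal strategies mentioned above."""
    merged = recs.merge(train[["userID", "itemID"]],
                        on=["userID", "itemID"],
                        how="left", indicator=True)
    # "left_only" rows exist in recs but not in train.
    return (merged[merged["_merge"] == "left_only"]
            .drop(columns="_merge")
            .reset_index(drop=True))
```

The `indicator=True` flag adds a `_merge` column marking each row's provenance, which makes the anti-join a one-line filter.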