Principle: Recommenders Benchmark Results Compilation
| Knowledge Sources | |
|---|---|
| Domains | Recommender Systems, Benchmarking, Analysis |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Results compilation aggregates metrics and timing data from multiple algorithms into comparison tables that support algorithm selection.
Description
After each algorithm has run through the prepare-train-predict-evaluate pipeline, the benchmark holds, per algorithm, a set of metric dictionaries and Timer objects. The Results Compilation principle defines how these individual results are aggregated into a single comparison DataFrame that enables side-by-side analysis.
The compilation process:
- Collects timing data (training time, prediction time, recommendation time) from Timer objects.
- Collects rating metrics (RMSE, MAE, R2, Explained Variance) where available, using NaN for algorithms that do not produce rating predictions.
- Collects ranking metrics (MAP, nDCG@k, Precision@k, Recall@k) where available.
- Merges all results into a single DataFrame with one row per (dataset_size, algorithm) combination and columns for all metrics.
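The collect-and-merge step can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the `summaries` list, the algorithm names, and the metric values are all hypothetical, and only a subset of the columns listed above is shown. The key point is that building the DataFrame from a list of per-run dictionaries yields NaN automatically for any missing column.

```python
import numpy as np
import pandas as pd

# Hypothetical per-run summaries, one dict per (dataset_size, algorithm) pair.
summaries = [
    {"Data": "100k", "Algo": "ALS", "K": 10, "Train time (s)": 3.2,
     "RMSE": 0.97, "nDCG@k": 0.041},
    {"Data": "100k", "Algo": "SAR", "K": 10, "Train time (s)": 0.6,
     "RMSE": np.nan, "nDCG@k": 0.380},  # no rating predictions -> NaN
]

# Merge into one comparison table; one row per (dataset_size, algorithm).
df_results = pd.DataFrame(summaries)
print(df_results)
```

Constructing the frame once from a list of dictionaries is also faster and cleaner than appending rows one at a time inside the loop.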
This unified results table supports:
- Algorithm comparison: Which algorithm achieves the best ranking/rating metrics?
- Performance analysis: Which algorithms are fastest to train and predict?
- Tradeoff analysis: How do accuracy and speed trade off across algorithms?
- Scalability analysis: How do metrics and timing change across dataset sizes (100K, 1M, 10M, 20M)?
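With the unified table in hand, each of these questions reduces to a simple pandas operation. The snippet below is a sketch on a toy `df_results` with made-up values; the column names follow the schema described in this document, but the numbers are illustrative only.

```python
import pandas as pd

df_results = pd.DataFrame([
    {"Data": "100k", "Algo": "ALS", "Train time (s)": 3.2,  "nDCG@k": 0.041},
    {"Data": "100k", "Algo": "SAR", "Train time (s)": 0.6,  "nDCG@k": 0.380},
    {"Data": "1m",   "Algo": "ALS", "Train time (s)": 29.5, "nDCG@k": 0.044},
    {"Data": "1m",   "Algo": "SAR", "Train time (s)": 5.8,  "nDCG@k": 0.350},
])

# Algorithm comparison: best ranking metric per dataset size.
best = df_results.loc[df_results.groupby("Data")["nDCG@k"].idxmax()]

# Performance analysis: which run trained fastest overall.
fastest = df_results.sort_values("Train time (s)").head(1)

# Scalability analysis: training time per algorithm across dataset sizes.
scaling = df_results.pivot(index="Algo", columns="Data", values="Train time (s)")
print(best[["Data", "Algo", "nDCG@k"]])
```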
Usage
Use this principle at the end of a benchmark run to compile all individual algorithm results into a comparison table. The compiled DataFrame is the primary artifact for algorithm selection decisions.
Theoretical Basis
The results compilation follows a collect-merge-analyze pattern:
For each `(data_size, algorithm)` pair, a summary row is assembled and appended to the results table:

```python
summary = {
    "Data": data_size,
    "Algo": algorithm_name,
    "K": top_k_value,
    "Train time (s)": timer_train.interval,
    # A timer or metric may be absent. Using .get(..., np.nan) and explicit
    # None checks preserves legitimate 0.0 values, which the "x or NaN"
    # idiom would silently discard.
    "Predicting time (s)": timer_rating.interval if timer_rating is not None else np.nan,
    "Recommending time (s)": timer_ranking.interval if timer_ranking is not None else np.nan,
    "RMSE": rating_metrics.get("RMSE", np.nan),
    "MAE": rating_metrics.get("MAE", np.nan),
    "R2": rating_metrics.get("R2", np.nan),
    "Explained Variance": rating_metrics.get("Explained Variance", np.nan),
    "MAP": ranking_metrics.get("MAP", np.nan),
    "nDCG@k": ranking_metrics.get("nDCG@k", np.nan),
    "Precision@k": ranking_metrics.get("Precision@k", np.nan),
    "Recall@k": ranking_metrics.get("Recall@k", np.nan),
}
# Append summary as a row to df_results
```
Algorithms that do not support a particular metric type (e.g., SAR does not produce rating predictions) have NaN in the corresponding columns. This design choice keeps the table uniform and avoids separate tables for different metric subsets.
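A consequence of the uniform-table design is that metric-specific views are plain row and column selections rather than joins across separate tables. The snippet below is an illustrative sketch with hypothetical values, using the SAR example from the text.

```python
import numpy as np
import pandas as pd

df_results = pd.DataFrame([
    {"Algo": "ALS", "RMSE": 0.97,   "MAP": 0.004},
    {"Algo": "SAR", "RMSE": np.nan, "MAP": 0.110},  # no rating predictions
])

# Restrict a rating-metric comparison to algorithms that produce ratings.
rating_capable = df_results[df_results["RMSE"].notna()]
print(rating_capable["Algo"].tolist())
```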