Principle: Recommenders Benchmark Results Compilation
| Knowledge Sources | |
|---|---|
| Domains | Recommender Systems, Benchmarking, Analysis |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Results compilation aggregates metrics and timing data from multiple algorithms into comparison tables that support algorithm selection.
Description
After each algorithm has run through the prepare-train-predict-evaluate pipeline, the benchmark holds, per algorithm, a set of metric dictionaries and Timer objects. The Results Compilation principle defines how these individual results are aggregated into a single comparison DataFrame that enables side-by-side analysis.
The compilation process:
- Collects timing data (training time, prediction time, recommendation time) from Timer objects.
- Collects rating metrics (RMSE, MAE, R2, Explained Variance) where available, using NaN for algorithms that do not produce rating predictions.
- Collects ranking metrics (MAP, nDCG@k, Precision@k, Recall@k) where available.
- Merges all results into a single DataFrame with one row per (dataset_size, algorithm) combination and columns for all metrics.
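The collect-and-merge step can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the `summaries` list, the algorithm names, and the metric values are all hypothetical, and only a subset of the columns listed above is shown. The key point is that building the DataFrame from a list of per-run dictionaries yields NaN automatically for any missing column.

```python
import numpy as np
import pandas as pd

# Hypothetical per-run summaries, one dict per (dataset_size, algorithm) pair.
summaries = [
    {"Data": "100k", "Algo": "ALS", "K": 10, "Train time (s)": 3.2,
     "RMSE": 0.97, "nDCG@k": 0.041},
    {"Data": "100k", "Algo": "SAR", "K": 10, "Train time (s)": 0.6,
     "RMSE": np.nan, "nDCG@k": 0.380},  # no rating predictions -> NaN
]

# Merge into one comparison table; one row per (dataset_size, algorithm).
df_results = pd.DataFrame(summaries)
print(df_results)
```

Constructing the frame once from a list of dictionaries is also faster and cleaner than appending rows one at a time inside the loop.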
This unified results table supports:
- Algorithm comparison: Which algorithm achieves the best ranking/rating metrics?
- Performance analysis: Which algorithms are fastest to train and predict?
- Tradeoff analysis: How do accuracy and speed trade off across algorithms?
- Scalability analysis: How do metrics and timing change across dataset sizes (100K, 1M, 10M, 20M)?
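With the unified table in hand, each of these questions reduces to a simple pandas operation. The snippet below is a sketch on a toy `df_results` with made-up values; the column names follow the schema described in this document, but the numbers are illustrative only.

```python
import pandas as pd

df_results = pd.DataFrame([
    {"Data": "100k", "Algo": "ALS", "Train time (s)": 3.2,  "nDCG@k": 0.041},
    {"Data": "100k", "Algo": "SAR", "Train time (s)": 0.6,  "nDCG@k": 0.380},
    {"Data": "1m",   "Algo": "ALS", "Train time (s)": 29.5, "nDCG@k": 0.044},
    {"Data": "1m",   "Algo": "SAR", "Train time (s)": 5.8,  "nDCG@k": 0.350},
])

# Algorithm comparison: best ranking metric per dataset size.
best = df_results.loc[df_results.groupby("Data")["nDCG@k"].idxmax()]

# Performance analysis: which run trained fastest overall.
fastest = df_results.sort_values("Train time (s)").head(1)

# Scalability analysis: training time per algorithm across dataset sizes.
scaling = df_results.pivot(index="Algo", columns="Data", values="Train time (s)")
print(best[["Data", "Algo", "nDCG@k"]])
```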
Usage
Use this principle at the end of a benchmark run to compile all individual algorithm results into a comparison table. The compiled DataFrame is the primary artifact for algorithm selection decisions.
Theoretical Basis
The results compilation follows a collect-merge-analyze pattern:
For each `(data_size, algorithm)` pair, a summary row is assembled and appended to the results table:

```python
summary = {
    "Data": data_size,
    "Algo": algorithm_name,
    "K": top_k_value,
    "Train time (s)": timer_train.interval,
    # A timer or metric may be absent. Using .get(..., np.nan) and explicit
    # None checks preserves legitimate 0.0 values, which the "x or NaN"
    # idiom would silently discard.
    "Predicting time (s)": timer_rating.interval if timer_rating is not None else np.nan,
    "Recommending time (s)": timer_ranking.interval if timer_ranking is not None else np.nan,
    "RMSE": rating_metrics.get("RMSE", np.nan),
    "MAE": rating_metrics.get("MAE", np.nan),
    "R2": rating_metrics.get("R2", np.nan),
    "Explained Variance": rating_metrics.get("Explained Variance", np.nan),
    "MAP": ranking_metrics.get("MAP", np.nan),
    "nDCG@k": ranking_metrics.get("nDCG@k", np.nan),
    "Precision@k": ranking_metrics.get("Precision@k", np.nan),
    "Recall@k": ranking_metrics.get("Recall@k", np.nan),
}
# Append summary as a row to df_results
```
Algorithms that do not support a particular metric type (e.g., SAR does not produce rating predictions) have NaN in the corresponding columns. This design choice keeps the table uniform and avoids separate tables for different metric subsets.
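A consequence of the uniform-table design is that metric-specific views are plain row and column selections rather than joins across separate tables. The snippet below is an illustrative sketch with hypothetical values, using the SAR example from the text.

```python
import numpy as np
import pandas as pd

df_results = pd.DataFrame([
    {"Algo": "ALS", "RMSE": 0.97,   "MAP": 0.004},
    {"Algo": "SAR", "RMSE": np.nan, "MAP": 0.110},  # no rating predictions
])

# Restrict a rating-metric comparison to algorithms that produce ratings.
rating_capable = df_results[df_results["RMSE"].notna()]
print(rating_capable["Algo"].tolist())
```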