Implementation: Recommenders Benchmark Results Analysis
| Knowledge Sources | |
|---|---|
| Domains | Recommender Systems, Benchmarking, Analysis |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
A concrete pattern for aggregating per-algorithm metric dictionaries and timing data into a unified comparison DataFrame for benchmark analysis.
Description
This is a notebook-level pattern (not a reusable library function) implemented in the benchmark notebook at examples/06_benchmarks/movielens.ipynb. The pattern defines:
- A generate_summary helper function that merges timing data and metric dictionaries into a single row dictionary, filling NaN for unsupported metrics.
- A benchmark loop that iterates over dataset sizes and algorithms, calling the prepare-train-predict-evaluate pipeline and collecting results.
- A results DataFrame (df_results) that accumulates all summary rows for analysis.
The generate_summary function handles the heterogeneity of algorithm capabilities: algorithms that do not produce rating predictions (SAR, NCF, BPR, BiVAE, LightGCN) have NaN for RMSE, MAE, R2, and Explained Variance. Algorithms that are not evaluated for ranking have NaN for MAP, nDCG@k, Precision@k, and Recall@k.
Usage
Use this pattern when implementing a new benchmark notebook or extending the existing one. The pattern is flexible enough to accommodate new algorithms by adding entries to the dispatch dictionaries and extending the algorithm list.
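Because the loop is driven entirely by dispatch dictionaries, adding an algorithm touches no control flow. The sketch below illustrates the registration step with a hypothetical algorithm name and placeholder trainer; the real dictionaries live in benchmark_utils and the trainer would wrap an actual model:

```python
# Hypothetical dispatch dictionaries; the real notebook keeps these in
# benchmark_utils, with one entry per supported algorithm.
trainer = {}
metrics = {}
algorithms = ["als", "sar"]

def train_newalgo(params, train):
    # Placeholder trainer for a hypothetical new algorithm: returns the
    # fitted model and the wall-clock training time in seconds.
    model = {"params": params}
    train_time = 0.0
    return model, train_time

# Registering the new algorithm is just new dictionary entries plus an
# extended algorithm list:
trainer["newalgo"] = train_newalgo
metrics["newalgo"] = ["ranking"]   # ranking metrics only, like SAR or BPR
algorithms.append("newalgo")
```

The same registration applies to the other dispatch dictionaries used in the loop (prepare_training_data, rating_predictor, ranking_predictor, and the evaluators).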
Interface Specification
generate_summary Function
def generate_summary(
data, # str: dataset size identifier (e.g., "100k", "1m")
algo, # str: algorithm name (e.g., "als", "sar")
k, # int: top-K value used for ranking metrics
train_time, # Timer: training time from train_* function
time_rating, # Timer or np.nan: prediction time from predict_*
rating_metrics, # dict or None: {"RMSE": float, "MAE": float, ...}
time_ranking, # Timer or np.nan: recommendation time from recommend_k_*
ranking_metrics, # dict or None: {"MAP": float, "nDCG@k": float, ...}
) -> dict:
"""Merge all metrics and timing into a single summary dictionary."""
Results DataFrame Schema
| Column | Type | Description |
|---|---|---|
| Data | str | Dataset size identifier (e.g., "100k", "1m") |
| Algo | str | Algorithm name (e.g., "als", "sar", "svd") |
| K | int | Top-K value used for ranking evaluation |
| Train time (s) | float | Wall-clock training time in seconds |
| Predicting time (s) | float or NaN | Wall-clock rating prediction time |
| Recommending time (s) | float or NaN | Wall-clock ranking recommendation time |
| RMSE | float or NaN | Root Mean Squared Error |
| MAE | float or NaN | Mean Absolute Error |
| R2 | float or NaN | R-squared |
| Explained Variance | float or NaN | Explained Variance score |
| MAP | float or NaN | Mean Average Precision |
| nDCG@k | float or NaN | Normalized Discounted Cumulative Gain at k |
| Precision@k | float or NaN | Precision at k |
| Recall@k | float or NaN | Recall at k |
Benchmark Loop Pattern
import numpy as np
import pandas as pd
from benchmark_utils import *
from recommenders.datasets import movielens
from recommenders.datasets.python_splitters import python_stratified_split
from recommenders.utils.constants import DEFAULT_USER_COL, DEFAULT_ITEM_COL, DEFAULT_RATING_COL, DEFAULT_TIMESTAMP_COL, DEFAULT_K
# Define the results DataFrame
cols = [
"Data", "Algo", "K",
"Train time (s)", "Predicting time (s)",
"RMSE", "MAE", "R2", "Explained Variance",
"Recommending time (s)",
"MAP", "nDCG@k", "Precision@k", "Recall@k",
]
df_results = pd.DataFrame(columns=cols)
# Configuration: which metric types each algorithm supports
metrics = {
"als": ["rating", "ranking"],
"sar": ["ranking"],
"svd": ["rating", "ranking"],
"embdotbias": ["rating", "ranking"],
"ncf": ["ranking"],
"bpr": ["ranking"],
"bivae": ["ranking"],
"lightgcn": ["ranking"],
}
# Main benchmark loop
for data_size in data_sizes:
df = movielens.load_pandas_df(
size=data_size,
header=[DEFAULT_USER_COL, DEFAULT_ITEM_COL, DEFAULT_RATING_COL, DEFAULT_TIMESTAMP_COL],
)
df_train, df_test = python_stratified_split(
df, ratio=0.75, min_rating=1, filter_by="item",
col_user=DEFAULT_USER_COL, col_item=DEFAULT_ITEM_COL,
)
for algo in algorithms:
# Prepare data
train = prepare_training_data[algo](df_train, df_test)
# Train model
model, time_train = trainer[algo](params[algo], train)
# Prepare metric data (backend-specific conversion if needed)
train, test = prepare_metrics_data.get(algo, lambda x, y: (x, y))(df_train, df_test)
# Rating prediction (if supported)
if "rating" in metrics[algo]:
preds, time_rating = rating_predictor[algo](model, test)
ratings = rating_evaluator[algo](test, preds)
else:
ratings = None
time_rating = np.nan
# Ranking recommendation (if supported)
if "ranking" in metrics[algo]:
top_k_scores, time_ranking = ranking_predictor[algo](model, test, train)
rankings = ranking_evaluator[algo](test, top_k_scores, DEFAULT_K)
else:
rankings = None
time_ranking = np.nan
# Compile results
summary = generate_summary(
data_size, algo, DEFAULT_K,
time_train, time_rating, ratings,
time_ranking, rankings,
)
df_results.loc[df_results.shape[0] + 1] = summary
# Display the results table
df_results
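The `.loc[df_results.shape[0] + 1]` append mirrors the notebook (and produces the 1-based index visible in the example output), but it grows the DataFrame one row at a time. For many dataset/algorithm combinations, an equivalent alternative is to collect the summary dictionaries in a list and build the frame once; a sketch with hypothetical rows:

```python
import pandas as pd

# Collect each summary dict produced inside the loop body, then build
# the results table in a single pass at the end.
summaries = []
summaries.append({"Data": "100k", "Algo": "als", "K": 10})  # hypothetical rows
summaries.append({"Data": "100k", "Algo": "sar", "K": 10})

df_results = pd.DataFrame(summaries)
df_results.index = range(1, len(df_results) + 1)  # match the 1-based index
```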
Example Output
| | Data | Algo | K | Train time (s) | Predicting time (s) | RMSE | MAE | R2 | Explained Variance | Recommending time (s) | MAP | nDCG@k | Precision@k | Recall@k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 100k | als | 10 | 12.3537 | 0.0996 | 0.9665 | 0.7538 | 0.2683 | 0.2639 | 0.1375 | 0.0033 | 0.0332 | 0.0385 | 0.0134 |
| 2 | 100k | svd | 10 | 0.9307 | 0.1459 | 0.9425 | 0.7448 | 0.3000 | 0.3000 | 13.2209 | 0.0121 | 0.0944 | 0.0891 | 0.0301 |
| 3 | 100k | sar | 10 | 0.2343 | NaN | NaN | NaN | NaN | NaN | 0.0936 | 0.1140 | 0.3938 | 0.3406 | 0.1854 |
| 4 | 100k | ncf | 10 | 113.2512 | NaN | NaN | NaN | NaN | NaN | 8.9240 | 0.0952 | 0.3686 | 0.3269 | 0.1632 |
| 5 | 100k | embdotbias | 10 | 81.8275 | 0.0344 | 0.9928 | 0.7760 | 0.2233 | 0.2234 | 1.6463 | 0.0190 | 0.1178 | 0.1042 | 0.0425 |
| 6 | 100k | bpr | 10 | 4.9720 | NaN | NaN | NaN | NaN | NaN | 0.5015 | 0.1340 | 0.4450 | 0.3887 | 0.2166 |
| 7 | 100k | bivae | 10 | 22.6604 | NaN | NaN | NaN | NaN | NaN | 0.6609 | 0.1436 | 0.4687 | 0.4082 | 0.2209 |
| 8 | 100k | lightgcn | 10 | 23.1118 | NaN | NaN | NaN | NaN | NaN | 0.0602 | 0.1209 | 0.4164 | 0.3592 | 0.1960 |
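Once populated, the results table can be sliced and sorted like any DataFrame; for example, ranking algorithms by nDCG@k or restricting to those that produce rating predictions. The snippet uses a small hypothetical subset of the table above:

```python
import numpy as np
import pandas as pd

# Hypothetical subset of the results table, for illustration only.
df_results = pd.DataFrame({
    "Data": ["100k", "100k", "100k"],
    "Algo": ["als", "sar", "bpr"],
    "nDCG@k": [0.0332, 0.3938, 0.4450],
    "RMSE": [0.9665, np.nan, np.nan],
})

# Rank algorithms by ranking quality within each dataset size.
best_ranking = df_results.sort_values(
    ["Data", "nDCG@k"], ascending=[True, False]
)

# Restrict to algorithms that produce rating predictions (non-NaN RMSE).
rating_algos = df_results.dropna(subset=["RMSE"])
```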