Implementation: Recommenders Benchmark Results Analysis
| Knowledge Sources | |
|---|---|
| Domains | Recommender Systems, Benchmarking, Analysis |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
A concrete pattern for aggregating per-algorithm metric dictionaries and timing data into a unified comparison DataFrame for benchmark analysis.
Description
This is a notebook-level pattern (not a reusable library function) implemented in the benchmark notebook at examples/06_benchmarks/movielens.ipynb. The pattern defines:
- A generate_summary helper function that merges timing data and metric dictionaries into a single row dictionary, filling NaN for unsupported metrics.
- A benchmark loop that iterates over dataset sizes and algorithms, calling the prepare-train-predict-evaluate pipeline and collecting results.
- A results DataFrame (df_results) that accumulates all summary rows for analysis.
The generate_summary function handles the heterogeneity of algorithm capabilities: algorithms that do not produce rating predictions (SAR, NCF, BPR, BiVAE, LightGCN) have NaN for RMSE, MAE, R2, and Explained Variance. Algorithms that are not evaluated for ranking have NaN for MAP, nDCG@k, Precision@k, and Recall@k.
Usage
Use this pattern when implementing a new benchmark notebook or extending the existing one. The pattern is flexible enough to accommodate new algorithms by adding entries to the dispatch dictionaries and extending the algorithm list.
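Because the loop is driven entirely by dispatch dictionaries, adding an algorithm touches no control flow. The sketch below illustrates the registration step with a hypothetical algorithm name and placeholder trainer; the real dictionaries live in benchmark_utils and the trainer would wrap an actual model:

```python
# Hypothetical dispatch dictionaries; the real notebook keeps these in
# benchmark_utils, with one entry per supported algorithm.
trainer = {}
metrics = {}
algorithms = ["als", "sar"]

def train_newalgo(params, train):
    # Placeholder trainer for a hypothetical new algorithm: returns the
    # fitted model and the wall-clock training time in seconds.
    model = {"params": params}
    train_time = 0.0
    return model, train_time

# Registering the new algorithm is just new dictionary entries plus an
# extended algorithm list:
trainer["newalgo"] = train_newalgo
metrics["newalgo"] = ["ranking"]   # ranking metrics only, like SAR or BPR
algorithms.append("newalgo")
```

The same registration applies to the other dispatch dictionaries used in the loop (prepare_training_data, rating_predictor, ranking_predictor, and the evaluators).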
Interface Specification
generate_summary Function
def generate_summary(
data, # str: dataset size identifier (e.g., "100k", "1m")
algo, # str: algorithm name (e.g., "als", "sar")
k, # int: top-K value used for ranking metrics
train_time, # Timer: training time from train_* function
time_rating, # Timer or np.nan: prediction time from predict_*
rating_metrics, # dict or None: {"RMSE": float, "MAE": float, ...}
time_ranking, # Timer or np.nan: recommendation time from recommend_k_*
ranking_metrics, # dict or None: {"MAP": float, "nDCG@k": float, ...}
) -> dict:
"""Merge all metrics and timing into a single summary dictionary."""
Results DataFrame Schema
| Column | Type | Description |
|---|---|---|
| Data | str | Dataset size identifier (e.g., "100k", "1m") |
| Algo | str | Algorithm name (e.g., "als", "sar", "svd") |
| K | int | Top-K value used for ranking evaluation |
| Train time (s) | float | Wall-clock training time in seconds |
| Predicting time (s) | float or NaN | Wall-clock rating prediction time |
| Recommending time (s) | float or NaN | Wall-clock ranking recommendation time |
| RMSE | float or NaN | Root Mean Squared Error |
| MAE | float or NaN | Mean Absolute Error |
| R2 | float or NaN | R-squared |
| Explained Variance | float or NaN | Explained Variance score |
| MAP | float or NaN | Mean Average Precision |
| nDCG@k | float or NaN | Normalized Discounted Cumulative Gain at k |
| Precision@k | float or NaN | Precision at k |
| Recall@k | float or NaN | Recall at k |
Benchmark Loop Pattern
import numpy as np
import pandas as pd
from benchmark_utils import *
from recommenders.datasets import movielens
from recommenders.datasets.python_splitters import python_stratified_split
from recommenders.utils.constants import DEFAULT_USER_COL, DEFAULT_ITEM_COL, DEFAULT_RATING_COL, DEFAULT_TIMESTAMP_COL, DEFAULT_K
# Define the results DataFrame
cols = [
"Data", "Algo", "K",
"Train time (s)", "Predicting time (s)",
"RMSE", "MAE", "R2", "Explained Variance",
"Recommending time (s)",
"MAP", "nDCG@k", "Precision@k", "Recall@k",
]
df_results = pd.DataFrame(columns=cols)
# Configuration: which metric types each algorithm supports
metrics = {
"als": ["rating", "ranking"],
"sar": ["ranking"],
"svd": ["rating", "ranking"],
"embdotbias": ["rating", "ranking"],
"ncf": ["ranking"],
"bpr": ["ranking"],
"bivae": ["ranking"],
"lightgcn": ["ranking"],
}
# Main benchmark loop
for data_size in data_sizes:
df = movielens.load_pandas_df(
size=data_size,
header=[DEFAULT_USER_COL, DEFAULT_ITEM_COL, DEFAULT_RATING_COL, DEFAULT_TIMESTAMP_COL],
)
df_train, df_test = python_stratified_split(
df, ratio=0.75, min_rating=1, filter_by="item",
col_user=DEFAULT_USER_COL, col_item=DEFAULT_ITEM_COL,
)
for algo in algorithms:
# Prepare data
train = prepare_training_data[algo](df_train, df_test)
# Train model
model, time_train = trainer[algo](params[algo], train)
# Prepare metric data (backend-specific conversion if needed)
train, test = prepare_metrics_data.get(algo, lambda x, y: (x, y))(df_train, df_test)
# Rating prediction (if supported)
if "rating" in metrics[algo]:
preds, time_rating = rating_predictor[algo](model, test)
ratings = rating_evaluator[algo](test, preds)
else:
ratings = None
time_rating = np.nan
# Ranking recommendation (if supported)
if "ranking" in metrics[algo]:
top_k_scores, time_ranking = ranking_predictor[algo](model, test, train)
rankings = ranking_evaluator[algo](test, top_k_scores, DEFAULT_K)
else:
rankings = None
time_ranking = np.nan
# Compile results
summary = generate_summary(
data_size, algo, DEFAULT_K,
time_train, time_rating, ratings,
time_ranking, rankings,
)
df_results.loc[df_results.shape[0] + 1] = summary
# Display the results table
df_results
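The `.loc[df_results.shape[0] + 1]` append mirrors the notebook (and produces the 1-based index visible in the example output), but it grows the DataFrame one row at a time. For many dataset/algorithm combinations, an equivalent alternative is to collect the summary dictionaries in a list and build the frame once; a sketch with hypothetical rows:

```python
import pandas as pd

# Collect each summary dict produced inside the loop body, then build
# the results table in a single pass at the end.
summaries = []
summaries.append({"Data": "100k", "Algo": "als", "K": 10})  # hypothetical rows
summaries.append({"Data": "100k", "Algo": "sar", "K": 10})

df_results = pd.DataFrame(summaries)
df_results.index = range(1, len(df_results) + 1)  # match the 1-based index
```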
Example Output
| | Data | Algo | K | Train time (s) | Predicting time (s) | RMSE | MAE | R2 | Explained Variance | Recommending time (s) | MAP | nDCG@k | Precision@k | Recall@k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 100k | als | 10 | 12.3537 | 0.0996 | 0.9665 | 0.7538 | 0.2683 | 0.2639 | 0.1375 | 0.0033 | 0.0332 | 0.0385 | 0.0134 |
| 2 | 100k | svd | 10 | 0.9307 | 0.1459 | 0.9425 | 0.7448 | 0.3000 | 0.3000 | 13.2209 | 0.0121 | 0.0944 | 0.0891 | 0.0301 |
| 3 | 100k | sar | 10 | 0.2343 | NaN | NaN | NaN | NaN | NaN | 0.0936 | 0.1140 | 0.3938 | 0.3406 | 0.1854 |
| 4 | 100k | ncf | 10 | 113.2512 | NaN | NaN | NaN | NaN | NaN | 8.9240 | 0.0952 | 0.3686 | 0.3269 | 0.1632 |
| 5 | 100k | embdotbias | 10 | 81.8275 | 0.0344 | 0.9928 | 0.7760 | 0.2233 | 0.2234 | 1.6463 | 0.0190 | 0.1178 | 0.1042 | 0.0425 |
| 6 | 100k | bpr | 10 | 4.9720 | NaN | NaN | NaN | NaN | NaN | 0.5015 | 0.1340 | 0.4450 | 0.3887 | 0.2166 |
| 7 | 100k | bivae | 10 | 22.6604 | NaN | NaN | NaN | NaN | NaN | 0.6609 | 0.1436 | 0.4687 | 0.4082 | 0.2209 |
| 8 | 100k | lightgcn | 10 | 23.1118 | NaN | NaN | NaN | NaN | NaN | 0.0602 | 0.1209 | 0.4164 | 0.3592 | 0.1960 |
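Once populated, the results table can be sliced and sorted like any DataFrame; for example, ranking algorithms by nDCG@k or restricting to those that produce rating predictions. The snippet uses a small hypothetical subset of the table above:

```python
import numpy as np
import pandas as pd

# Hypothetical subset of the results table, for illustration only.
df_results = pd.DataFrame({
    "Data": ["100k", "100k", "100k"],
    "Algo": ["als", "sar", "bpr"],
    "nDCG@k": [0.0332, 0.3938, 0.4450],
    "RMSE": [0.9665, np.nan, np.nan],
})

# Rank algorithms by ranking quality within each dataset size.
best_ranking = df_results.sort_values(
    ["Data", "nDCG@k"], ascending=[True, False]
)

# Restrict to algorithms that produce rating predictions (non-NaN RMSE).
rating_algos = df_results.dropna(subset=["RMSE"])
```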