
Implementation:Recommenders team Recommenders Benchmark Results Analysis

From Leeroopedia


Knowledge Sources
Domains Recommender Systems, Benchmarking, Analysis
Last Updated 2026-02-10 00:00 GMT

Overview

A concrete pattern for aggregating per-algorithm metric dictionaries and timing data into a single comparison DataFrame for benchmark analysis.

Description

This is a notebook-level pattern (not a reusable library function) implemented in the benchmark notebook at examples/06_benchmarks/movielens.ipynb. The pattern defines:

  1. A generate_summary helper function that merges timing data and metric dictionaries into a single row dictionary, filling NaN for unsupported metrics.
  2. A benchmark loop that iterates over dataset sizes and algorithms, calling the prepare-train-predict-evaluate pipeline and collecting results.
  3. A results DataFrame (df_results) that accumulates all summary rows for analysis.

The generate_summary function handles the heterogeneity of algorithm capabilities: algorithms that do not produce rating predictions (SAR, NCF, BPR, BiVAE, LightGCN) have NaN for RMSE, MAE, R2, and Explained Variance. Algorithms that are not evaluated for ranking have NaN for MAP, nDCG@k, Precision@k, and Recall@k.

Usage

Use this pattern when implementing a new benchmark notebook or extending the existing one. The pattern is flexible enough to accommodate new algorithms by adding entries to the dispatch dictionaries and extending the algorithm list.
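To make the extension step concrete, the following is a minimal sketch of registering a new algorithm in the dispatch dictionaries the loop consults. All names here (the "mynew" key, the hyperparameters, the trainer body) are illustrative assumptions, not the notebook's actual code:

```python
# Hypothetical registration of a new algorithm "mynew" in the dispatch
# dictionaries. In the notebook these dicts already exist; they are
# created empty here only so the sketch is self-contained.
params = {}
trainer = {}
metrics = {}
algorithms = []

def train_mynew(hyperparams, train):
    """Train the new model; must return (model, train_time) like the other trainers."""
    model = ...  # fit the model on `train` here
    return model, 0.0

params["mynew"] = {"factors": 100, "epochs": 15}  # illustrative hyperparameters
trainer["mynew"] = train_mynew
metrics["mynew"] = ["ranking"]   # which metric families it supports
algorithms.append("mynew")       # include it in the benchmark loop
```

An algorithm that also supports rating prediction would additionally register entries in `rating_predictor` and `rating_evaluator`, and list `"rating"` in its `metrics` entry.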

Interface Specification

generate_summary Function

def generate_summary(
    data,             # str: dataset size identifier (e.g., "100k", "1m")
    algo,             # str: algorithm name (e.g., "als", "sar")
    k,                # int: top-K value used for ranking metrics
    train_time,       # Timer: training time from train_* function
    time_rating,      # Timer or np.nan: prediction time from predict_*
    rating_metrics,   # dict or None: {"RMSE": float, "MAE": float, ...}
    time_ranking,     # Timer or np.nan: recommendation time from recommend_k_*
    ranking_metrics,  # dict or None: {"MAP": float, "nDCG@k": float, ...}
) -> dict:
    """Merge all metrics and timing into a single summary dictionary."""

Results DataFrame Schema

Column Type Description
Data str Dataset size identifier (e.g., "100k", "1m")
Algo str Algorithm name (e.g., "als", "sar", "svd")
K int Top-K value used for ranking evaluation
Train time (s) float Wall-clock training time in seconds
Predicting time (s) float or NaN Wall-clock rating prediction time
Recommending time (s) float or NaN Wall-clock ranking recommendation time
RMSE float or NaN Root Mean Squared Error
MAE float or NaN Mean Absolute Error
R2 float or NaN R-squared
Explained Variance float or NaN Explained Variance score
MAP float or NaN Mean Average Precision
nDCG@k float or NaN Normalized Discounted Cumulative Gain at k
Precision@k float or NaN Precision at k
Recall@k float or NaN Recall at k
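A single row conforming to this schema looks like the following (values taken from the SAR line of the example output further below; only the dict shape, not the numbers, is the point):

```python
import numpy as np
import pandas as pd

# One summary row matching the 14-column schema; SAR produces no rating
# predictions, so the rating columns and prediction time are NaN.
row = {
    "Data": "100k", "Algo": "sar", "K": 10,
    "Train time (s)": 0.2343,
    "Predicting time (s)": np.nan,
    "Recommending time (s)": 0.0936,
    "RMSE": np.nan, "MAE": np.nan, "R2": np.nan, "Explained Variance": np.nan,
    "MAP": 0.1140, "nDCG@k": 0.3938, "Precision@k": 0.3406, "Recall@k": 0.1854,
}
df = pd.DataFrame([row])
```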

Benchmark Loop Pattern

import numpy as np
import pandas as pd
from benchmark_utils import *  # assumed to provide data_sizes, algorithms, params, and the dispatch dictionaries used below
from recommenders.datasets import movielens
from recommenders.datasets.python_splitters import python_stratified_split
from recommenders.utils.constants import DEFAULT_USER_COL, DEFAULT_ITEM_COL, DEFAULT_RATING_COL, DEFAULT_TIMESTAMP_COL, DEFAULT_K

# Define the results DataFrame
cols = [
    "Data", "Algo", "K",
    "Train time (s)", "Predicting time (s)",
    "RMSE", "MAE", "R2", "Explained Variance",
    "Recommending time (s)",
    "MAP", "nDCG@k", "Precision@k", "Recall@k",
]
df_results = pd.DataFrame(columns=cols)

# Configuration: which metric types each algorithm supports
metrics = {
    "als": ["rating", "ranking"],
    "sar": ["ranking"],
    "svd": ["rating", "ranking"],
    "embdotbias": ["rating", "ranking"],
    "ncf": ["ranking"],
    "bpr": ["ranking"],
    "bivae": ["ranking"],
    "lightgcn": ["ranking"],
}

# Main benchmark loop
for data_size in data_sizes:
    df = movielens.load_pandas_df(
        size=data_size,
        header=[DEFAULT_USER_COL, DEFAULT_ITEM_COL, DEFAULT_RATING_COL, DEFAULT_TIMESTAMP_COL],
    )
    df_train, df_test = python_stratified_split(
        df, ratio=0.75, min_rating=1, filter_by="item",
        col_user=DEFAULT_USER_COL, col_item=DEFAULT_ITEM_COL,
    )

    for algo in algorithms:
        # Prepare data
        train = prepare_training_data[algo](df_train, df_test)
        # Train model
        model, time_train = trainer[algo](params[algo], train)
        # Prepare metric data (backend-specific conversion if needed)
        train, test = prepare_metrics_data.get(algo, lambda x, y: (x, y))(df_train, df_test)

        # Rating prediction (if supported)
        if "rating" in metrics[algo]:
            preds, time_rating = rating_predictor[algo](model, test)
            ratings = rating_evaluator[algo](test, preds)
        else:
            ratings = None
            time_rating = np.nan

        # Ranking recommendation (if supported)
        if "ranking" in metrics[algo]:
            top_k_scores, time_ranking = ranking_predictor[algo](model, test, train)
            rankings = ranking_evaluator[algo](test, top_k_scores, DEFAULT_K)
        else:
            rankings = None
            time_ranking = np.nan

        # Compile results
        summary = generate_summary(
            data_size, algo, DEFAULT_K,
            time_train, time_rating, ratings,
            time_ranking, rankings,
        )
        df_results.loc[df_results.shape[0] + 1] = summary

# Display the results table
df_results

Example Output

   Data        Algo   K  Train time (s)  Predicting time (s)    RMSE     MAE      R2  Explained Variance  Recommending time (s)    MAP    nDCG@k  Precision@k  Recall@k
1  100k         als  10         12.3537               0.0996  0.9665  0.7538  0.2683              0.2639                 0.1375  0.0033  0.0332       0.0385    0.0134
2  100k         svd  10          0.9307               0.1459  0.9425  0.7448  0.3000              0.3000                13.2209  0.0121  0.0944       0.0891    0.0301
3  100k         sar  10          0.2343                  NaN     NaN     NaN     NaN                 NaN                 0.0936  0.1140  0.3938       0.3406    0.1854
4  100k         ncf  10        113.2512                  NaN     NaN     NaN     NaN                 NaN                 8.9240  0.0952  0.3686       0.3269    0.1632
5  100k  embdotbias  10         81.8275               0.0344  0.9928  0.7760  0.2233              0.2234                 1.6463  0.0190  0.1178       0.1042    0.0425
6  100k         bpr  10          4.9720                  NaN     NaN     NaN     NaN                 NaN                 0.5015  0.1340  0.4450       0.3887    0.2166
7  100k       bivae  10         22.6604                  NaN     NaN     NaN     NaN                 NaN                 0.6609  0.1436  0.4687       0.4082    0.2209
8  100k    lightgcn  10         23.1118                  NaN     NaN     NaN     NaN                 NaN                 0.0602  0.1209  0.4164       0.3592    0.1960
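Once populated, df_results can be queried like any DataFrame. A small illustrative example, using a stand-in frame built from three rows of the output above rather than the full results:

```python
import pandas as pd

# Stand-in for df_results, built from the example output above.
df_results = pd.DataFrame({
    "Data": ["100k"] * 3,
    "Algo": ["sar", "bpr", "bivae"],
    "nDCG@k": [0.3938, 0.4450, 0.4687],
})

# Rank algorithms by ranking quality (nDCG@k, descending).
best = df_results.sort_values("nDCG@k", ascending=False).reset_index(drop=True)
print(best.loc[0, "Algo"])  # → bivae
```

The same sort applied to the full table above shows BiVAE, BPR, and LightGCN leading on ranking metrics for the 100k split, while SVD leads on RMSE among the rating-capable algorithms.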

Related Pages

Implements Principle

Requires Environment
