Principle: Recommenders Benchmark Metric Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Recommender Systems, Benchmarking, Evaluation |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Unified metric computation for benchmarking: standardized rating and ranking metrics are computed consistently across all algorithms, with both Python and Spark evaluation backends.
Description
When benchmarking recommendation algorithms, metrics must be computed consistently to ensure fair comparison. Different algorithms run on different backends (Python/pandas vs. PySpark), so the evaluation functions must also support both backends while producing equivalent results.
The Benchmark Metric Evaluation principle provides four functions that form a 2x2 matrix:
| Backend | Rating Metrics | Ranking Metrics |
|---|---|---|
| Python backend | rating_metrics_python | ranking_metrics_python |
| PySpark backend | rating_metrics_pyspark | ranking_metrics_pyspark |
Each function:
- Accepts the test set and model predictions (rating or ranking).
- Computes a fixed set of standard metrics.
- Returns a dictionary with consistent string keys, enabling uniform aggregation.
Rating metrics (for algorithms that predict ratings):
- RMSE -- Root Mean Squared Error
- MAE -- Mean Absolute Error
- R2 -- R-squared (coefficient of determination)
- Explained Variance
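The four rating metrics can be sketched directly in NumPy. This is a minimal illustration of the pattern, not the library's implementation: the function name `rating_metrics_sketch` is invented here, and the real backend functions consume DataFrames of test and prediction rows rather than raw arrays.

```python
import numpy as np

def rating_metrics_sketch(y_true, y_pred):
    """Compute the four standard rating metrics and return them
    under consistent string keys (sketch only)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    ss_res = np.sum(err ** 2)                       # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
        "R2": float(1 - ss_res / ss_tot),
        "Explained Variance": float(1 - np.var(err) / np.var(y_true)),
    }

print(rating_metrics_sketch([4, 3, 5, 1], [3.5, 3, 4, 2]))
```

Returning a plain dict with fixed string keys is what lets the benchmark loop aggregate results uniformly across backends.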
Ranking metrics (for algorithms that produce top-K recommendations):
- MAP -- Mean Average Precision
- nDCG@k -- Normalized Discounted Cumulative Gain at k
- Precision@k -- Precision at k
- Recall@k -- Recall at k
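Precision@k and Recall@k for a single user reduce to counting hits in the top-k list. The sketch below assumes binary relevance and an invented helper name; the real backend functions also average these values over all users.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and Recall@k for one user with binary relevance.

    recommended: ordered list of recommended item ids
    relevant:    set of item ids the user actually interacted with
    """
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))  # relevant items in the top-k
    return hits / k, hits / len(relevant)

# top-3 of [a, b, c, d] is [a, b, c]; only "b" is relevant
p, r = precision_recall_at_k(["a", "b", "c", "d"], {"b", "d", "e"}, k=3)
```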
The consistent dictionary keys across Python and PySpark backends allow the benchmark loop to merge results without backend-specific handling.
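The aggregation step this enables can be sketched as follows. Because every evaluator returns a dict with the same string keys, rows from either backend stack into one table; the algorithm names and metric values below are purely illustrative.

```python
import pandas as pd

# Dicts returned by the Python- and PySpark-backed evaluators for
# two different algorithms (values here are made up for illustration).
results = [
    {"Algo": "algo_py",    "RMSE": 0.94, "MAE": 0.74},
    {"Algo": "algo_spark", "RMSE": 0.97, "MAE": 0.77},
]

# Uniform keys become uniform columns -- no backend-specific handling.
table = pd.DataFrame(results)
print(table)
```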
Usage
Use this principle when evaluating algorithm outputs in a benchmark. Select the Python or PySpark variant based on whether the algorithm's predictions are pandas DataFrames or Spark DataFrames.
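Backend selection can be reduced to a type check on the prediction DataFrame. In this sketch the two backend functions are stubs standing in for the real evaluators (their bodies and the dispatch helper `evaluate_ranking` are illustrative, not the library's code):

```python
import pandas as pd

def ranking_metrics_python(test, preds, k):
    # Stub: the real function computes the metrics on pandas DataFrames.
    return {"MAP": 0.0, "nDCG@k": 0.0, "Precision@k": 0.0, "Recall@k": 0.0}

def ranking_metrics_pyspark(test, preds, k):
    # Stub: the real function computes the metrics on Spark DataFrames.
    return {"MAP": 0.0, "nDCG@k": 0.0, "Precision@k": 0.0, "Recall@k": 0.0}

def evaluate_ranking(test, preds, k=10):
    """Pick the evaluation backend from the prediction DataFrame type."""
    if isinstance(preds, pd.DataFrame):
        return ranking_metrics_python(test, preds, k)
    return ranking_metrics_pyspark(test, preds, k)
```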
Theoretical Basis
Rating Metrics:
RMSE = sqrt(mean((y_true - y_pred)^2))
MAE = mean(|y_true - y_pred|)
R2 = 1 - SS_res / SS_tot
ExpVar = 1 - Var(y_true - y_pred) / Var(y_true)
Ranking Metrics:
Precision@k = |relevant items in top-k| / k
Recall@k = |relevant items in top-k| / |all relevant items|
MAP = mean of Average Precision over all users
nDCG@k = DCG@k / IDCG@k
where DCG@k = sum_{i=1}^{k} (2^{rel_i} - 1) / log2(i + 1)
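The nDCG@k definition above (exponential gain, log2 discount) translates directly to code. This is a per-ranking sketch with invented helper names; the backend functions additionally average nDCG over users.

```python
import math

def dcg_at_k(relevances, k):
    """DCG@k with gain (2^rel - 1) and discount log2(i + 1),
    matching the formula in the text."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """nDCG@k = DCG@k / IDCG@k, where IDCG is the DCG of the
    ideal (descending-relevance) ordering."""
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# A ranking already in ideal order scores 1.0; demoting a relevant
# item below an irrelevant one lowers the score.
print(ndcg_at_k([1, 1, 0], 3), ndcg_at_k([0, 1, 1], 3))
```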
Both Python and PySpark backends implement the same mathematical definitions, but operate on different DataFrame types. The evaluation dispatching is determined by the algorithm's execution environment (Python CPU/GPU algorithms use the Python backend; Spark algorithms use the PySpark backend).