
Principle:Recommenders team Recommenders Benchmark Metric Evaluation

From Leeroopedia


Knowledge Sources
Domains Recommender Systems, Benchmarking, Evaluation
Last Updated 2026-02-10 00:00 GMT

Overview

Unified metric evaluation for benchmarking computes standardized rating and ranking metrics across all algorithms, using both Python and Spark evaluation backends.

Description

When benchmarking recommendation algorithms, metrics must be computed consistently to ensure fair comparison. Different algorithms run on different backends (Python/pandas vs. PySpark), so the evaluation functions must also support both backends while producing equivalent results.

The Benchmark Metric Evaluation principle provides four functions that form a 2x2 matrix:

  Backend          Rating Metrics           Ranking Metrics
  Python backend   rating_metrics_python    ranking_metrics_python
  PySpark backend  rating_metrics_pyspark   ranking_metrics_pyspark

Each function:

  1. Accepts the test set and model predictions (rating or ranking).
  2. Computes a fixed set of standard metrics.
  3. Returns a dictionary with consistent string keys, enabling uniform aggregation.
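
The three-step contract above can be sketched as follows. This is an illustrative Python-backend rating evaluator, not the library's actual implementation; the column names (userID, itemID, rating, prediction) are assumptions, and the metric formulas follow the definitions given under Theoretical Basis below.

```python
import numpy as np
import pandas as pd

def rating_metrics_python(test, predictions):
    """Illustrative sketch: align test ratings with predictions, compute a
    fixed set of rating metrics, and return them under consistent string keys.
    Column names are assumed, not taken from the library."""
    # Step 1: accept test set and predictions, aligned on (user, item) pairs.
    merged = test.merge(predictions, on=["userID", "itemID"])
    y_true = merged["rating"].to_numpy(dtype=float)
    y_pred = merged["prediction"].to_numpy(dtype=float)
    err = y_true - y_pred
    # Step 2: compute the standard rating metrics.
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    # Step 3: return a dictionary with consistent string keys.
    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
        "R2": float(1 - ss_res / ss_tot),
        "Explained Variance": float(1 - np.var(err) / np.var(y_true)),
    }
```

A PySpark variant would follow the same contract while operating on Spark DataFrames, so callers see identical keys from either backend.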

Rating metrics (for algorithms that predict ratings):

  • RMSE -- Root Mean Squared Error
  • MAE -- Mean Absolute Error
  • R2 -- R-squared (coefficient of determination)
  • Explained Variance

Ranking metrics (for algorithms that produce top-K recommendations):

  • MAP -- Mean Average Precision
  • nDCG@k -- Normalized Discounted Cumulative Gain at k
  • Precision@k -- Precision at k
  • Recall@k -- Recall at k
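
As a concrete illustration of two of these ranking metrics, the sketch below computes Precision@k and Recall@k per user and averages over users. It is a minimal assumption-laden example (column names userID/itemID are hypothetical; MAP and nDCG@k are omitted for brevity), not the library's implementation.

```python
import pandas as pd

def ranking_metrics_python(test, top_k, k=10):
    """Illustrative Precision@k / Recall@k over pandas DataFrames.
    `test` holds the relevant (userID, itemID) pairs; `top_k` holds
    the recommended items per user."""
    relevant = test.groupby("userID")["itemID"].apply(set)
    recommended = top_k.groupby("userID")["itemID"].apply(list)
    precisions, recalls = [], []
    for user, rel in relevant.items():
        recs = recommended.get(user, [])[:k]
        hits = len(set(recs) & rel)          # relevant items in top-k
        precisions.append(hits / k)          # Precision@k = hits / k
        recalls.append(hits / len(rel))      # Recall@k = hits / |relevant|
    return {
        "Precision@k": sum(precisions) / len(precisions),
        "Recall@k": sum(recalls) / len(recalls),
    }
```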

The consistent dictionary keys across Python and PySpark backends allow the benchmark loop to merge results without backend-specific handling.
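
Because every backend returns the same string keys, a benchmark loop can collect results into one summary table with no backend-specific handling. A minimal sketch (the algorithm names and metric values below are illustrative placeholders, not benchmark results):

```python
import pandas as pd

# Each backend variant returns a dict with the same keys, so rows from
# Python and PySpark evaluations merge into one uniform table.
results = []
for algo, metrics in [
    ("algo_python", {"RMSE": 0.94, "MAE": 0.74}),    # placeholder values
    ("algo_pyspark", {"RMSE": 0.96, "MAE": 0.76}),   # placeholder values
]:
    results.append({"Algorithm": algo, **metrics})

summary = pd.DataFrame(results)
```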

Usage

Use this principle when evaluating algorithm outputs in a benchmark. Select the Python or PySpark variant based on whether the algorithm's predictions are pandas DataFrames or Spark DataFrames.
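
The variant selection can be sketched as a simple type check on the predictions DataFrame (a hypothetical helper, assuming pandas predictions map to the Python backend and anything else to PySpark):

```python
import pandas as pd

def pick_backend(predictions):
    """Illustrative dispatch: choose the evaluation backend by the
    DataFrame type of the algorithm's predictions."""
    if isinstance(predictions, pd.DataFrame):
        return "python"   # use rating_metrics_python / ranking_metrics_python
    return "pyspark"      # use rating_metrics_pyspark / ranking_metrics_pyspark
```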

Theoretical Basis

Rating Metrics:

RMSE = sqrt(mean((y_true - y_pred)^2))
MAE  = mean(|y_true - y_pred|)
R2   = 1 - SS_res / SS_tot
ExpVar = 1 - Var(y_true - y_pred) / Var(y_true)

Ranking Metrics:

Precision@k = |relevant items in top-k| / k
Recall@k    = |relevant items in top-k| / |all relevant items|
MAP         = mean of Average Precision over all users
nDCG@k      = DCG@k / IDCG@k
  where DCG@k = sum_{i=1}^{k} (2^{rel_i} - 1) / log2(i + 1)
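
The nDCG@k definition above (with exponential gain) can be checked numerically with a short sketch; the relevance lists below are made-up examples:

```python
import math

def dcg_at_k(rels, k):
    # DCG@k = sum_{i=1}^{k} (2^rel_i - 1) / log2(i + 1)
    return sum((2 ** r - 1) / math.log2(i + 1)
               for i, r in enumerate(rels[:k], start=1))

def ndcg_at_k(rels, k):
    # Normalize by the DCG of the ideal (descending-relevance) ordering.
    ideal = sorted(rels, reverse=True)
    return dcg_at_k(rels, k) / dcg_at_k(ideal, k)
```

A perfectly ordered list scores nDCG@k = 1.0, while placing the most relevant item last drives the score below 1.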

Both Python and PySpark backends implement the same mathematical definitions, but operate on different DataFrame types. The evaluation dispatching is determined by the algorithm's execution environment (Python CPU/GPU algorithms use the Python backend; Spark algorithms use the PySpark backend).

Related Pages

Implemented By
