Principle: Recommenders Benchmark Metric Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Recommender Systems, Benchmarking, Evaluation |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Unified metric computation for benchmarking: standardized rating and ranking metrics are computed consistently across all algorithms, with both Python and Spark evaluation backends.
Description
When benchmarking recommendation algorithms, metrics must be computed consistently to ensure fair comparison. Different algorithms run on different backends (Python/pandas vs. PySpark), so the evaluation functions must also support both backends while producing equivalent results.
The Benchmark Metric Evaluation principle provides four functions that form a 2x2 matrix:
| Backend | Rating Metrics | Ranking Metrics |
|---|---|---|
| Python backend | rating_metrics_python | ranking_metrics_python |
| PySpark backend | rating_metrics_pyspark | ranking_metrics_pyspark |
Each function:
- Accepts the test set and model predictions (rating or ranking).
- Computes a fixed set of standard metrics.
- Returns a dictionary with consistent string keys, enabling uniform aggregation.
Rating metrics (for algorithms that predict ratings):
- RMSE -- Root Mean Squared Error
- MAE -- Mean Absolute Error
- R2 -- R-squared (coefficient of determination)
- Explained Variance
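The four rating metrics can be sketched directly in NumPy. This is a minimal illustration of the pattern, not the library's implementation: the function name `rating_metrics_sketch` is invented here, and the real backend functions consume DataFrames of test and prediction rows rather than raw arrays.

```python
import numpy as np

def rating_metrics_sketch(y_true, y_pred):
    """Compute the four standard rating metrics and return them
    under consistent string keys (sketch only)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    ss_res = np.sum(err ** 2)                       # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
        "R2": float(1 - ss_res / ss_tot),
        "Explained Variance": float(1 - np.var(err) / np.var(y_true)),
    }

print(rating_metrics_sketch([4, 3, 5, 1], [3.5, 3, 4, 2]))
```

Returning a plain dict with fixed string keys is what lets the benchmark loop aggregate results uniformly across backends.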
Ranking metrics (for algorithms that produce top-K recommendations):
- MAP -- Mean Average Precision
- nDCG@k -- Normalized Discounted Cumulative Gain at k
- Precision@k -- Precision at k
- Recall@k -- Recall at k
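Precision@k and Recall@k for a single user reduce to counting hits in the top-k list. The sketch below assumes binary relevance and an invented helper name; the real backend functions also average these values over all users.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and Recall@k for one user with binary relevance.

    recommended: ordered list of recommended item ids
    relevant:    set of item ids the user actually interacted with
    """
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))  # relevant items in the top-k
    return hits / k, hits / len(relevant)

# top-3 of [a, b, c, d] is [a, b, c]; only "b" is relevant
p, r = precision_recall_at_k(["a", "b", "c", "d"], {"b", "d", "e"}, k=3)
```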
The consistent dictionary keys across Python and PySpark backends allow the benchmark loop to merge results without backend-specific handling.
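The aggregation step this enables can be sketched as follows. Because every evaluator returns a dict with the same string keys, rows from either backend stack into one table; the algorithm names and metric values below are purely illustrative.

```python
import pandas as pd

# Dicts returned by the Python- and PySpark-backed evaluators for
# two different algorithms (values here are made up for illustration).
results = [
    {"Algo": "algo_py",    "RMSE": 0.94, "MAE": 0.74},
    {"Algo": "algo_spark", "RMSE": 0.97, "MAE": 0.77},
]

# Uniform keys become uniform columns -- no backend-specific handling.
table = pd.DataFrame(results)
print(table)
```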
Usage
Use this principle when evaluating algorithm outputs in a benchmark. Select the Python or PySpark variant based on whether the algorithm's predictions are pandas DataFrames or Spark DataFrames.
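Backend selection can be reduced to a type check on the prediction DataFrame. In this sketch the two backend functions are stubs standing in for the real evaluators (their bodies and the dispatch helper `evaluate_ranking` are illustrative, not the library's code):

```python
import pandas as pd

def ranking_metrics_python(test, preds, k):
    # Stub: the real function computes the metrics on pandas DataFrames.
    return {"MAP": 0.0, "nDCG@k": 0.0, "Precision@k": 0.0, "Recall@k": 0.0}

def ranking_metrics_pyspark(test, preds, k):
    # Stub: the real function computes the metrics on Spark DataFrames.
    return {"MAP": 0.0, "nDCG@k": 0.0, "Precision@k": 0.0, "Recall@k": 0.0}

def evaluate_ranking(test, preds, k=10):
    """Pick the evaluation backend from the prediction DataFrame type."""
    if isinstance(preds, pd.DataFrame):
        return ranking_metrics_python(test, preds, k)
    return ranking_metrics_pyspark(test, preds, k)
```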
Theoretical Basis
Rating Metrics:
RMSE = sqrt(mean((y_true - y_pred)^2))
MAE = mean(|y_true - y_pred|)
R2 = 1 - SS_res / SS_tot
ExpVar = 1 - Var(y_true - y_pred) / Var(y_true)
Ranking Metrics:
Precision@k = |relevant items in top-k| / k
Recall@k = |relevant items in top-k| / |all relevant items|
MAP = mean of Average Precision over all users
nDCG@k = DCG@k / IDCG@k
where DCG@k = sum_{i=1}^{k} (2^{rel_i} - 1) / log2(i + 1)
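The nDCG@k definition above (exponential gain, log2 discount) translates directly to code. This is a per-ranking sketch with invented helper names; the backend functions additionally average nDCG over users.

```python
import math

def dcg_at_k(relevances, k):
    """DCG@k with gain (2^rel - 1) and discount log2(i + 1),
    matching the formula in the text."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """nDCG@k = DCG@k / IDCG@k, where IDCG is the DCG of the
    ideal (descending-relevance) ordering."""
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# A ranking already in ideal order scores 1.0; demoting a relevant
# item below an irrelevant one lowers the score.
print(ndcg_at_k([1, 1, 0], 3), ndcg_at_k([0, 1, 1], 3))
```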
Both Python and PySpark backends implement the same mathematical definitions, but operate on different DataFrame types. The evaluation dispatching is determined by the algorithm's execution environment (Python CPU/GPU algorithms use the Python backend; Spark algorithms use the PySpark backend).