Principle: Recommenders Spark Evaluation Metrics
| Knowledge Sources | |
|---|---|
| Domains | Model Evaluation, Recommendation Systems, Distributed Computing |
| Last Updated | 2026-02-10 00:00 GMT |
## Overview
Distributed evaluation of recommender systems computes rating-accuracy and ranking-quality metrics at scale using PySpark, keeping the data partitioned across the cluster rather than collected on a single machine.
## Description
Evaluating a recommendation model requires comparing predicted ratings or rankings against held-out ground truth data. In a distributed Spark environment, this evaluation must be performed without collecting all data to the driver node, as the datasets may be too large to fit in memory.
The evaluation framework divides into two complementary categories:
- Rating Metrics measure how accurately the model predicts the exact rating value. These are regression-style metrics computed on the set of (true_rating, predicted_rating) pairs:
  - RMSE (Root Mean Squared Error): Penalizes large errors more heavily; standard metric for rating prediction accuracy.
  - MAE (Mean Absolute Error): Average magnitude of prediction errors; more robust to outliers than RMSE.
  - R-squared (R^2): Proportion of rating variance explained by the model; 1.0 is perfect, 0.0 is no better than predicting the mean.
  - Explained Variance: Similar to R^2 but does not penalize systematic bias in predictions.
- Ranking Metrics measure how well the model orders items for each user, regardless of the exact predicted score. These are computed by comparing the model's top-K recommendations against each user's true relevant items:
  - Precision@K: Fraction of recommended items that are relevant.
  - Recall@K: Fraction of relevant items that are recommended.
  - NDCG@K (Normalized Discounted Cumulative Gain): Measures ranking quality with position-dependent weighting; items ranked higher receive more credit.
  - MAP (Mean Average Precision): Average of precision values computed at each relevant item's position in the ranked list.
  - MAP@K: MAP truncated to the top K positions.
Both categories rely on PySpark's RegressionMetrics and RankingMetrics classes from pyspark.mllib.evaluation for the core computations, which are executed in a distributed fashion across the cluster.
## Usage
Use evaluation metrics after generating predictions from a trained model. Rating metrics are used when the goal is to predict exact ratings (explicit feedback scenarios). Ranking metrics are used when the goal is to produce an ordered list of recommendations (both explicit and implicit feedback scenarios). In practice, both types of metrics are typically computed together to provide a comprehensive assessment of model quality.
## Theoretical Basis
### Rating Metrics
Given N test pairs {(r_1, r_hat_1), (r_2, r_hat_2), ..., (r_N, r_hat_N)}, where r_i is the true rating and r_hat_i the predicted rating:

RMSE = sqrt( (1/N) * sum_{i=1}^{N} (r_i - r_hat_i)^2 )

MAE = (1/N) * sum_{i=1}^{N} |r_i - r_hat_i|

R^2 = 1 - sum_{i=1}^{N} (r_i - r_hat_i)^2 / sum_{i=1}^{N} (r_i - r_bar)^2, where r_bar = mean of the true ratings

Explained Variance = 1 - Var(r - r_hat) / Var(r)
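The four formulas above can be checked with a plain-Python reference implementation (a single-machine sketch of the math, not the distributed Spark code path):

```python
import math

def rating_metrics(y_true, y_pred):
    """Compute RMSE, MAE, R^2, and explained variance for paired ratings."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    # Explained variance centers the errors first, so a constant bias
    # in the predictions is not penalized (unlike R^2).
    mean_err = sum(errors) / n
    var_err = sum((e - mean_err) ** 2 for e in errors) / n
    var_true = ss_tot / n
    explained_var = 1.0 - var_err / var_true
    return rmse, mae, r2, explained_var
```

For unbiased predictions (mean error zero) R^2 and explained variance coincide; they diverge exactly when the model systematically over- or under-predicts.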
### Ranking Metrics
For each user u, let:
L_u = ordered list of top-K recommended items
T_u = set of truly relevant items
Precision@K(u) = |L_u intersect T_u| / K
Recall@K(u) = |L_u intersect T_u| / |T_u|
DCG@K(u) = sum_{i=1}^{K} rel(i) / log2(i + 1)
where rel(i) = 1 if L_u[i] in T_u, else 0
NDCG@K(u) = DCG@K(u) / IDCG@K(u)
where IDCG@K(u) = best possible DCG@K for user u
AP(u) = (1/|T_u|) * sum_{i=1}^{K} Precision@i(u) * rel(i)
MAP = (1/|U|) * sum_{u in U} AP(u)
where U = set of all users
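The per-user formulas above can likewise be sketched in plain Python for a single user's top-K list (binary relevance, positions indexed from 1, matching the definitions given):

```python
import math

def precision_at_k(recs, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    return sum(1 for item in recs[:k] if item in relevant) / k

def recall_at_k(recs, relevant, k):
    """Fraction of the relevant items that appear in the top-k recommendations."""
    return sum(1 for item in recs[:k] if item in relevant) / len(relevant)

def ndcg_at_k(recs, relevant, k):
    """DCG of the top-k list divided by the ideal DCG (binary relevance)."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, item in enumerate(recs[:k], start=1) if item in relevant)
    # The ideal ordering places all relevant items first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg

def average_precision(recs, relevant, k):
    """AP truncated at k: precision accumulated at each relevant position."""
    score, hits = 0.0, 0
    for i, item in enumerate(recs[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i
    return score / len(relevant)
```

MAP is then the mean of average_precision over all users, as in the formula above.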
The relevancy method determines which items are considered "relevant" in the ground truth:
- "top_k": Items with the highest true ratings are considered relevant (default).
- "by_threshold": Items with true ratings above a threshold are considered relevant.
- "by_time_stamp": The most recent items by timestamp are considered relevant.