Principle: Recommenders Spark Evaluation Metrics
| Knowledge Sources | |
|---|---|
| Domains | Model Evaluation, Recommendation Systems, Distributed Computing |
| Last Updated | 2026-02-10 00:00 GMT |
## Overview
Distributed evaluation of recommender systems computes rating-accuracy and ranking-quality metrics at scale using PySpark, keeping the data partitioned across the cluster rather than collected on a single machine.
## Description
Evaluating a recommendation model requires comparing predicted ratings or rankings against held-out ground truth data. In a distributed Spark environment, this evaluation must be performed without collecting all data to the driver node, as the datasets may be too large to fit in memory.
The evaluation framework divides into two complementary categories:
- Rating Metrics measure how accurately the model predicts the exact rating value. These are regression-style metrics computed on the set of (true_rating, predicted_rating) pairs:
  - RMSE (Root Mean Squared Error): Penalizes large errors more heavily; standard metric for rating prediction accuracy.
  - MAE (Mean Absolute Error): Average magnitude of prediction errors; more robust to outliers than RMSE.
  - R-squared (R^2): Proportion of rating variance explained by the model; 1.0 is perfect, 0.0 is no better than predicting the mean.
  - Explained Variance: Similar to R^2 but does not penalize systematic bias in predictions.
- Ranking Metrics measure how well the model orders items for each user, regardless of the exact predicted score. These are computed by comparing the model's top-K recommendations against each user's true relevant items:
  - Precision@K: Fraction of recommended items that are relevant.
  - Recall@K: Fraction of relevant items that are recommended.
  - NDCG@K (Normalized Discounted Cumulative Gain): Measures ranking quality with position-dependent weighting; items ranked higher receive more credit.
  - MAP (Mean Average Precision): Average of precision values computed at each relevant item's position in the ranked list.
  - MAP@K: MAP truncated to the top K positions.
Both categories rely on PySpark's RegressionMetrics and RankingMetrics classes from pyspark.mllib.evaluation for the core computations, which are executed in a distributed fashion across the cluster.
## Usage
Use evaluation metrics after generating predictions from a trained model. Rating metrics are used when the goal is to predict exact ratings (explicit feedback scenarios). Ranking metrics are used when the goal is to produce an ordered list of recommendations (both explicit and implicit feedback scenarios). In practice, both types of metrics are typically computed together to provide a comprehensive assessment of model quality.
## Theoretical Basis
### Rating Metrics
Given N test pairs {(r_1, r_hat_1), (r_2, r_hat_2), ..., (r_N, r_hat_N)}, where r_i is the true rating and r_hat_i the predicted rating:

RMSE = sqrt( (1/N) * sum_{i=1}^{N} (r_i - r_hat_i)^2 )

MAE = (1/N) * sum_{i=1}^{N} |r_i - r_hat_i|

R^2 = 1 - sum_{i=1}^{N} (r_i - r_hat_i)^2 / sum_{i=1}^{N} (r_i - r_bar)^2, where r_bar = mean of the true ratings

Explained Variance = 1 - Var(r - r_hat) / Var(r)
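The four formulas above can be checked with a plain-Python reference implementation (a single-machine sketch of the math, not the distributed Spark code path):

```python
import math

def rating_metrics(y_true, y_pred):
    """Compute RMSE, MAE, R^2, and explained variance for paired ratings."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    # Explained variance centers the errors first, so a constant bias
    # in the predictions is not penalized (unlike R^2).
    mean_err = sum(errors) / n
    var_err = sum((e - mean_err) ** 2 for e in errors) / n
    var_true = ss_tot / n
    explained_var = 1.0 - var_err / var_true
    return rmse, mae, r2, explained_var
```

For unbiased predictions (mean error zero) R^2 and explained variance coincide; they diverge exactly when the model systematically over- or under-predicts.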
### Ranking Metrics
For each user u, let:
L_u = ordered list of top-K recommended items
T_u = set of truly relevant items
Precision@K(u) = |L_u intersect T_u| / K
Recall@K(u) = |L_u intersect T_u| / |T_u|
DCG@K(u) = sum_{i=1}^{K} rel(i) / log2(i + 1)
where rel(i) = 1 if L_u[i] in T_u, else 0
NDCG@K(u) = DCG@K(u) / IDCG@K(u)
where IDCG@K(u) = best possible DCG@K for user u
AP(u) = (1/|T_u|) * sum_{i=1}^{K} Precision@i(u) * rel(i)
MAP = (1/|U|) * sum_{u in U} AP(u)
where U = set of all users
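The per-user formulas above can likewise be sketched in plain Python for a single user's top-K list (binary relevance, positions indexed from 1, matching the definitions given):

```python
import math

def precision_at_k(recs, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    return sum(1 for item in recs[:k] if item in relevant) / k

def recall_at_k(recs, relevant, k):
    """Fraction of the relevant items that appear in the top-k recommendations."""
    return sum(1 for item in recs[:k] if item in relevant) / len(relevant)

def ndcg_at_k(recs, relevant, k):
    """DCG of the top-k list divided by the ideal DCG (binary relevance)."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, item in enumerate(recs[:k], start=1) if item in relevant)
    # The ideal ordering places all relevant items first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg

def average_precision(recs, relevant, k):
    """AP truncated at k: precision accumulated at each relevant position."""
    score, hits = 0.0, 0
    for i, item in enumerate(recs[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i
    return score / len(relevant)
```

MAP is then the mean of average_precision over all users, as in the formula above.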
The relevancy method determines which items are considered "relevant" in the ground truth:
- "top_k": Items with the highest true ratings are considered relevant (default).
- "by_threshold": Items with true ratings above a threshold are considered relevant.
- "by_time_stamp": The most recent items by timestamp are considered relevant.