# Recommenders Python Evaluation Metrics
| Knowledge Sources | |
|---|---|
| Domains | Recommender Systems, Evaluation Metrics, Information Retrieval |
| Last Updated | 2026-02-10 00:00 GMT |
## Overview

Concrete tools for computing rating and ranking evaluation metrics for recommender systems, provided by the `recommenders` library.

## Description

The `recommenders.evaluation.python_evaluation` module provides a suite of evaluation functions for measuring recommender system performance. The module includes:
- Rating metrics: `rmse` and `mae` for measuring prediction accuracy against ground truth ratings.
- Ranking metrics: `precision_at_k`, `recall_at_k`, `ndcg_at_k`, and `map` for measuring the quality of top-K recommendation lists.
All functions accept two DataFrames (ground truth and predictions) and return a single float score. They share a common interface for column name configuration and handle the merging of true and predicted data internally.
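As a rough illustration of that merge-then-score behavior (a sketch of the idea, not the library's actual code), a hand-rolled RMSE aligns the two DataFrames on the user and item columns before computing the error. The `userID`/`itemID`/`rating`/`prediction` column names below are chosen for illustration:

```python
import numpy as np
import pandas as pd

# Toy ground truth and predictions; column names are illustrative
true_df = pd.DataFrame(
    {"userID": [1, 1, 2], "itemID": [10, 11, 10], "rating": [5.0, 3.0, 4.0]}
)
pred_df = pd.DataFrame(
    {"userID": [1, 1, 2], "itemID": [10, 11, 10], "prediction": [4.5, 3.5, 4.0]}
)

# Align true and predicted values on (user, item) pairs, then score them
merged = true_df.merge(pred_df, on=["userID", "itemID"])
rmse_value = float(np.sqrt(np.mean((merged["rating"] - merged["prediction"]) ** 2)))
```

The library's functions perform this alignment internally, which is why both DataFrames must be free of duplicate (user, item) pairs.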
## Usage

Import these functions at the evaluation stage of a recommender system pipeline, after generating predictions or recommendation lists. Rating metrics are used with rating prediction outputs; ranking metrics are used with top-K recommendation outputs.

## Code Reference

### Source Location
- Repository: recommenders
- File: `recommenders/evaluation/python_evaluation.py`
- Lines:
  - `rmse`: L165-L195
  - `mae`: L198-L228
  - `precision_at_k`: L448-L496
  - `recall_at_k`: L499-L541
  - `ndcg_at_k`: L601-L696
  - `map`: L734-L785
### Signature

```python
def rmse(
    rating_true, rating_pred,
    col_user=DEFAULT_USER_COL, col_item=DEFAULT_ITEM_COL,
    col_rating=DEFAULT_RATING_COL, col_prediction=DEFAULT_PREDICTION_COL,
) -> float

def mae(
    rating_true, rating_pred,
    col_user=DEFAULT_USER_COL, col_item=DEFAULT_ITEM_COL,
    col_rating=DEFAULT_RATING_COL, col_prediction=DEFAULT_PREDICTION_COL,
) -> float

def precision_at_k(
    rating_true, rating_pred,
    col_user=DEFAULT_USER_COL, col_item=DEFAULT_ITEM_COL,
    col_prediction=DEFAULT_PREDICTION_COL,
    relevancy_method="top_k", k=DEFAULT_K, threshold=DEFAULT_THRESHOLD,
) -> float

def recall_at_k(
    rating_true, rating_pred,
    col_user=DEFAULT_USER_COL, col_item=DEFAULT_ITEM_COL,
    col_prediction=DEFAULT_PREDICTION_COL,
    relevancy_method="top_k", k=DEFAULT_K, threshold=DEFAULT_THRESHOLD,
) -> float

def ndcg_at_k(
    rating_true, rating_pred,
    col_user=DEFAULT_USER_COL, col_item=DEFAULT_ITEM_COL,
    col_rating=DEFAULT_RATING_COL, col_prediction=DEFAULT_PREDICTION_COL,
    relevancy_method="top_k", k=DEFAULT_K, threshold=DEFAULT_THRESHOLD,
    score_type="binary", discfun_type="loge",
) -> float

def map(
    rating_true, rating_pred,
    col_user=DEFAULT_USER_COL, col_item=DEFAULT_ITEM_COL,
    col_prediction=DEFAULT_PREDICTION_COL,
    relevancy_method="top_k", k=DEFAULT_K, threshold=DEFAULT_THRESHOLD,
) -> float
```
### Import

```python
from recommenders.evaluation.python_evaluation import (
    rmse,
    mae,
    precision_at_k,
    recall_at_k,
    ndcg_at_k,
    map,
)
```
## I/O Contract

### Inputs (Rating Metrics: `rmse`, `mae`)
| Name | Type | Required | Description |
|---|---|---|---|
| rating_true | pd.DataFrame | Yes | Ground truth DataFrame with user-item-rating columns. Must have no duplicate (user, item) pairs. |
| rating_pred | pd.DataFrame | Yes | Predicted DataFrame with user-item-prediction columns. Must have no duplicate (user, item) pairs. |
| col_user | str | No (default: DEFAULT_USER_COL) | Column name for user IDs. |
| col_item | str | No (default: DEFAULT_ITEM_COL) | Column name for item IDs. |
| col_rating | str | No (default: DEFAULT_RATING_COL) | Column name for true rating values. |
| col_prediction | str | No (default: DEFAULT_PREDICTION_COL) | Column name for predicted rating values. |
### Inputs (Ranking Metrics: `precision_at_k`, `recall_at_k`, `ndcg_at_k`, `map`)
| Name | Type | Required | Description |
|---|---|---|---|
| rating_true | pd.DataFrame | Yes | Ground truth DataFrame with user-item columns representing relevant items. |
| rating_pred | pd.DataFrame | Yes | Predicted DataFrame with user-item-prediction columns representing the recommendation list. |
| col_user | str | No (default: DEFAULT_USER_COL) | Column name for user IDs. |
| col_item | str | No (default: DEFAULT_ITEM_COL) | Column name for item IDs. |
| col_prediction | str | No (default: DEFAULT_PREDICTION_COL) | Column name for predicted scores. |
| relevancy_method | str | No (default: "top_k") | Method for determining relevancy: "top_k", "by_threshold", or None. |
| k | int | No (default: DEFAULT_K) | Number of top items per user for evaluation. |
| threshold | float | No (default: DEFAULT_THRESHOLD) | Threshold for relevancy when using "by_threshold" method. |
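For intuition, `relevancy_method="top_k"` amounts to keeping each user's k highest-scored items before comparing against the ground truth. A pandas sketch of that idea (not the library's implementation; column names are illustrative):

```python
import pandas as pd

# Illustrative prediction scores for two users over three items
pred_df = pd.DataFrame({
    "userID": [1, 1, 1, 2, 2, 2],
    "itemID": [10, 11, 12, 10, 11, 12],
    "prediction": [0.9, 0.4, 0.7, 0.2, 0.8, 0.6],
})

k = 2
# Rank each user's items by predicted score and keep the k best
top_k = (
    pred_df.sort_values(["userID", "prediction"], ascending=[True, False])
    .groupby("userID")
    .head(k)
)
```

With `relevancy_method="by_threshold"`, items with a predicted score above `threshold` would be kept instead.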
### Additional Inputs (`ndcg_at_k` only)
| Name | Type | Required | Description |
|---|---|---|---|
| col_rating | str | No (default: DEFAULT_RATING_COL) | Column name for true rating values, used for graded relevance. |
| score_type | str | No (default: "binary") | Type of relevance scoring: "binary" (hit/miss), "raw" (use rating directly), or "exp" (2^rating - 1). |
| discfun_type | str | No (default: "loge") | Discount function: "loge" (natural log) or "log2" (base-2 log). |
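To make `score_type` and `discfun_type` concrete, here is how the gain and discount terms of DCG are typically formed under these options (a numpy sketch of the standard definitions, not the library's code):

```python
import numpy as np

rating = 3.0  # true rating of an item appearing at 1-based rank `rank`
rank = 2

# Gain term, by score_type
gain_binary = 1.0               # "binary": any hit counts as 1
gain_raw = rating               # "raw": use the rating directly
gain_exp = 2.0 ** rating - 1.0  # "exp": 2^rating - 1

# Discount term, by discfun_type
disc_loge = 1.0 / np.log(rank + 1)   # "loge": natural logarithm
disc_log2 = 1.0 / np.log2(rank + 1)  # "log2": base-2 logarithm
```

DCG sums gain x discount over the ranked list; NDCG divides by the ideal ordering's DCG, which is what bounds the result to [0, 1].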
### Outputs
| Name | Type | Description |
|---|---|---|
| return (rmse) | float | Root Mean Squared Error. Range: [0, +inf). Lower is better. |
| return (mae) | float | Mean Absolute Error. Range: [0, +inf). Lower is better. |
| return (precision_at_k) | float | Precision at K. Range: [0, 1]. Higher is better. |
| return (recall_at_k) | float | Recall at K. Range: [0, 1]. Higher is better. |
| return (ndcg_at_k) | float | Normalized Discounted Cumulative Gain at K. Range: [0, 1]. Higher is better. |
| return (map) | float | Mean Average Precision at K. Range: [0, 1]. Higher is better. |
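As a small worked example of the [0, 1] ranges above (plain Python arithmetic, independent of the library): suppose a user has 3 relevant items in the ground truth and the top-4 recommendation list contains 2 of them.

```python
k = 4
num_relevant = 3  # relevant items in this user's ground truth
hits = 2          # of those, how many appear in the top-k list

precision_value = hits / k          # fraction of the list that is relevant
recall_value = hits / num_relevant  # fraction of relevant items retrieved
```

The library computes these per user and averages across users, so the returned float is a mean over the user population.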
## Usage Examples

### Basic Usage

```python
from recommenders.evaluation.python_evaluation import (
    rmse,
    mae,
    precision_at_k,
    recall_at_k,
    ndcg_at_k,
    map,  # note: shadows Python's built-in map() in this scope
)

# Rating metrics (using rating predictions)
eval_rmse = rmse(test_df, pred_df, col_prediction="prediction")
eval_mae = mae(test_df, pred_df, col_prediction="prediction")
print(f"RMSE: {eval_rmse:.4f}")
print(f"MAE: {eval_mae:.4f}")

# Ranking metrics (using top-K recommendation lists)
eval_precision = precision_at_k(test_df, top_k_df, col_prediction="prediction", k=10)
eval_recall = recall_at_k(test_df, top_k_df, col_prediction="prediction", k=10)
eval_ndcg = ndcg_at_k(test_df, top_k_df, col_prediction="prediction", k=10)
eval_map = map(test_df, top_k_df, col_prediction="prediction", k=10)
print(f"Precision@10: {eval_precision:.4f}")
print(f"Recall@10: {eval_recall:.4f}")
print(f"NDCG@10: {eval_ndcg:.4f}")
print(f"MAP@10: {eval_map:.4f}")
```
## Dependencies

- `numpy` - numerical computation (sqrt, mean operations)
- `pandas` - DataFrame merging and groupby operations
- `sklearn.metrics` - `mean_squared_error` and `mean_absolute_error` base implementations
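For reference, the sklearn calls the rating metrics build on look roughly like this (a minimal sketch; wrapping `mean_squared_error` in `np.sqrt` sidesteps the version-dependent `squared=False` / `root_mean_squared_error` API differences):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [5.0, 3.0, 4.0]
y_pred = [4.5, 3.5, 4.0]

# RMSE: square root of the mean squared error over aligned pairs
rmse_value = float(np.sqrt(mean_squared_error(y_true, y_pred)))
# MAE: mean absolute error over the same pairs
mae_value = float(mean_absolute_error(y_true, y_pred))
```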