Implementation:Evidentlyai Evidently Recsys Metrics
| Knowledge Sources | |
|---|---|
| Domains | Recommender Systems, ML Monitoring, Ranking Metrics |
| Last Updated | 2026-02-14 12:00 GMT |
Overview
Implements recommender system and ranking evaluation metrics for Evidently's V2 metric framework, providing NDCG, MRR, HitRate, MAP, Recall, Precision, F-beta, diversity, novelty, serendipity, personalization, popularity bias, item/user bias, and recommendation case table metrics.
Description
The recsys module provides a comprehensive suite of metrics for evaluating recommender systems and ranking models. All metrics in this module wrap legacy V1 metric implementations via LegacyMetricCalculation and expose them through the V2 metric type system.
Ranking Quality Metrics (Top-K):
All top-k metrics inherit from TopKBase (a DataframeMetric) and use LegacyTopKCalculation for computation. Each metric returns a DataframeValue containing rank-value pairs.
| Metric Class | Description | Display Name |
|---|---|---|
| NDCG | Normalized Discounted Cumulative Gain -- measures ranking quality considering position and relevance | NDCG@k |
| MRR | Mean Reciprocal Rank -- average reciprocal rank of first relevant item | MRR@k |
| HitRate | Hit Rate -- proportion of users with at least one relevant item in top-k | HitRate@k |
| MAP | Mean Average Precision -- average precision across all users | MAP@k |
| RecallTopK | Recall -- proportion of relevant items found in top-k | Recall@k |
| PrecisionTopK | Precision -- proportion of relevant items in top-k | Precision@k |
| FBetaTopK | F-beta -- weighted harmonic mean of precision and recall; configurable beta parameter (default: 1.0) | F{beta}@k |
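The following is a minimal, library-independent sketch of what NDCG@k, MRR@k, and HitRate@k measure, using their textbook definitions with binary relevance. It is not Evidently's implementation; the function names and toy data are hypothetical, and the ideal ordering for NDCG is taken over the recommended list only.

```python
import math
from typing import Dict, List, Set


def dcg_at_k(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions (log2 discount)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: List[float], k: int) -> float:
    """DCG of the given order divided by the DCG of the best possible order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank_at_k(recommended: List[str], relevant: Set[str], k: int) -> float:
    """1 / position of the first relevant item in the top-k, or 0 if there is none."""
    for pos, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            return 1.0 / pos
    return 0.0


def hit_at_k(recommended: List[str], relevant: Set[str], k: int) -> float:
    """1 if at least one relevant item appears in the top-k, else 0."""
    return 1.0 if any(item in relevant for item in recommended[:k]) else 0.0


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


# Hypothetical toy data: ranked recommendations and relevant items per user.
recs: Dict[str, List[str]] = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"]}
relevant: Dict[str, Set[str]] = {"u1": {"b"}, "u2": {"x"}}
k = 3

mrr = mean([reciprocal_rank_at_k(recs[u], relevant[u], k) for u in recs])
hit_rate = mean([hit_at_k(recs[u], relevant[u], k) for u in recs])
ndcg = mean(
    [ndcg_at_k([1.0 if i in relevant[u] else 0.0 for i in recs[u]], k) for u in recs]
)
```

In this toy data only u1 has a relevant item in the top 3, at position 2, so MRR@3 is 0.25 and HitRate@3 is 0.5.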
TopKBase common parameters:
- k: Number of top items to consider.
- min_rel_score: Optional minimum relevance score threshold.
- no_feedback_users: Whether to include users with no feedback (default: False).
- ranking_name: Name of the ranking task definition (default: "default").
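To illustrate how parameters like these could shape a top-k computation, here is a hedged sketch of precision, recall, and F-beta at k in plain Python. The threshold and user-filtering semantics shown are assumptions made for illustration (items scoring at or above min_rel_score count as relevant; users with no relevant items are skipped unless no_feedback_users is True) and may differ from what the library does internally. All helper names are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def relevant_items(scores: Dict[str, float], min_rel_score: Optional[float]) -> Set[str]:
    """Assumed rule: without a threshold any positive score is relevant."""
    if min_rel_score is None:
        return {item for item, score in scores.items() if score > 0}
    return {item for item, score in scores.items() if score >= min_rel_score}


def precision_recall_fbeta_at_k(
    recommended: List[str], relevant: Set[str], k: int, beta: float = 1.0
) -> Tuple[float, float, float]:
    hits = sum(1 for item in recommended[:k] if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    denom = beta**2 * precision + recall
    fbeta = (1 + beta**2) * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, fbeta


def mean_scores_at_k(
    recs: Dict[str, List[str]],
    feedback: Dict[str, Dict[str, float]],
    k: int,
    min_rel_score: Optional[float] = None,
    no_feedback_users: bool = False,
    beta: float = 1.0,
) -> Tuple[float, ...]:
    """Average (precision, recall, F-beta) over the users kept by the filter."""
    rows = []
    for user, recommended in recs.items():
        relevant = relevant_items(feedback.get(user, {}), min_rel_score)
        if not relevant and not no_feedback_users:
            continue  # assumed behaviour: drop users without relevant feedback
        rows.append(precision_recall_fbeta_at_k(recommended, relevant, k, beta))
    if not rows:
        return (0.0, 0.0, 0.0)
    return tuple(sum(col) / len(rows) for col in zip(*rows))


# Hypothetical toy data: u2 has no feedback at all.
recs = {"u1": ["a", "b", "c", "d"], "u2": ["e", "f", "g", "h"]}
feedback = {"u1": {"a": 5, "c": 2, "z": 4}, "u2": {}}

skipping = mean_scores_at_k(recs, feedback, k=2, min_rel_score=3)
including = mean_scores_at_k(recs, feedback, k=2, min_rel_score=3, no_feedback_users=True)
```

Under these assumptions, including the no-feedback user halves every average, which is why the flag matters when comparing runs.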
Beyond-Accuracy Metrics:
| Metric Class | Type | Description |
|---|---|---|
| ScoreDistribution | SingleValueMetric | Entropy of the recommendation score distribution, used as a measure of how varied the scores are |
| PopularityBiasMetric | SingleValueMetric | Measures popularity bias using ARP (Average Recommendation Popularity), coverage, or Gini coefficient. Configurable via metric parameter ("arp", "coverage", "gini") |
| Personalization | SingleValueMetric | Measures how different recommendations are across users |
| Diversity | SingleValueMetric | Measures diversity of items within each user's recommendations based on item_features |
| Serendipity | SingleValueMetric | Measures how surprising yet relevant recommendations are, using item_features |
| Novelty | SingleValueMetric | Measures how novel (less popular) recommended items are |
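As a rough illustration of the three popularity-bias variants named above, the sketch below computes ARP, coverage, and the Gini coefficient from their common textbook definitions. It is a standalone approximation under stated assumptions (popularity is a training interaction count, and Gini is taken over per-item recommendation counts including never-recommended catalog items), not the library's code. All names and data are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def arp(recs: Dict[str, List[str]], train_popularity: Dict[str, int], k: int) -> float:
    """Average Recommendation Popularity: mean popularity of each user's top-k, averaged over users."""
    per_user = []
    for items in recs.values():
        top = items[:k]
        per_user.append(sum(train_popularity.get(i, 0) for i in top) / len(top))
    return sum(per_user) / len(per_user)


def coverage(recs: Dict[str, List[str]], catalog: List[str], k: int) -> float:
    """Share of catalog items that appear in at least one top-k list."""
    recommended = {i for items in recs.values() for i in items[:k]}
    return len(recommended & set(catalog)) / len(catalog)


def gini(recs: Dict[str, List[str]], catalog: List[str], k: int) -> float:
    """Gini coefficient of how often each catalog item is recommended (0 = perfectly even)."""
    counts = Counter(i for items in recs.values() for i in items[:k])
    values = sorted(counts.get(i, 0) for i in catalog)
    n, total = len(values), sum(values)
    if total == 0:
        return 0.0
    weighted = sum((2 * idx - n - 1) * v for idx, v in enumerate(values, start=1))
    return weighted / (n * total)


# Hypothetical toy data.
catalog = ["a", "b", "c", "d"]
train_popularity = {"a": 10, "b": 5, "c": 1, "d": 0}
recs = {"u1": ["a", "b"], "u2": ["a", "c"]}

arp_value = arp(recs, train_popularity, k=2)
coverage_value = coverage(recs, catalog, k=2)
gini_value = gini(recs, catalog, k=2)
```

Here item "a" is recommended to both users and "d" to nobody, giving an ARP of 6.5, coverage of 0.75, and a Gini coefficient of 0.375.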
Bias Analysis Metrics:
| Metric Class | Type | Description |
|---|---|---|
| ItemBias | Metric (DataframeValue) | Measures bias in recommendations toward specific item groups. Requires column_name for group column. Supports "default" or "train" distribution |
| UserBias | Metric (DataframeValue) | Measures bias toward specific user groups. Requires column_name for group column. Supports "default" or "train" distribution |
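One way to reason about group bias is to compare how often each group appears in training interactions with how often it appears among recommended items. The sketch below does this for a categorical item group. It is an illustration only: the data, the helper name, and the choice to compare against training data are assumptions and do not describe how ItemBias or UserBias compute their output.

```python
from collections import Counter
from typing import Dict, List


def group_shares(items: List[str], item_group: Dict[str, str]) -> Dict[str, float]:
    """Fraction of occurrences that fall into each group."""
    counts = Counter(item_group[i] for i in items)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


# Hypothetical toy data: item -> group, plus flattened item occurrences.
item_group = {"a": "news", "b": "news", "c": "sports", "d": "sports"}
train_items = ["a", "c", "c", "d"]          # items seen in training interactions
recommended_items = ["a", "b", "a", "c"]    # items across all users' top-k lists

train_shares = group_shares(train_items, item_group)
rec_shares = group_shares(recommended_items, item_group)

# Positive shift: the group is over-represented in recommendations.
shift = {g: rec_shares.get(g, 0.0) - train_shares.get(g, 0.0) for g in train_shares}
```

In this example "news" makes up a quarter of training interactions but three quarters of recommendations, a shift of +0.5.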
Inspection Metrics:
| Metric Class | Type | Description |
|---|---|---|
| RecCasesTable | DataframeMetric | Displays detailed recommendation cases for specific users. Optional user_ids and display_features parameters |
Helper Function:
- _gen_ranking_input_data(context, task_name) -- Generates InputData with ranking-specific column mappings (user_id, item_id, prediction, target, recommendations_type) from the data definition's ranking task configuration.
Default Tests:
All SingleValueMetric subclasses in this module define _default_tests_with_reference() returning eq(Reference(relative=0.1)), which tests that the current value is within 10% of the reference value.
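The "within 10% of the reference" condition can be pictured as a relative-tolerance comparison. The function below is a hypothetical stand-in written for illustration, assuming the tolerance is measured relative to the reference value. It is not the library's eq or Reference implementation.

```python
def within_relative(current: float, reference: float, relative: float = 0.1) -> bool:
    """True if current deviates from reference by at most relative * |reference|."""
    return abs(current - reference) <= relative * abs(reference)


# Hypothetical metric values from a current run and a reference run.
small_drift = within_relative(current=0.42, reference=0.40)   # 5% away from reference
large_drift = within_relative(current=0.50, reference=0.40)   # 25% away from reference
```

Under this reading, a metric that moves from 0.40 to 0.42 passes, while a move to 0.50 fails.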
Usage
Use this module when:
- Evaluating recommender system or ranking model quality.
- Monitoring recommendation diversity, novelty, serendipity, and personalization.
- Analyzing popularity and group bias in recommendations.
- Inspecting individual recommendation cases for debugging.
Code Reference
Source Location
- Repository: Evidentlyai_Evidently
- File: src/evidently/metrics/recsys.py
Signature
class TopKBase(DataframeMetric):
    k: int
    min_rel_score: Optional[int] = None
    no_feedback_users: bool = False
    ranking_name: str = "default"

class NDCG(TopKBase): ...
class MRR(TopKBase): ...
class HitRate(TopKBase): ...
class MAP(TopKBase): ...
class RecallTopK(TopKBase): ...
class PrecisionTopK(TopKBase): ...

class FBetaTopK(TopKBase):
    beta: Optional[float] = 1.0

class ScoreDistribution(SingleValueMetric):
    k: int
    ranking_name: str = "default"

class PopularityBiasMetric(SingleValueMetric):
    k: int
    normalize_arp: bool = False
    ranking_name: str = "default"
    metric: Literal["arp", "coverage", "gini"] = "arp"

class Personalization(SingleValueMetric):
    k: int
    ranking_name: str = "default"

class Diversity(SingleValueMetric):
    k: int
    item_features: List[str]
    ranking_name: str = "default"

class Serendipity(SingleValueMetric):
    k: int
    item_features: List[str]
    ranking_name: str = "default"

class Novelty(SingleValueMetric):
    k: int
    ranking_name: str = "default"

class ItemBias(Metric):
    k: int
    column_name: str
    distribution: Literal["default", "train"] = "default"
    ranking_name: str = "default"

class UserBias(Metric):
    column_name: str
    distribution: Literal["default", "train"] = "default"
    ranking_name: str = "default"

class RecCasesTable(DataframeMetric):
    user_ids: Optional[List[Union[int, str]]] = None
    display_features: Optional[List[str]] = None
    ranking_name: str = "default"
Import
from evidently.metrics.recsys import (
    NDCG,
    MRR,
    HitRate,
    MAP,
    RecallTopK,
    PrecisionTopK,
    FBetaTopK,
    ScoreDistribution,
    PopularityBiasMetric,
    Personalization,
    Diversity,
    Serendipity,
    Novelty,
    ItemBias,
    UserBias,
    RecCasesTable,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| k | int | Yes (most metrics) | Number of top items to consider in the ranking |
| ranking_name | str | No | Name of the ranking task in the data definition (default: "default") |
| min_rel_score | Optional[int] | No | Minimum relevance score threshold for considering items as relevant |
| no_feedback_users | bool | No | Whether to include users with no feedback (default: False) |
| beta | Optional[float] | No (FBetaTopK) | Beta parameter for F-beta score (default: 1.0) |
| item_features | List[str] | Yes (Diversity, Serendipity) | Feature columns for diversity/serendipity calculation |
| column_name | str | Yes (ItemBias, UserBias) | Column containing group/category information |
| distribution | Literal["default", "train"] | No | Distribution source for bias metrics (default: "default") |
| metric | Literal["arp", "coverage", "gini"] | No (PopularityBiasMetric) | Popularity bias metric type (default: "arp") |
| user_ids | Optional[List[Union[int, str]]] | No | Specific user IDs for RecCasesTable |
| display_features | Optional[List[str]] | No | Feature columns to display in RecCasesTable |
Outputs
| Name | Type | Description |
|---|---|---|
| Top-K metrics | DataframeValue | DataFrame with columns "rank" (1-based) and "value" for each rank position |
| SingleValue metrics | SingleValue | Single numeric value (entropy, ARP, coverage, Gini, personalization, diversity, serendipity, novelty) |
| Bias metrics | DataframeValue | DataFrame with columns "x" (bin centers) and "y" (counts) representing distribution |
| RecCasesTable | DataframeValue | DataFrame with recommendation details per user including user_id, item_id, prediction scores, and display features |
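To make the rank/value layout of the top-k output concrete, the sketch below builds such rows for precision at each rank from 1 to k for a single user. The row format mirrors the "rank" and "value" columns described above, but the function and data are hypothetical and the code does not use or reproduce the library's DataframeValue type.

```python
from typing import Dict, List, Set


def precision_by_rank(recommended: List[str], relevant: Set[str], k: int) -> List[Dict[str, float]]:
    """One row per rank position (1-based) holding precision at that rank."""
    rows = []
    hits = 0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
        rows.append({"rank": rank, "value": hits / rank})
    return rows


# Hypothetical toy data for a single user.
rows = precision_by_rank(["a", "b", "c", "d"], relevant={"a", "c"}, k=4)
for row in rows:
    print(row["rank"], round(row["value"], 3))
```

Each row can be read as the metric evaluated at that cutoff, so the last row corresponds to the value at the configured k.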
Usage Examples
Basic Ranking Evaluation
from evidently.core.report import Report
from evidently.metrics.recsys import NDCG, MRR, HitRate, MAP
report = Report([
    NDCG(k=10),
    MRR(k=10),
    HitRate(k=10),
    MAP(k=10),
])
snapshot = report.run(current_dataset, reference_dataset)
Beyond-Accuracy Metrics
from evidently.core.report import Report
from evidently.metrics.recsys import (
    Diversity, Novelty, Serendipity, Personalization, ScoreDistribution
)
report = Report([
    Diversity(k=10, item_features=["genre", "category"]),
    Novelty(k=10),
    Serendipity(k=10, item_features=["genre", "category"]),
    Personalization(k=10),
    ScoreDistribution(k=10),
])
snapshot = report.run(current_dataset, reference_dataset)
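For intuition about the kind of quantity Novelty and Personalization report, here is a small standalone sketch using common textbook formulations: novelty as the mean of -log2 of an item's share of training users, and personalization as one minus the mean pairwise overlap between users' top-k lists. Jaccard overlap is used for simplicity where cosine similarity is often used instead. Everything here is an assumption for illustration and is not taken from the library's code.

```python
import math
from itertools import combinations
from typing import Dict, List


def novelty(recs: Dict[str, List[str]], item_user_share: Dict[str, float], k: int) -> float:
    """Mean over users of the average -log2(share of training users who saw the item)."""
    per_user = []
    for items in recs.values():
        top = items[:k]
        per_user.append(sum(-math.log2(item_user_share[i]) for i in top) / len(top))
    return sum(per_user) / len(per_user)


def personalization(recs: Dict[str, List[str]], k: int) -> float:
    """1 - mean pairwise Jaccard overlap between users' top-k lists."""
    overlaps = []
    for u, v in combinations(recs, 2):
        a, b = set(recs[u][:k]), set(recs[v][:k])
        overlaps.append(len(a & b) / len(a | b))
    return 1.0 - sum(overlaps) / len(overlaps)


# Hypothetical toy data: rarer items have a smaller share of training users.
item_user_share = {"a": 0.5, "b": 0.25, "c": 0.25, "d": 0.125, "e": 0.125}
recs = {"u1": ["a", "b"], "u2": ["a", "c"], "u3": ["d", "e"]}

novelty_value = novelty(recs, item_user_share, k=2)
personalization_value = personalization(recs, k=2)
```

User u3 receives only rare items and shares nothing with the other users, which raises both values in this toy example.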
Popularity and Bias Analysis
from evidently.core.report import Report
from evidently.metrics.recsys import PopularityBiasMetric, ItemBias, UserBias
report = Report([
    PopularityBiasMetric(k=10, metric="gini"),
    ItemBias(k=10, column_name="category"),
    UserBias(column_name="age_group"),
])
snapshot = report.run(current_dataset, reference_dataset)
Recommendation Cases Inspection
from evidently.core.report import Report
from evidently.metrics.recsys import RecCasesTable
report = Report([
    RecCasesTable(
        user_ids=["user_001", "user_002", "user_003"],
        display_features=["title", "genre", "rating"],
    ),
])
snapshot = report.run(current_dataset, None)
Full Recsys Report with Custom Ranking Task
from evidently.core.datasets import DataDefinition, Dataset, Recsys
from evidently.core.report import Report
from evidently.metrics.recsys import NDCG, MRR, FBetaTopK, PrecisionTopK, RecallTopK
dataset = Dataset.from_pandas(
    df,
    data_definition=DataDefinition(
        numerical_columns=["target", "prediction"],
        ranking=[Recsys(name="my_ranking")]
    )
)
report = Report([
    NDCG(k=5, ranking_name="my_ranking"),
    MRR(k=5, ranking_name="my_ranking"),
    FBetaTopK(k=5, beta=0.5, ranking_name="my_ranking"),
    PrecisionTopK(k=5, ranking_name="my_ranking"),
    RecallTopK(k=5, ranking_name="my_ranking"),
])
snapshot = report.run(dataset, None)
Related Pages
- Environment:Evidentlyai_Evidently_Python_Core_Environment
- Implementation:Evidentlyai_Evidently_Metric_Types -- Provides the base classes (SingleValueMetric, DataframeMetric, Metric, SingleValueCalculation, DataframeValue) used by all recsys metrics
- Implementation:Evidentlyai_Evidently_Pydantic_Utils -- Provides EvidentlyBaseModel and fingerprinting infrastructure used by metric configurations