Implementation:Evidentlyai Evidently Recsys Metrics

From Leeroopedia
Revision as of 12:29, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Evidentlyai_Evidently_Recsys_Metrics.md)
Knowledge Sources
Domains Recommender Systems, ML Monitoring, Ranking Metrics
Last Updated 2026-02-14 12:00 GMT

Overview

Implements recommender system and ranking evaluation metrics for Evidently's V2 metric framework, providing NDCG, MRR, HitRate, MAP, Recall, Precision, F-beta, diversity, novelty, serendipity, personalization, popularity bias, item/user bias, and recommendation case table metrics.

Description

The recsys module provides a comprehensive suite of metrics for evaluating recommender systems and ranking models. All metrics in this module wrap legacy V1 metric implementations via LegacyMetricCalculation and expose them through the V2 metric type system.

Ranking Quality Metrics (Top-K):

All top-k metrics inherit from TopKBase (a DataframeMetric) and use LegacyTopKCalculation for computation. Each metric returns a DataframeValue containing rank-value pairs.

| Metric Class | Description | Display Name |
| --- | --- | --- |
| NDCG | Normalized Discounted Cumulative Gain: ranking quality that accounts for both position and relevance | NDCG@k |
| MRR | Mean Reciprocal Rank: average over users of the reciprocal rank of the first relevant item | MRR@k |
| HitRate | Hit Rate: proportion of users with at least one relevant item in the top-k | HitRate@k |
| MAP | Mean Average Precision: mean over users of the per-user average precision | MAP@k |
| RecallTopK | Recall: proportion of all relevant items that appear in the top-k | Recall@k |
| PrecisionTopK | Precision: proportion of the top-k items that are relevant | Precision@k |
| FBetaTopK | F-beta: weighted harmonic mean of precision and recall; configurable beta parameter (default: 1.0) | F{beta}@k |
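To make the definitions in the table concrete, here is a minimal sketch of the per-user quantities behind Precision@k, Recall@k, F-beta@k, MRR@k, and HitRate@k. It follows the textbook definitions and is not Evidently's implementation; the function names and the convention of dividing precision by k are assumptions made for illustration.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the top-k slots filled with relevant items (divides by k)."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of all relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def fbeta_at_k(ranked: Sequence[str], relevant: Set[str], k: int, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision@k and recall@k."""
    p = precision_at_k(ranked, relevant, k)
    r = recall_at_k(ranked, relevant, k)
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


def reciprocal_rank_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1 / position of the first relevant item in the top-k, or 0 if none."""
    for position, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant item is in the top-k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


# Hypothetical user: five ranked recommendations, three relevant items
# (one of which, "z", was never recommended).
ranked = ["a", "b", "c", "d", "e"]
relevant = {"b", "e", "z"}

p3 = precision_at_k(ranked, relevant, 3)        # one hit in three slots
r3 = recall_at_k(ranked, relevant, 3)           # one of three relevant items
f3 = fbeta_at_k(ranked, relevant, 3)            # beta = 1.0
rr3 = reciprocal_rank_at_k(ranked, relevant, 3)  # first hit at position 2
h3 = hit_at_k(ranked, relevant, 3)
```

MRR@k and HitRate@k are then the averages of the last two quantities over users.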

TopKBase common parameters:

  • k: Number of top-ranked items to evaluate (required).
  • min_rel_score: Optional minimum target score for an item to count as relevant (default: None).
  • no_feedback_users: Whether to include users with no feedback (default: False).
  • ranking_name: Name of the ranking task definition (default: "default").
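The effect of min_rel_score and no_feedback_users can be sketched on a hit-rate calculation. The interpretation below is an assumption for illustration, not a statement of how Evidently computes it: an item counts as relevant when its target score is at least min_rel_score (or greater than zero when no threshold is given), and users with no relevant items at all are dropped from the average unless no_feedback_users is True. The function and the sample data are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Each user maps to a ranked list of (item_id, target_score) pairs.
Interactions = Dict[str, List[Tuple[str, float]]]


def hit_rate_at_k(
    interactions: Interactions,
    k: int,
    min_rel_score: Optional[float] = None,
    no_feedback_users: bool = False,
) -> float:
    def is_relevant(score: float) -> bool:
        # Assumed rule: threshold if given, otherwise any positive target.
        if min_rel_score is None:
            return score > 0
        return score >= min_rel_score

    per_user = []
    for ranked in interactions.values():
        has_feedback = any(is_relevant(score) for _, score in ranked)
        if not has_feedback and not no_feedback_users:
            continue  # skip users without any relevant item
        hit = any(is_relevant(score) for _, score in ranked[:k])
        per_user.append(1.0 if hit else 0.0)
    return sum(per_user) / len(per_user) if per_user else 0.0


sample = {
    "u1": [("a", 5), ("b", 0), ("c", 2)],
    "u2": [("d", 1), ("e", 4), ("f", 0)],
    "u3": [("g", 0), ("h", 0), ("i", 0)],  # no feedback at all
}

strict = hit_rate_at_k(sample, k=1, min_rel_score=3)
strict_all_users = hit_rate_at_k(sample, k=1, min_rel_score=3, no_feedback_users=True)
lenient = hit_rate_at_k(sample, k=1)
```

With the threshold at 3, only u1 has a relevant item in first position, so the value changes depending on whether u3 is counted in the denominator.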

Beyond-Accuracy Metrics:

| Metric Class | Type | Description |
| --- | --- | --- |
| ScoreDistribution | SingleValueMetric | Entropy of the recommendation score distribution, used as a measure of how diverse the scores are |
| PopularityBiasMetric | SingleValueMetric | Measures popularity bias using ARP (Average Recommendation Popularity), coverage, or Gini coefficient. Configurable via metric parameter ("arp", "coverage", "gini") |
| Personalization | SingleValueMetric | Measures how different recommendations are across users |
| Diversity | SingleValueMetric | Measures diversity of items within each user's recommendations based on item_features |
| Serendipity | SingleValueMetric | Measures how surprising yet relevant recommendations are, using item_features |
| Novelty | SingleValueMetric | Measures how novel (less popular) recommended items are |
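The three PopularityBiasMetric modes can be illustrated with a small stand-alone sketch. It uses common textbook definitions (mean training popularity of recommended items, share of the catalog that gets recommended, and the Gini coefficient of per-item recommendation counts); the exact formulas and normalization Evidently applies may differ, and all names and data below are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def top_k_counts(recs: Dict[str, List[str]], k: int) -> Counter:
    """How many times each item appears in users' top-k lists."""
    counts: Counter = Counter()
    for items in recs.values():
        counts.update(items[:k])
    return counts


def arp(recs: Dict[str, List[str]], train_popularity: Dict[str, int], k: int) -> float:
    """Average over users of the mean training popularity of their top-k items."""
    per_user = []
    for items in recs.values():
        top_k = items[:k]
        if top_k:
            per_user.append(sum(train_popularity.get(i, 0) for i in top_k) / len(top_k))
    return sum(per_user) / len(per_user) if per_user else 0.0


def coverage(recs: Dict[str, List[str]], catalog: Sequence[str], k: int) -> float:
    """Share of catalog items recommended to at least one user."""
    recommended = set(top_k_counts(recs, k))
    return len(recommended & set(catalog)) / len(catalog)


def gini(recs: Dict[str, List[str]], catalog: Sequence[str], k: int) -> float:
    """Gini coefficient of recommendation counts (0 = perfectly even)."""
    counts = top_k_counts(recs, k)
    values = sorted(counts.get(item, 0) for item in catalog)
    n, total = len(values), sum(values)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(values, start=1))
    return weighted / (n * total)


# Hypothetical data: item "a" dominates both training and recommendations.
recs = {"u1": ["a", "b"], "u2": ["a", "c"], "u3": ["a", "b"]}
train_popularity = {"a": 10, "b": 4, "c": 1, "d": 1}
catalog = ["a", "b", "c", "d"]

arp_value = arp(recs, train_popularity, k=2)    # per-user means: 7.0, 5.5, 7.0
coverage_value = coverage(recs, catalog, k=2)   # 3 of 4 items recommended
gini_value = gini(recs, catalog, k=2)           # counts: a=3, b=2, c=1, d=0
```

Including never-recommended catalog items (count 0) in the Gini calculation is a design choice of this sketch; leaving them out would give a lower value.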

Bias Analysis Metrics:

| Metric Class | Type | Description |
| --- | --- | --- |
| ItemBias | Metric (DataframeValue) | Measures bias in recommendations toward specific item groups. Requires column_name for group column. Supports "default" or "train" distribution |
| UserBias | Metric (DataframeValue) | Measures bias toward specific user groups. Requires column_name for group column. Supports "default" or "train" distribution |

Inspection Metrics:

| Metric Class | Type | Description |
| --- | --- | --- |
| RecCasesTable | DataframeMetric | Displays detailed recommendation cases for specific users. Optional user_ids and display_features parameters |

Helper Function:

  • _gen_ranking_input_data(context, task_name) -- Generates InputData with ranking-specific column mappings (user_id, item_id, prediction, target, recommendations_type) from the data definition's ranking task configuration.
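The column roles listed above can be pictured as a long-format table with one row per (user, item) pair. The following sketch is only an illustration of that shape and does not reproduce _gen_ranking_input_data; it assumes that recommendations_type distinguishes predictions given as scores (higher is better) from predictions given as ranks (1 is best), and the helper name and sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def ranked_lists(rows: List[dict], recommendations_type: str = "score") -> Dict[str, List[str]]:
    """Group long-format rows by user and order each user's items best-first."""
    by_user = defaultdict(list)
    for row in rows:
        by_user[row["user_id"]].append(row)

    # Assumption: "score" means higher prediction first,
    # "rank" means lower prediction (rank 1) first.
    descending = recommendations_type == "score"
    result = {}
    for user, user_rows in by_user.items():
        ordered = sorted(user_rows, key=lambda r: r["prediction"], reverse=descending)
        result[user] = [r["item_id"] for r in ordered]
    return result


# Hypothetical rows using the column roles user_id, item_id, prediction, target.
rows = [
    {"user_id": "u1", "item_id": "a", "prediction": 0.2, "target": 0},
    {"user_id": "u1", "item_id": "b", "prediction": 0.9, "target": 1},
    {"user_id": "u1", "item_id": "c", "prediction": 0.5, "target": 0},
    {"user_id": "u2", "item_id": "a", "prediction": 0.7, "target": 1},
]

by_score = ranked_lists(rows)
```

Per-user ranked lists of this kind are the natural input for top-k calculations.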

Default Tests: All SingleValueMetric subclasses in this module define _default_tests_with_reference() returning eq(Reference(relative=0.1)), which tests that the current value is within 10% of the reference value.
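As a rough illustration of what "within 10% of the reference value" means, the check can be written as a relative-tolerance comparison. This sketch assumes the tolerance is taken as a fraction of the reference value's magnitude; the helper name is hypothetical and it is not the Evidently test implementation.

```python
def within_relative_tolerance(current: float, reference: float, relative: float = 0.1) -> bool:
    """True when current deviates from reference by at most relative * |reference|."""
    return abs(current - reference) <= abs(reference) * relative


# Hypothetical values: a reference Novelty of 0.50 allows roughly 0.45 to 0.55.
passes = within_relative_tolerance(current=0.54, reference=0.50)
fails = within_relative_tolerance(current=0.56, reference=0.50)
```

A relative tolerance scales with the metric's magnitude, which suits a module whose metrics live on very different scales (for example entropy versus a Gini coefficient).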

Usage

Use this module when:

  • Evaluating recommender system or ranking model quality.
  • Monitoring recommendation diversity, novelty, serendipity, and personalization.
  • Analyzing popularity and group bias in recommendations.
  • Inspecting individual recommendation cases for debugging.

Code Reference

Source Location

Signature

class TopKBase(DataframeMetric):
    k: int
    min_rel_score: Optional[int] = None
    no_feedback_users: bool = False
    ranking_name: str = "default"

class NDCG(TopKBase): ...
class MRR(TopKBase): ...
class HitRate(TopKBase): ...
class MAP(TopKBase): ...
class RecallTopK(TopKBase): ...
class PrecisionTopK(TopKBase): ...
class FBetaTopK(TopKBase):
    beta: Optional[float] = 1.0

class ScoreDistribution(SingleValueMetric):
    k: int
    ranking_name: str = "default"

class PopularityBiasMetric(SingleValueMetric):
    k: int
    normalize_arp: bool = False
    ranking_name: str = "default"
    metric: Literal["arp", "coverage", "gini"] = "arp"

class Personalization(SingleValueMetric):
    k: int
    ranking_name: str = "default"

class Diversity(SingleValueMetric):
    k: int
    item_features: List[str]
    ranking_name: str = "default"

class Serendipity(SingleValueMetric):
    k: int
    item_features: List[str]
    ranking_name: str = "default"

class Novelty(SingleValueMetric):
    k: int
    ranking_name: str = "default"

class ItemBias(Metric):
    k: int
    column_name: str
    distribution: Literal["default", "train"] = "default"
    ranking_name: str = "default"

class UserBias(Metric):
    column_name: str
    distribution: Literal["default", "train"] = "default"
    ranking_name: str = "default"

class RecCasesTable(DataframeMetric):
    user_ids: Optional[List[Union[int, str]]] = None
    display_features: Optional[List[str]] = None
    ranking_name: str = "default"

Import

from evidently.metrics.recsys import (
    NDCG,
    MRR,
    HitRate,
    MAP,
    RecallTopK,
    PrecisionTopK,
    FBetaTopK,
    ScoreDistribution,
    PopularityBiasMetric,
    Personalization,
    Diversity,
    Serendipity,
    Novelty,
    ItemBias,
    UserBias,
    RecCasesTable,
)

I/O Contract

Inputs

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| k | int | Yes (most metrics) | Number of top items to consider in the ranking |
| ranking_name | str | No | Name of the ranking task in the data definition (default: "default") |
| min_rel_score | Optional[int] | No | Minimum relevance score threshold for considering items as relevant |
| no_feedback_users | bool | No | Whether to include users with no feedback (default: False) |
| beta | Optional[float] | No (FBetaTopK) | Beta parameter for F-beta score (default: 1.0) |
| item_features | List[str] | Yes (Diversity, Serendipity) | Feature columns for diversity/serendipity calculation |
| column_name | str | Yes (ItemBias, UserBias) | Column containing group/category information |
| distribution | Literal["default", "train"] | No | Distribution source for bias metrics (default: "default") |
| metric | Literal["arp", "coverage", "gini"] | No (PopularityBiasMetric) | Popularity bias metric type (default: "arp") |
| user_ids | Optional[List[Union[int, str]]] | No | Specific user IDs for RecCasesTable |
| display_features | Optional[List[str]] | No | Feature columns to display in RecCasesTable |

Outputs

| Name | Type | Description |
| --- | --- | --- |
| Top-K metrics | DataframeValue | DataFrame with columns "rank" (1-based) and "value" for each rank position |
| SingleValue metrics | SingleValue | Single numeric value (entropy, ARP, coverage, Gini, personalization, diversity, serendipity, novelty) |
| Bias metrics | DataframeValue | DataFrame with columns "x" (bin centers) and "y" (counts) representing distribution |
| RecCasesTable | DataframeValue | DataFrame with recommendation details per user including user_id, item_id, prediction scores, and display features |
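The rank/value layout of the top-k output can be pictured with a small NDCG sketch that produces one row per rank position. It uses the standard log2-discounted DCG and, as a simplification, builds the ideal ordering only from the items in the recommended list; it is not Evidently's code, and the plain list of dictionaries merely stands in for the DataFrame.

```python
import math
from typing import Dict, List, Sequence, Union


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain with a log2(position + 1) discount."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances, start=1))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int) -> float:
    """DCG of the top-k divided by the DCG of the best possible top-k."""
    # Simplification: the ideal order is built from the recommended items only.
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    if ideal_dcg == 0:
        return 0.0
    return dcg(ranked_relevances[:k]) / ideal_dcg


def ndcg_by_rank(ranked_relevances: Sequence[float], max_k: int) -> List[Dict[str, Union[int, float]]]:
    """One row per rank position, mirroring the 'rank' and 'value' columns."""
    return [
        {"rank": k, "value": ndcg_at_k(ranked_relevances, k)}
        for k in range(1, max_k + 1)
    ]


# Hypothetical user whose first recommendation is irrelevant and the next two are relevant.
rows = ndcg_by_rank([0, 1, 1], max_k=3)
for row in rows:
    print(row["rank"], round(row["value"], 4))
```

Reading the rows top to bottom shows how the metric changes as the cutoff moves from 1 to k.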

Usage Examples

Basic Ranking Evaluation

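from evidently.core.report import Report
from evidently.metrics.recsys import NDCG, MRR, HitRate, MAP

# current_dataset and reference_dataset are assumed to be prepared Dataset
# objects whose data definition includes a ranking task named "default".
report = Report([
    NDCG(k=10),
    MRR(k=10),
    HitRate(k=10),
    MAP(k=10),
])
snapshot = report.run(current_dataset, reference_dataset)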

Beyond-Accuracy Metrics

from evidently.core.report import Report
from evidently.metrics.recsys import (
    Diversity, Novelty, Serendipity, Personalization, ScoreDistribution
)

# Diversity and Serendipity need the item feature columns to compare items.
report = Report([
    Diversity(k=10, item_features=["genre", "category"]),
    Novelty(k=10),
    Serendipity(k=10, item_features=["genre", "category"]),
    Personalization(k=10),
    ScoreDistribution(k=10),
])
snapshot = report.run(current_dataset, reference_dataset)

Popularity and Bias Analysis

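from evidently.core.report import Report
from evidently.metrics.recsys import PopularityBiasMetric, ItemBias, UserBias

report = Report([
    PopularityBiasMetric(k=10, metric="gini"),  # or "arp" (default) / "coverage"
    ItemBias(k=10, column_name="category"),     # column_name holds the item group
    UserBias(column_name="age_group"),          # UserBias takes no k parameter
])
snapshot = report.run(current_dataset, reference_dataset)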

Recommendation Cases Inspection

from evidently.core.report import Report
from evidently.metrics.recsys import RecCasesTable

report = Report([
    RecCasesTable(
        user_ids=["user_001", "user_002", "user_003"],
        display_features=["title", "genre", "rating"],
    ),
])
# No reference dataset is passed for this inspection table.
snapshot = report.run(current_dataset, None)

Full Recsys Report with Custom Ranking Task

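from evidently.core.datasets import DataDefinition, Dataset, Recsys
from evidently.core.report import Report
from evidently.metrics.recsys import NDCG, MRR, FBetaTopK, PrecisionTopK, RecallTopK

# df is assumed to be an existing pandas DataFrame that contains the
# ranking columns, including "target" and "prediction".
dataset = Dataset.from_pandas(
    df,
    data_definition=DataDefinition(
        numerical_columns=["target", "prediction"],
        # The task name must match ranking_name on the metrics below.
        ranking=[Recsys(name="my_ranking")]
    )
)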

report = Report([
    NDCG(k=5, ranking_name="my_ranking"),
    MRR(k=5, ranking_name="my_ranking"),
    FBetaTopK(k=5, beta=0.5, ranking_name="my_ranking"),
    PrecisionTopK(k=5, ranking_name="my_ranking"),
    RecallTopK(k=5, ranking_name="my_ranking"),
])
snapshot = report.run(dataset, None)
