Implementation:Evidentlyai Evidently Legacy Data Drift Calculations

From Leeroopedia
Knowledge Sources
Domains: ML Monitoring, Data Drift
Last Updated: 2026-02-14 12:00 GMT

Overview

Implements the data drift detection calculations for the legacy Evidently pipeline, providing per-column drift analysis using configurable statistical tests and aggregating results into dataset-level drift metrics.

Description

This module is the computational core of Evidently's legacy data drift detection. It compares distributions of features between a reference and a current dataset, applying statistical tests to determine whether drift has occurred.
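To make the idea of comparing a reference and a current distribution concrete, here is a minimal sketch that drops non-finite values and then measures the largest gap between the two empirical CDFs (the two-sample Kolmogorov-Smirnov statistic). This is not Evidently code: the helper names `drop_non_finite` and `ks_statistic` are hypothetical, and the Kolmogorov-Smirnov statistic is chosen only for illustration. The test the module actually runs depends on the configured options.

```python
import numpy as np


def drop_non_finite(values):
    """Return a float array with NaN and +/-inf entries removed (hypothetical helper)."""
    arr = np.asarray(values, dtype=float)
    return arr[np.isfinite(arr)]


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between empirical CDFs."""
    ref = np.sort(drop_non_finite(reference))
    cur = np.sort(drop_non_finite(current))
    if ref.size == 0 or cur.size == 0:
        raise ValueError("both samples need at least one finite value")
    # Evaluate both empirical CDFs at every observed point.
    points = np.concatenate([ref, cur])
    ref_cdf = np.searchsorted(ref, points, side="right") / ref.size
    cur_cdf = np.searchsorted(cur, points, side="right") / cur.size
    return float(np.max(np.abs(ref_cdf - cur_cdf)))


reference = [1.0, 2.0, 3.0, 4.0, 5.0, float("nan")]
current = [2.0, 4.0, 6.0, 8.0, 10.0, float("inf")]
statistic = ks_statistic(reference, current)
print(f"KS statistic: {statistic:.2f}")
```

A larger statistic means the two samples look less alike. Turning it into a drift decision still requires a threshold (or a p-value), which this sketch deliberately leaves out.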

Result models defined in this module:

  • DriftStatsField -- Holds per-dataset (current or reference) statistics for a single column: distribution, small distribution (histogram bins), characteristic examples and words (for text), and correlations.
  • ColumnDataDriftMetrics -- Extends ColumnMetricResult with drift-specific fields: the statistical test name, threshold, drift score, whether drift was detected, and DriftStatsField objects for both current and reference data. Also carries optional scatter/scatter-aggregate plot data.
  • DatasetDrift -- A plain dataclass summarizing dataset-level drift: count of drifted columns, drift score, and a boolean drift flag.
  • DatasetDriftMetrics -- A MetricResult aggregating all column-level drift results plus overall dataset drift statistics.

Key functions:

  • get_one_column_drift -- The main per-column calculation. It:
    • Validates the column exists in both datasets and has a supported type (Numerical, Categorical, or Text).
    • Selects the appropriate statistical test based on options (target-specific or feature-specific).
    • Cleans NaN and infinite values from both datasets.
    • Runs the selected statistical test to produce a drift score and drift detection flag.
    • For numerical columns: computes correlations, histograms, and scatter/time-series plots.
    • For categorical columns: computes value count distributions.
    • For text columns with detected drift: extracts characteristic examples and words.
    • Returns a fully populated ColumnDataDriftMetrics object.
  • get_drift_for_columns -- Orchestrates drift detection across all columns:
    • Normalizes prediction columns for classification tasks.
    • Identifies all columns to test (target, prediction, numerical, categorical, text features).
    • Recognizes column types using the combined reference+current dataset.
    • Pre-computes correlation matrices for numerical columns (using optimized numpy where possible).
    • Iterates over columns, calling get_one_column_drift for each.
    • Aggregates into DatasetDriftMetrics using get_dataset_drift.
  • get_dataset_drift -- Computes dataset-level drift by counting columns where drift was detected and comparing the share against a threshold.
  • ensure_prediction_column_is_string -- Normalizes prediction columns: converts probability columns to predicted labels for multiclass (argmax) or binary (threshold) classification.
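The dataset-level aggregation described for get_dataset_drift can be pictured with a small standalone sketch. It is not the library's implementation: `DatasetDriftSketch` and `summarize_dataset_drift` are hypothetical names, the input is simplified to a mapping from column name to a drift flag, and two details are assumptions rather than facts from this page: that the drift score equals the share of drifted columns, and that the comparison against the threshold is inclusive (`>=`).

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class DatasetDriftSketch:
    """Hypothetical stand-in mirroring the fields of DatasetDrift."""
    number_of_drifted_columns: int
    dataset_drift_score: float
    dataset_drift: bool


def summarize_dataset_drift(
    drift_flags: Dict[str, bool], drift_share: float = 0.5
) -> DatasetDriftSketch:
    """Count drifted columns and compare their share against a threshold."""
    if not drift_flags:
        raise ValueError("at least one column is required")
    drifted = sum(1 for flag in drift_flags.values() if flag)
    share = drifted / len(drift_flags)
    # Assumption: the share itself is the score, and the comparison is inclusive.
    return DatasetDriftSketch(drifted, share, share >= drift_share)


summary = summarize_dataset_drift(
    {"feature_1": True, "feature_2": False, "category_a": True}
)
print(summary)
```

With two of three columns flagged, the share is about 0.67, which exceeds the default threshold of 0.5, so the sketch reports dataset drift.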

Usage

These functions are used internally by Evidently's DataDriftTable, DatasetDriftMetric, and related drift metrics. They can also be called directly to compute drift between two pandas DataFrames, given a column mapping and drift options.

Code Reference

Source Location

Signature

class DriftStatsField(MetricResult):
    distribution: Optional[Distribution]
    characteristic_examples: Optional[Examples]
    characteristic_words: Optional[Words]
    small_distribution: Optional[DistributionIncluded]
    correlations: Optional[Dict[str, float]]

class ColumnDataDriftMetrics(ColumnMetricResult):
    stattest_name: str
    stattest_threshold: Optional[float]
    drift_score: Numeric
    drift_detected: bool
    current: DriftStatsField
    reference: DriftStatsField
    scatter: Optional[Union[ScatterField, ScatterAggField]]

@dataclass
class DatasetDrift:
    number_of_drifted_columns: int
    dataset_drift_score: float
    dataset_drift: bool

class DatasetDriftMetrics(MetricResult):
    number_of_columns: int
    number_of_drifted_columns: int
    share_of_drifted_columns: float
    dataset_drift: bool
    drift_by_columns: Dict[str, ColumnDataDriftMetrics]
    options: DataDriftOptions
    dataset_columns: DatasetColumns

def get_one_column_drift(
    *, current_data: pd.DataFrame, reference_data: pd.DataFrame,
    column_name: str, options: DataDriftOptions,
    dataset_columns: DatasetColumns, column_type: ColumnType,
    agg_data: bool, num_correlations: Optional[tuple] = None,
    is_contains_nans: Optional[Tuple[pd.Series, pd.Series]] = None,
) -> ColumnDataDriftMetrics: ...

def get_drift_for_columns(
    *, current_data: pd.DataFrame, reference_data: pd.DataFrame,
    dataset_columns: DatasetColumns, data_drift_options: DataDriftOptions,
    drift_share_threshold: Optional[float] = None,
    columns: Optional[List[str]] = None, agg_data: bool,
) -> DatasetDriftMetrics: ...

def get_dataset_drift(
    drift_metrics: Dict[str, ColumnDataDriftMetrics], drift_share: float = 0.5
) -> DatasetDrift: ...

def ensure_prediction_column_is_string(
    *, prediction_column: Optional[Union[str, Sequence]],
    current_data: pd.DataFrame, reference_data: pd.DataFrame,
    threshold: float = 0.5,
) -> Optional[str]: ...
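The prediction normalization behind ensure_prediction_column_is_string (argmax for multiclass, threshold for binary) can be sketched on a single DataFrame as follows. This is an illustration under assumptions, not the library's code: `labels_from_probabilities` and the output column name `predicted_labels` are hypothetical, the binary branch assumes the first listed column holds the positive-class probability, and the real function works on both the current and the reference data.

```python
from typing import Optional, Sequence, Union

import pandas as pd


def labels_from_probabilities(
    data: pd.DataFrame,
    prediction_column: Optional[Union[str, Sequence[str]]],
    threshold: float = 0.5,
    output_column: str = "predicted_labels",  # hypothetical name
) -> Optional[str]:
    """Return the name of a label column, deriving it from probabilities if needed."""
    if prediction_column is None:
        return None
    if isinstance(prediction_column, str):
        # Already a single label column: nothing to convert.
        return prediction_column
    columns = list(prediction_column)
    if len(columns) > 2:
        # Multiclass: pick the class whose probability is highest in each row.
        data[output_column] = data[columns].idxmax(axis=1)
    elif len(columns) == 2:
        # Binary (assumption): threshold the first column's probability.
        data[output_column] = (data[columns[0]] > threshold).astype(int)
    else:
        raise ValueError("expected at least two probability columns")
    return output_column


multi = pd.DataFrame({"cat": [0.7, 0.1], "dog": [0.2, 0.3], "bird": [0.1, 0.6]})
binary = pd.DataFrame({"yes": [0.8, 0.3], "no": [0.2, 0.7]})
multi_name = labels_from_probabilities(multi, ["cat", "dog", "bird"])
binary_name = labels_from_probabilities(binary, ["yes", "no"])
print(multi[multi_name].tolist(), binary[binary_name].tolist())
```

Note that the sketch adds the derived column to the DataFrame in place, so callers who need the original frame untouched should pass a copy.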

Import

from evidently.legacy.calculations.data_drift import (
    get_one_column_drift,
    get_drift_for_columns,
    get_dataset_drift,
    ensure_prediction_column_is_string,
    ColumnDataDriftMetrics,
    DatasetDriftMetrics,
    DatasetDrift,
    DriftStatsField,
)

I/O Contract

Inputs

Name | Type | Required | Description
current_data | pd.DataFrame | Yes | The current (production) dataset to check for drift.
reference_data | pd.DataFrame | Yes | The reference (baseline) dataset to compare against.
column_name | str | Yes (for get_one_column_drift) | Name of the column to analyze.
column_type | ColumnType | Yes (for get_one_column_drift) | Type of the column (Numerical, Categorical, or Text).
options | DataDriftOptions | Yes | Configuration for statistical tests, thresholds, and feature-specific overrides. Named data_drift_options in get_drift_for_columns.
dataset_columns | DatasetColumns | Yes | Metadata describing all dataset columns and their roles.
agg_data | bool | Yes | Whether to aggregate scatter data (True) or show raw data points (False).
drift_share_threshold | Optional[float] | No | Fraction of drifted columns required to flag dataset drift (default from options or 0.5).
columns | Optional[List[str]] | No | Explicit list of columns to test; if None, all relevant columns are tested.

Outputs

Name | Type | Description
ColumnDataDriftMetrics | ColumnDataDriftMetrics | Per-column drift results including score, detection flag, distributions, and scatter data.
DatasetDriftMetrics | DatasetDriftMetrics | Aggregated drift results for the full dataset.
DatasetDrift | DatasetDrift | Lightweight summary of dataset-level drift.

Usage Examples

import pandas as pd
from evidently.legacy.calculations.data_drift import get_drift_for_columns
from evidently.legacy.options.data_drift import DataDriftOptions
from evidently.legacy.metric_results import DatasetColumns, DatasetUtilityColumns

# Define columns metadata
columns = DatasetColumns(
    utility_columns=DatasetUtilityColumns(date=None, id=None, target=None, prediction=None),
    target_type=None,
    num_feature_names=["feature_1", "feature_2"],
    cat_feature_names=["category_a"],
    text_feature_names=[],
    datetime_feature_names=[],
    target_names=None,
    task=None,
)

reference = pd.DataFrame({
    "feature_1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feature_2": [10, 20, 30, 40, 50],
    "category_a": ["a", "b", "a", "b", "a"],
})

current = pd.DataFrame({
    "feature_1": [2.0, 4.0, 6.0, 8.0, 10.0],
    "feature_2": [15, 25, 35, 45, 55],
    "category_a": ["a", "a", "b", "b", "c"],
})

drift_result = get_drift_for_columns(
    current_data=current,
    reference_data=reference,
    dataset_columns=columns,
    data_drift_options=DataDriftOptions(),
    agg_data=True,
)

print(f"Dataset drift detected: {drift_result.dataset_drift}")
print(f"Drifted columns: {drift_result.number_of_drifted_columns}/{drift_result.number_of_columns}")
for col_name, col_drift in drift_result.drift_by_columns.items():
    print(f"  {col_name}: score={col_drift.drift_score}, drifted={col_drift.drift_detected}")
