Implementation:Evidentlyai Evidently Legacy Data Drift Calculations
| Knowledge Sources | |
|---|---|
| Domains | ML Monitoring, Data Drift |
| Last Updated | 2026-02-14 12:00 GMT |
Overview
Implements the data drift detection calculations for the legacy Evidently pipeline, providing per-column drift analysis using configurable statistical tests and aggregating results into dataset-level drift metrics.
Description
This module is the computational core of Evidently's legacy data drift detection. It compares distributions of features between a reference and a current dataset, applying statistical tests to determine whether drift has occurred.
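As a rough illustration of the compare-and-test idea, the sketch below cleans two samples and scores them with a hand-written two-sample Kolmogorov-Smirnov statistic. It uses only the standard library and does not reproduce Evidently's code: the names `clean`, `ks_statistic`, and `column_drift` and the 0.3 threshold are hypothetical, and the real module selects its test and threshold from `DataDriftOptions`.

```python
import math
from bisect import bisect_right
from typing import List, Sequence, Tuple


def clean(values: Sequence[float]) -> List[float]:
    """Drop None, NaN, and infinite entries (hypothetical helper)."""
    return [v for v in values if v is not None and math.isfinite(v)]


def ks_statistic(reference: Sequence[float], current: Sequence[float]) -> float:
    """Largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    points = sorted(set(ref) | set(cur))
    return max(
        abs(bisect_right(ref, x) / len(ref) - bisect_right(cur, x) / len(cur))
        for x in points
    )


def column_drift(
    reference: Sequence[float],
    current: Sequence[float],
    threshold: float = 0.3,  # assumed value, for illustration only
) -> Tuple[float, bool]:
    """Return (drift score, drift flag) for one numerical column."""
    ref, cur = clean(reference), clean(current)
    if not ref or not cur:
        raise ValueError("both samples need at least one finite value")
    score = ks_statistic(ref, cur)
    return score, score >= threshold


score, drifted = column_drift(
    [1.0, 2.0, 3.0, 4.0, 5.0, float("nan")],
    [2.0, 4.0, 6.0, 8.0, 10.0, float("inf")],
)
print(score, drifted)
```

This sketch flags drift when a distance exceeds a threshold. Tests based on p-values would instead flag drift when the p-value falls below the threshold.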
Result models defined in this module:
- DriftStatsField -- Holds per-dataset (current or reference) statistics for a single column: distribution, small distribution (histogram bins), characteristic examples and words (for text), and correlations.
- ColumnDataDriftMetrics -- Extends ColumnMetricResult with drift-specific fields: the statistical test name, threshold, drift score, whether drift was detected, and DriftStatsField objects for both current and reference data. Also carries optional scatter/scatter-aggregate plot data.
- DatasetDrift -- A plain dataclass summarizing dataset-level drift: count of drifted columns, drift score, and a boolean drift flag.
- DatasetDriftMetrics -- A MetricResult aggregating all column-level drift results plus overall dataset drift statistics.
Key functions:
- get_one_column_drift -- The main per-column calculation. It:
  - Validates the column exists in both datasets and has a supported type (Numerical, Categorical, or Text).
  - Selects the appropriate statistical test based on options (target-specific or feature-specific).
  - Cleans NaN and infinite values from both datasets.
  - Runs the selected statistical test to produce a drift score and drift detection flag.
  - For numerical columns: computes correlations, histograms, and scatter/time-series plots.
  - For categorical columns: computes value count distributions.
  - For text columns with detected drift: extracts characteristic examples and words.
  - Returns a fully populated ColumnDataDriftMetrics object.
- get_drift_for_columns -- Orchestrates drift detection across all columns:
  - Normalizes prediction columns for classification tasks.
  - Identifies all columns to test (target, prediction, numerical, categorical, text features).
  - Recognizes column types using the combined reference+current dataset.
  - Pre-computes correlation matrices for numerical columns (using optimized numpy where possible).
  - Iterates over columns, calling get_one_column_drift for each.
  - Aggregates into DatasetDriftMetrics using get_dataset_drift.
- get_dataset_drift -- Computes dataset-level drift by counting columns where drift was detected and comparing the share against a threshold.
- ensure_prediction_column_is_string -- Normalizes prediction columns: converts probability columns to predicted labels for multiclass (argmax) or binary (threshold) classification.
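The dataset-level aggregation described for get_dataset_drift can be sketched in plain Python as follows. This is an illustration rather than the module's code: `DatasetDriftSummary` and `summarize_drift` are hypothetical stand-ins. The inclusive `>=` comparison and the use of the drifted share as the dataset drift score are assumptions.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class DatasetDriftSummary:
    """Hypothetical stand-in mirroring the fields of DatasetDrift."""

    number_of_drifted_columns: int
    dataset_drift_score: float
    dataset_drift: bool


def summarize_drift(
    drift_flags: Dict[str, bool], drift_share: float = 0.5
) -> DatasetDriftSummary:
    """Count drifted columns and compare their share with drift_share."""
    if not drift_flags:
        raise ValueError("at least one column is required")
    drifted = sum(1 for flag in drift_flags.values() if flag)
    share = drifted / len(drift_flags)
    # Assumption: the dataset is flagged when the share reaches the threshold.
    return DatasetDriftSummary(drifted, share, share >= drift_share)


summary = summarize_drift(
    {"feature_1": True, "feature_2": False, "category_a": True}
)
print(summary)
```

With two of three columns drifted, the share is about 0.67. The dataset is flagged at the default threshold of 0.5 but would not be at 0.7.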
Usage
This module is used internally by Evidently's DataDriftTable, DatasetDriftMetric, and related drift metrics. Its functions can also be called directly to compute drift between two pandas DataFrames, given a column mapping and drift options.
Code Reference
Source Location
- Repository: Evidentlyai_Evidently
- File: src/evidently/legacy/calculations/data_drift.py
Signature
class DriftStatsField(MetricResult):
    distribution: Optional[Distribution]
    characteristic_examples: Optional[Examples]
    characteristic_words: Optional[Words]
    small_distribution: Optional[DistributionIncluded]
    correlations: Optional[Dict[str, float]]

class ColumnDataDriftMetrics(ColumnMetricResult):
    stattest_name: str
    stattest_threshold: Optional[float]
    drift_score: Numeric
    drift_detected: bool
    current: DriftStatsField
    reference: DriftStatsField
    scatter: Optional[Union[ScatterField, ScatterAggField]]

@dataclass
class DatasetDrift:
    number_of_drifted_columns: int
    dataset_drift_score: float
    dataset_drift: bool

class DatasetDriftMetrics(MetricResult):
    number_of_columns: int
    number_of_drifted_columns: int
    share_of_drifted_columns: float
    dataset_drift: bool
    drift_by_columns: Dict[str, ColumnDataDriftMetrics]
    options: DataDriftOptions
    dataset_columns: DatasetColumns

def get_one_column_drift(
    *, current_data: pd.DataFrame, reference_data: pd.DataFrame,
    column_name: str, options: DataDriftOptions,
    dataset_columns: DatasetColumns, column_type: ColumnType,
    agg_data: bool, num_correlations: Optional[tuple] = None,
    is_contains_nans: Optional[Tuple[pd.Series, pd.Series]] = None,
) -> ColumnDataDriftMetrics: ...

def get_drift_for_columns(
    *, current_data: pd.DataFrame, reference_data: pd.DataFrame,
    dataset_columns: DatasetColumns, data_drift_options: DataDriftOptions,
    drift_share_threshold: Optional[float] = None,
    columns: Optional[List[str]] = None, agg_data: bool,
) -> DatasetDriftMetrics: ...

def get_dataset_drift(
    drift_metrics: Dict[str, ColumnDataDriftMetrics], drift_share: float = 0.5
) -> DatasetDrift: ...

def ensure_prediction_column_is_string(
    *, prediction_column: Optional[Union[str, Sequence]],
    current_data: pd.DataFrame, reference_data: pd.DataFrame,
    threshold: float = 0.5,
) -> Optional[str]: ...
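To make the role of the `threshold` parameter in ensure_prediction_column_is_string more concrete, here is a minimal sketch of the two normalization strategies the description names: argmax for multiclass and a threshold for binary classification. It works on plain Python lists rather than DataFrames. The helper names are hypothetical, and treating a probability equal to the threshold as the positive class is an assumption.

```python
from typing import Dict, List, Sequence


def labels_from_probabilities(probas: Dict[str, Sequence[float]]) -> List[str]:
    """Multiclass: for each row, pick the label with the highest probability."""
    labels = list(probas)
    n_rows = len(probas[labels[0]])
    return [
        max(labels, key=lambda label: probas[label][i])
        for i in range(n_rows)
    ]


def labels_from_positive_probability(
    proba: Sequence[float],
    positive: str,
    negative: str,
    threshold: float = 0.5,
) -> List[str]:
    """Binary: compare the positive-class probability with the threshold."""
    # Assumption: a probability equal to the threshold counts as positive.
    return [positive if p >= threshold else negative for p in proba]


multiclass = labels_from_probabilities(
    {"cat": [0.7, 0.2, 0.1], "dog": [0.2, 0.5, 0.3], "bird": [0.1, 0.3, 0.6]}
)
binary = labels_from_positive_probability([0.9, 0.5, 0.2], "yes", "no")
print(multiclass, binary)
```

Either way, the result is a single column of string labels, which can then be treated like any other categorical column for drift testing.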
Import
from evidently.legacy.calculations.data_drift import (
    get_one_column_drift,
    get_drift_for_columns,
    get_dataset_drift,
    ensure_prediction_column_is_string,
    ColumnDataDriftMetrics,
    DatasetDriftMetrics,
    DatasetDrift,
    DriftStatsField,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| current_data | pd.DataFrame | Yes | The current (production) dataset to check for drift. |
| reference_data | pd.DataFrame | Yes | The reference (baseline) dataset to compare against. |
| column_name | str | Yes (for get_one_column_drift) | Name of the column to analyze. |
| column_type | ColumnType | Yes (for get_one_column_drift) | Type of the column (Numerical, Categorical, or Text). |
| options | DataDriftOptions | Yes | Configuration for statistical tests, thresholds, and feature-specific overrides. |
| dataset_columns | DatasetColumns | Yes | Metadata describing all dataset columns and their roles. |
| agg_data | bool | Yes | Whether to aggregate scatter data (True) or show raw data points (False). |
| drift_share_threshold | Optional[float] | No | Fraction of drifted columns required to flag dataset drift (default from options or 0.5). |
| columns | Optional[List[str]] | No | Explicit list of columns to test; if None, all relevant columns are tested. |
Outputs
| Name | Type | Description |
|---|---|---|
| ColumnDataDriftMetrics | ColumnDataDriftMetrics | Per-column drift results including score, detection flag, distributions, and scatter data. |
| DatasetDriftMetrics | DatasetDriftMetrics | Aggregated drift results for the full dataset. |
| DatasetDrift | DatasetDrift | Lightweight summary of dataset-level drift. |
Usage Examples
import pandas as pd
from evidently.legacy.calculations.data_drift import get_drift_for_columns
from evidently.legacy.options.data_drift import DataDriftOptions
from evidently.legacy.metric_results import DatasetColumns, DatasetUtilityColumns

# Define columns metadata
columns = DatasetColumns(
    utility_columns=DatasetUtilityColumns(date=None, id=None, target=None, prediction=None),
    target_type=None,
    num_feature_names=["feature_1", "feature_2"],
    cat_feature_names=["category_a"],
    text_feature_names=[],
    datetime_feature_names=[],
    target_names=None,
    task=None,
)

reference = pd.DataFrame({
    "feature_1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feature_2": [10, 20, 30, 40, 50],
    "category_a": ["a", "b", "a", "b", "a"],
})
current = pd.DataFrame({
    "feature_1": [2.0, 4.0, 6.0, 8.0, 10.0],
    "feature_2": [15, 25, 35, 45, 55],
    "category_a": ["a", "a", "b", "b", "c"],
})

drift_result = get_drift_for_columns(
    current_data=current,
    reference_data=reference,
    dataset_columns=columns,
    data_drift_options=DataDriftOptions(),
    agg_data=True,
)

print(f"Dataset drift detected: {drift_result.dataset_drift}")
print(f"Drifted columns: {drift_result.number_of_drifted_columns}/{drift_result.number_of_columns}")
for col_name, col_drift in drift_result.drift_by_columns.items():
    print(f"  {col_name}: score={col_drift.drift_score}, drifted={col_drift.drift_detected}")
Related Pages
- Environment:Evidentlyai_Evidently_Python_Core_Environment
- Evidentlyai_Evidently_Legacy_Base_Metric -- Base classes for metrics that use these calculations
- Evidentlyai_Evidently_Legacy_Metric_Results -- Distribution, ScatterField, and related result models
- Evidentlyai_Evidently_Legacy_Data_Quality_Calculations -- Complementary data quality calculations