Implementation:Evidentlyai Evidently Legacy Data Quality Calculations

From Leeroopedia
Knowledge Sources
Domains: ML Monitoring, Data Quality
Last Updated: 2026-02-14 12:00 GMT

Overview

Provides calculation functions for data quality statistics, including per-feature descriptive statistics, correlation analysis (Pearson, Spearman, Kendall, Cramer's V), and distribution computation for the legacy Evidently pipeline.

Description

This module computes comprehensive data quality metrics for individual features and entire datasets. It is used by Evidently's legacy data quality metrics to generate statistics about missing values, unique counts, distributions, and inter-feature correlations.

Data models:

  • FeatureQualityStats -- A dataclass capturing quality statistics for a single feature. Metrics vary by feature type:
    • All types: count, missing_count, missing_percentage, unique_count, unique_percentage, most_common_value, most_common_value_percentage.
    • Numeric: infinite_count, infinite_percentage, min, max, mean, std, percentile_25, percentile_50, percentile_75.
    • Categorical: new_in_current_values_count, unused_in_current_values_count (for reference-current comparison).
    • Datetime: max and min cast to string representation.
  • DataQualityStats -- Aggregates FeatureQualityStats by feature category: num_features_stats, cat_features_stats, datetime_features_stats, target_stats, and prediction_stats. Supports dictionary-style access and a get_all_features() method.
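
The per-feature fields listed above can be illustrated with plain pandas. The following is a minimal sketch, not Evidently's implementation: `sketch_feature_stats` is a hypothetical helper, it returns a plain dictionary instead of a FeatureQualityStats instance, and it assumes that `count` means the number of non-missing values and that percentages are expressed on a 0-100 scale.

```python
import numpy as np
import pandas as pd


def sketch_feature_stats(feature: pd.Series) -> dict:
    """Hypothetical helper: a few quality statistics for one column."""
    n_rows = len(feature)
    count = int(feature.count())  # non-missing values (assumed meaning of "count")
    missing = n_rows - count
    value_counts = feature.value_counts(dropna=True)
    stats = {
        "count": count,
        "missing_count": missing,
        "missing_percentage": round(100 * missing / n_rows, 2) if n_rows else None,
        "unique_count": int(feature.nunique()),
        "most_common_value": value_counts.index[0] if count else None,
    }
    if pd.api.types.is_numeric_dtype(feature):
        # Exclude infinities and missing values before computing moments.
        finite = feature.replace([np.inf, -np.inf], np.nan).dropna()
        stats.update(
            infinite_count=int(np.isinf(feature).sum()),
            mean=float(finite.mean()),
            percentile_50=float(finite.quantile(0.5)),
        )
    return stats


example = sketch_feature_stats(pd.Series([1.0, 2.0, None, 4.0, 5.0]))
print(example)
```

Numeric-only fields are added only when the column has a numeric dtype, which mirrors the "metrics vary by feature type" behavior described above.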

Key functions:

  • get_rows_count -- Returns the number of rows in a DataFrame or Series.
  • get_features_stats -- Computes FeatureQualityStats for a single feature Series given its column type.
  • calculate_data_quality_stats -- Orchestrates quality stat computation across all features in a dataset, categorized by feature type and role (target, prediction).
  • prepare_data_for_plots -- Relabels categorical data for plotting, capping at a maximum number of categories.
  • _select_features_for_corr -- Selects numerical features (for Pearson/Spearman/Kendall) and categorical features (for Cramer's V) that have more than one unique value.
  • _cramer_v -- Computes Cramer's V association measure between two categorical Series using chi-squared contingency.
  • get_pairwise_correlation -- Computes a symmetric pairwise correlation matrix using a given correlation function.
  • _calculate_correlations -- Dispatches correlation computation based on method (pearson, spearman, kendall, cramer_v).
  • calculate_correlations -- Computes all four correlation matrices for a dataset.
  • calculate_cramer_v_correlation -- Computes Cramer's V correlation for one column against a list of other columns, returning a ColumnCorrelations result.
  • calculate_category_correlation -- Computes Cramer's V for a category column against all features.
  • calculate_numerical_correlation -- Computes Pearson, Spearman, and Kendall correlations for a numerical column against all features.
  • calculate_column_distribution -- Computes value counts for a column, returning a dictionary distribution.
  • get_corr_method -- Utility to select the correlation method with fallback logic.
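
To make the Cramer's V and pairwise-matrix entries above concrete, here is a minimal sketch under stated assumptions. It is not the module's code: `cramers_v` and `pairwise_matrix` are hypothetical names, the chi-squared statistic is computed by hand with NumPy without a continuity correction, and the diagonal of the matrix is set to 1.0 by convention. The measure used is V = sqrt(chi2 / (n * min(r - 1, c - 1))), where n is the number of observations and r and c are the dimensions of the contingency table.

```python
import numpy as np
import pandas as pd


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Hypothetical helper: Cramer's V from a contingency table."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    n_rows, n_cols = observed.shape
    dof = min(n_rows - 1, n_cols - 1)
    if n == 0 or dof == 0:
        return float("nan")  # undefined when either column has a single category
    # Expected counts under independence: outer product of the marginals over n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * dof)))


def pairwise_matrix(df: pd.DataFrame, func) -> pd.DataFrame:
    """Hypothetical helper: symmetric matrix of func(col_a, col_b)."""
    cols = list(df.columns)
    result = pd.DataFrame(1.0, index=cols, columns=cols)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            value = func(df[a], df[b])
            result.loc[a, b] = value  # fill both triangles so the
            result.loc[b, a] = value  # matrix stays symmetric
    return result


frame = pd.DataFrame({
    "color": ["red", "red", "blue", "blue"],
    "shape": ["circle", "circle", "square", "square"],
    "size": ["S", "L", "S", "L"],
})
matrix = pairwise_matrix(frame, cramers_v)
print(matrix)
```

In this toy frame, `color` and `shape` determine each other exactly, while `size` is independent of both, so the matrix shows the two extremes of the measure.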

Usage

This module is used internally by Evidently's DataQualityMetric, ColumnSummaryMetric, and ColumnCorrelationsMetric. Its functions can also be called directly on pandas DataFrames for standalone data quality analysis.

Code Reference

Source Location

Signature

@dataclasses.dataclass
class FeatureQualityStats:
    feature_type: str
    number_of_rows: int = 0
    count: int = 0
    infinite_count: Optional[int] = None
    infinite_percentage: Optional[float] = None
    missing_count: Optional[int] = None
    missing_percentage: Optional[float] = None
    unique_count: Optional[int] = None
    unique_percentage: Optional[float] = None
    percentile_25: Optional[float] = None
    percentile_50: Optional[float] = None
    percentile_75: Optional[float] = None
    max: Optional[Union[int, float, bool, str]] = None
    min: Optional[Union[int, float, bool, str]] = None
    mean: Optional[float] = None
    std: Optional[float] = None
    most_common_value: Optional[Union[int, float, bool, str]] = None
    most_common_value_percentage: Optional[float] = None
    ...

@dataclasses.dataclass
class DataQualityStats:
    rows_count: int
    num_features_stats: Optional[Dict[str, FeatureQualityStats]] = None
    cat_features_stats: Optional[Dict[str, FeatureQualityStats]] = None
    datetime_features_stats: Optional[Dict[str, FeatureQualityStats]] = None
    target_stats: Optional[Dict[str, FeatureQualityStats]] = None
    prediction_stats: Optional[Dict[str, FeatureQualityStats]] = None

def get_rows_count(data: Union[pd.DataFrame, pd.Series]) -> int: ...

def get_features_stats(feature: pd.Series, feature_type: ColumnType) -> FeatureQualityStats: ...

def calculate_data_quality_stats(
    dataset: pd.DataFrame, columns: DatasetColumns, task: Optional[str]
) -> DataQualityStats: ...

def calculate_correlations(
    dataset: pd.DataFrame, data_definition: DataDefinition, add_text_columns: Optional[list] = None
) -> Dict: ...

def calculate_cramer_v_correlation(
    column_name: str, dataset: pd.DataFrame, columns: List[str]
) -> ColumnCorrelations: ...

def calculate_category_correlation(
    column_display_name: str, column: pd.Series, features: pd.DataFrame,
) -> List[ColumnCorrelations]: ...

def calculate_numerical_correlation(
    column_display_name: str, column: Optional[pd.Series], features: pd.DataFrame,
) -> List[ColumnCorrelations]: ...

def calculate_column_distribution(column: pd.Series, column_type: str) -> ColumnDistribution: ...

Import

from evidently.legacy.calculations.data_quality import (
    FeatureQualityStats,
    DataQualityStats,
    get_rows_count,
    get_features_stats,
    calculate_data_quality_stats,
    calculate_correlations,
    calculate_cramer_v_correlation,
    calculate_category_correlation,
    calculate_numerical_correlation,
    calculate_column_distribution,
    prepare_data_for_plots,
)

I/O Contract

Inputs

Name | Type | Required | Description
feature | pd.Series | Yes (for get_features_stats) | A single feature column to compute statistics for.
feature_type | ColumnType | Yes | The type of the feature (Numerical, Categorical, Datetime).
dataset | pd.DataFrame | Yes (for dataset-level functions) | Full dataset for quality or correlation computation.
columns | DatasetColumns | Yes (for calculate_data_quality_stats) | Metadata describing column roles and types.
data_definition | DataDefinition | Yes (for calculate_correlations) | Full column definitions for selecting correlation features.
task | Optional[str] | No | Task type ("classification" or "regression") to determine target/prediction treatment.
column | pd.Series | Yes (for correlation functions) | The column to compute correlations for.
features | pd.DataFrame | Yes (for correlation functions) | DataFrame of features to correlate against.
max_categories | Optional[int] | No | Maximum number of categories before relabeling (default 5).

Outputs

Name | Type | Description
FeatureQualityStats | FeatureQualityStats | Comprehensive quality statistics for a single feature.
DataQualityStats | DataQualityStats | Aggregated quality statistics for all features in a dataset.
correlations | Dict[str, pd.DataFrame] | Dictionary mapping correlation method names to correlation matrices.
ColumnCorrelations | ColumnCorrelations | Correlation values for one column against others, with method name.
ColumnDistribution | Dict | Value count distribution for a column.
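
The dictionary-style distribution output and the max_categories input can be sketched with pandas alone. This is an illustration under assumptions, not the library's code: `sketch_distribution` and `cap_categories` are hypothetical helpers, and the choice to keep the most frequent categories and relabel the rest as "other" is assumed here, not confirmed by the source.

```python
import pandas as pd


def sketch_distribution(column: pd.Series) -> dict:
    """Hypothetical helper: value -> count mapping for one column."""
    return column.value_counts(dropna=True).to_dict()


def cap_categories(column: pd.Series, max_categories: int = 5,
                   other_label: str = "other") -> pd.Series:
    """Hypothetical helper: keep the most frequent categories, relabel the rest."""
    top = column.value_counts().index[:max_categories]
    # Keep top categories and missing values; everything else becomes other_label.
    return column.where(column.isin(top) | column.isna(), other_label)


cities = pd.Series(["NY", "NY", "LA", "SF", "LA", "NY", "Austin"])
distribution = sketch_distribution(cities)
capped = cap_categories(cities, max_categories=2)
print(distribution)
print(capped.tolist())
```

Capping before plotting keeps charts readable when a categorical column has a long tail of rare values.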

Usage Examples

import pandas as pd
from evidently.legacy.calculations.data_quality import (
    get_features_stats,
    calculate_data_quality_stats,
)
from evidently.legacy.core import ColumnType
from evidently.legacy.metric_results import DatasetColumns, DatasetUtilityColumns

# Get stats for a single numeric feature
series = pd.Series([1.0, 2.0, None, 4.0, 5.0])
stats = get_features_stats(series, ColumnType.Numerical)
print(f"Count: {stats.count}, Missing: {stats.missing_count}, Mean: {stats.mean}")

# Get quality stats for an entire dataset
columns = DatasetColumns(
    utility_columns=DatasetUtilityColumns(date=None, id=None, target="target", prediction=None),
    target_type=None,
    num_feature_names=["age", "income"],
    cat_feature_names=["gender"],
    text_feature_names=[],
    datetime_feature_names=[],
    target_names=None,
    task="classification",
)

data = pd.DataFrame({
    "age": [25, 30, 35, None, 45],
    "income": [50000, 60000, 70000, 80000, 90000],
    "gender": ["M", "F", "M", "F", "M"],
    "target": [0, 1, 0, 1, 0],
})

quality = calculate_data_quality_stats(data, columns, task="classification")
print(f"Rows: {quality.rows_count}")
for name, feat_stats in quality.num_features_stats.items():
    print(f"  {name}: mean={feat_stats.mean}, missing={feat_stats.missing_count}")
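
The idea behind correlating one numerical column against a set of features can also be shown without Evidently. The sketch below is an assumption-laden illustration: `sketch_numerical_correlation` is a hypothetical helper that returns plain dictionaries, not ColumnCorrelations objects. It covers Pearson and Spearman only, with Spearman computed as Pearson on ranks, and it leaves out Kendall to stay within pandas.

```python
import pandas as pd


def sketch_numerical_correlation(column: pd.Series, features: pd.DataFrame) -> dict:
    """Hypothetical helper: correlate one column against numeric features."""
    numeric = features.select_dtypes(include="number")
    return {
        # Pearson: linear association on the raw values.
        "pearson": numeric.corrwith(column).to_dict(),
        # Spearman: Pearson correlation of the ranks (monotonic association).
        "spearman": numeric.rank().corrwith(column.rank()).to_dict(),
    }


features = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [1.0, 4.0, 9.0, 16.0],
    "z": [4.0, 3.0, 2.0, 1.0],
})
target = pd.Series([2.0, 4.0, 6.0, 8.0])
result = sketch_numerical_correlation(target, features)
print(result)
```

The column `y` grows monotonically but not linearly with the target, so its Spearman value is higher than its Pearson value, which is the usual reason for reporting more than one method.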
