Implementation:Evidentlyai Evidently Legacy Data Quality Calculations
| Knowledge Sources | |
|---|---|
| Domains | ML Monitoring, Data Quality |
| Last Updated | 2026-02-14 12:00 GMT |
Overview
Provides calculation functions for data quality statistics, including per-feature descriptive statistics, correlation analysis (Pearson, Spearman, Kendall, Cramer's V), and distribution computation for the legacy Evidently pipeline.
Description
This module computes comprehensive data quality metrics for individual features and entire datasets. It is used by Evidently's legacy data quality metrics to generate statistics about missing values, unique counts, distributions, and inter-feature correlations.
Data models:
- FeatureQualityStats -- A dataclass capturing quality statistics for a single feature. Metrics vary by feature type:
  - All types: count, missing_count, missing_percentage, unique_count, unique_percentage, most_common_value, most_common_value_percentage.
  - Numeric: infinite_count, infinite_percentage, min, max, mean, std, percentile_25, percentile_50, percentile_75.
  - Categorical: new_in_current_values_count, unused_in_current_values_count (for reference-current comparison).
  - Datetime: max and min cast to string representation.
- DataQualityStats -- Aggregates FeatureQualityStats by feature category: num_features_stats, cat_features_stats, datetime_features_stats, target_stats, and prediction_stats. Supports dictionary-style access and a get_all_features() method.
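To make the per-type split above concrete, here is a minimal sketch in plain pandas of how a few of these statistics could be derived. The helper name `sketch_feature_stats` is hypothetical and this is not the Evidently implementation. Computing the summary statistics over finite values only is an assumption made for illustration.

```python
import numpy as np
import pandas as pd


def sketch_feature_stats(feature: pd.Series) -> dict:
    """Hypothetical helper, not the Evidently implementation."""
    rows = len(feature)
    missing = int(feature.isna().sum())
    counts = feature.value_counts(dropna=True)
    # Statistics that make sense for every feature type.
    stats = {
        "number_of_rows": rows,
        "count": int(feature.count()),  # non-missing values
        "missing_count": missing,
        "missing_percentage": round(100 * missing / rows, 2) if rows else None,
        "unique_count": int(feature.nunique()),
        "most_common_value": counts.index[0] if len(counts) else None,
    }
    # Numeric-only statistics.
    if pd.api.types.is_numeric_dtype(feature):
        values = feature.astype(float)
        finite = values[np.isfinite(values)]
        stats["infinite_count"] = int(np.isinf(values).sum())
        # Assumption: summary statistics are taken over finite values only.
        stats["mean"] = float(finite.mean()) if len(finite) else None
        stats["percentile_50"] = float(finite.quantile(0.5)) if len(finite) else None
    return stats


example = sketch_feature_stats(pd.Series([1.0, 2.0, None, 4.0, 4.0, np.inf]))
print(example)
```

The real dataclass carries more fields (std, min, max, the remaining percentiles and the percentage variants). The sketch only shows the pattern of shared statistics plus a numeric-only branch.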
Key functions:
- get_rows_count -- Returns the number of rows in a DataFrame or Series.
- get_features_stats -- Computes FeatureQualityStats for a single feature Series given its column type.
- calculate_data_quality_stats -- Orchestrates quality stat computation across all features in a dataset, categorized by feature type and role (target, prediction).
- prepare_data_for_plots -- Relabels categorical data for plotting, capping at a maximum number of categories.
- _select_features_for_corr -- Selects numerical features (for Pearson/Spearman/Kendall) and categorical features (for Cramer's V) that have more than one unique value.
- _cramer_v -- Computes Cramer's V association measure between two categorical Series using chi-squared contingency.
- get_pairwise_correlation -- Computes a symmetric pairwise correlation matrix using a given correlation function.
- _calculate_correlations -- Dispatches correlation computation based on method (pearson, spearman, kendall, cramer_v).
- calculate_correlations -- Computes all four correlation matrices for a dataset.
- calculate_cramer_v_correlation -- Computes Cramer's V correlation for one column against a list of other columns, returning a ColumnCorrelations result.
- calculate_category_correlation -- Computes Cramer's V for a category column against all features.
- calculate_numerical_correlation -- Computes Pearson, Spearman, and Kendall correlations for a numerical column against all features.
- calculate_column_distribution -- Computes value counts for a column, returning a dictionary distribution.
- get_corr_method -- Utility to select the correlation method with fallback logic.
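The Cramer's V and pairwise-matrix steps listed above can be sketched as follows. This is an illustrative version under stated assumptions, not the module's code. The names `cramers_v_sketch` and `pairwise_matrix_sketch` are hypothetical, the chi-squared statistic is computed directly with numpy without a continuity correction, and the diagonal is simply set to 1.0.

```python
import numpy as np
import pandas as pd


def cramers_v_sketch(x: pd.Series, y: pd.Series) -> float:
    """Uncorrected Cramer's V from a contingency table (hypothetical helper)."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    min_dim = min(observed.shape) - 1
    if n == 0 or min_dim <= 0:
        return float("nan")  # undefined when one variable is constant
    # Expected counts under independence: row_total * col_total / n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * min_dim)))


def pairwise_matrix_sketch(df: pd.DataFrame, func) -> pd.DataFrame:
    """Fill a symmetric matrix by evaluating func once per column pair."""
    cols = list(df.columns)
    # Assumption: the diagonal is fixed at 1.0 rather than computed.
    result = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            value = func(df[a], df[b])
            result.loc[a, b] = value
            result.loc[b, a] = value
    return result


frame = pd.DataFrame({
    "color": ["r", "g", "r", "g"],
    "size": ["S", "L", "S", "L"],   # fully determined by color
    "shape": ["a", "a", "b", "b"],  # independent of color
})
matrix = pairwise_matrix_sketch(frame, cramers_v_sketch)
print(matrix)
```

Evaluating each pair once and mirroring the value keeps the matrix symmetric by construction, which is only valid for symmetric measures such as Cramer's V.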
Usage
Used internally by Evidently's DataQualityMetric, ColumnSummaryMetric, and ColumnCorrelationsMetric. Can also be called directly for standalone data quality analysis of pandas DataFrames.
Code Reference
Source Location
- Repository: Evidentlyai_Evidently
- File: src/evidently/legacy/calculations/data_quality.py
Signature
@dataclasses.dataclass
class FeatureQualityStats:
    feature_type: str
    number_of_rows: int = 0
    count: int = 0
    infinite_count: Optional[int] = None
    infinite_percentage: Optional[float] = None
    missing_count: Optional[int] = None
    missing_percentage: Optional[float] = None
    unique_count: Optional[int] = None
    unique_percentage: Optional[float] = None
    percentile_25: Optional[float] = None
    percentile_50: Optional[float] = None
    percentile_75: Optional[float] = None
    max: Optional[Union[int, float, bool, str]] = None
    min: Optional[Union[int, float, bool, str]] = None
    mean: Optional[float] = None
    std: Optional[float] = None
    most_common_value: Optional[Union[int, float, bool, str]] = None
    most_common_value_percentage: Optional[float] = None
    ...

@dataclasses.dataclass
class DataQualityStats:
    rows_count: int
    num_features_stats: Optional[Dict[str, FeatureQualityStats]] = None
    cat_features_stats: Optional[Dict[str, FeatureQualityStats]] = None
    datetime_features_stats: Optional[Dict[str, FeatureQualityStats]] = None
    target_stats: Optional[Dict[str, FeatureQualityStats]] = None
    prediction_stats: Optional[Dict[str, FeatureQualityStats]] = None

def get_rows_count(data: Union[pd.DataFrame, pd.Series]) -> int: ...

def get_features_stats(feature: pd.Series, feature_type: ColumnType) -> FeatureQualityStats: ...

def calculate_data_quality_stats(
    dataset: pd.DataFrame, columns: DatasetColumns, task: Optional[str]
) -> DataQualityStats: ...

def calculate_correlations(
    dataset: pd.DataFrame, data_definition: DataDefinition, add_text_columns: Optional[list] = None
) -> Dict: ...

def calculate_cramer_v_correlation(
    column_name: str, dataset: pd.DataFrame, columns: List[str]
) -> ColumnCorrelations: ...

def calculate_category_correlation(
    column_display_name: str, column: pd.Series, features: pd.DataFrame,
) -> List[ColumnCorrelations]: ...

def calculate_numerical_correlation(
    column_display_name: str, column: Optional[pd.Series], features: pd.DataFrame,
) -> List[ColumnCorrelations]: ...

def calculate_column_distribution(column: pd.Series, column_type: str) -> ColumnDistribution: ...
Import
from evidently.legacy.calculations.data_quality import (
    FeatureQualityStats,
    DataQualityStats,
    get_rows_count,
    get_features_stats,
    calculate_data_quality_stats,
    calculate_correlations,
    calculate_cramer_v_correlation,
    calculate_category_correlation,
    calculate_numerical_correlation,
    calculate_column_distribution,
    prepare_data_for_plots,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| feature | pd.Series | Yes (for get_features_stats) | A single feature column to compute statistics for. |
| feature_type | ColumnType | Yes | The type of the feature (Numerical, Categorical, Datetime). |
| dataset | pd.DataFrame | Yes (for dataset-level functions) | Full dataset for quality or correlation computation. |
| columns | DatasetColumns | Yes (for calculate_data_quality_stats) | Metadata describing column roles and types. |
| data_definition | DataDefinition | Yes (for calculate_correlations) | Full column definitions for selecting correlation features. |
| task | Optional[str] | No | Task type ("classification" or "regression") to determine target/prediction treatment. |
| column | pd.Series | Yes (for correlation functions) | The column to compute correlations for. |
| features | pd.DataFrame | Yes (for correlation functions) | DataFrame of features to correlate against. |
| max_categories | Optional[int] | No | Maximum number of categories before relabeling (default 5). |
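As an illustration of what the max_categories cap means, the sketch below keeps the most frequent categories and relabels the rest. It is an assumption about the general technique, not a description of what prepare_data_for_plots does internally. The helper name `relabel_rare_categories` and the `"other"` label are hypothetical.

```python
import pandas as pd


def relabel_rare_categories(
    column: pd.Series, max_categories: int = 5, other_label: str = "other"
) -> pd.Series:
    """Keep the most frequent categories; relabel the rest (hypothetical helper)."""
    # value_counts sorts by frequency, so the first entries are the top categories.
    top = column.value_counts().index[:max_categories]
    # Leave missing values untouched instead of folding them into other_label.
    keep = column.isin(top) | column.isna()
    return column.where(keep, other_label)


relabeled = relabel_rare_categories(pd.Series(list("aaabbcd")), max_categories=2)
print(relabeled.tolist())
```

Capping the category count keeps bar charts readable when a column has a long tail of rare values.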
Outputs
| Name | Type | Description |
|---|---|---|
| FeatureQualityStats | FeatureQualityStats | Comprehensive quality statistics for a single feature. |
| DataQualityStats | DataQualityStats | Aggregated quality statistics for all features in a dataset. |
| correlations | Dict[str, pd.DataFrame] | Dictionary mapping correlation method names to correlation matrices. |
| ColumnCorrelations | ColumnCorrelations | Correlation values for one column against others, with method name. |
| ColumnDistribution | Dict | Value count distribution for a column. |
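The shapes of the two dictionary outputs in the table can be previewed with plain pandas. This is a rough sketch rather than the module's code: `distribution_sketch` and `correlations_sketch` are hypothetical names, and only the Pearson and Spearman methods are shown.

```python
import pandas as pd


def distribution_sketch(column: pd.Series) -> dict:
    """Map each observed value to its count (hypothetical helper)."""
    return column.value_counts(dropna=True).to_dict()


def correlations_sketch(numeric: pd.DataFrame) -> dict:
    """Map a method name to the correlation matrix computed with that method."""
    return {method: numeric.corr(method=method) for method in ("pearson", "spearman")}


distribution = distribution_sketch(pd.Series(["M", "F", "M", "F", "M"]))
matrices = correlations_sketch(pd.DataFrame({"x": [1, 2, 3, 4], "y": [1, 4, 9, 16]}))
print(distribution)
print(matrices["spearman"])
```

Because y is a monotonic but non-linear function of x here, the Spearman coefficient is 1 while the Pearson coefficient is slightly below 1. This is the kind of difference that motivates reporting several methods side by side.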
Usage Examples
import pandas as pd
from evidently.legacy.calculations.data_quality import (
    get_features_stats,
    calculate_data_quality_stats,
    calculate_correlations,
)
from evidently.legacy.core import ColumnType
from evidently.legacy.metric_results import DatasetColumns, DatasetUtilityColumns

# Get stats for a single numeric feature
series = pd.Series([1.0, 2.0, None, 4.0, 5.0])
stats = get_features_stats(series, ColumnType.Numerical)
print(f"Count: {stats.count}, Missing: {stats.missing_count}, Mean: {stats.mean}")

# Get quality stats for an entire dataset
columns = DatasetColumns(
    utility_columns=DatasetUtilityColumns(date=None, id=None, target="target", prediction=None),
    target_type=None,
    num_feature_names=["age", "income"],
    cat_feature_names=["gender"],
    text_feature_names=[],
    datetime_feature_names=[],
    target_names=None,
    task="classification",
)
data = pd.DataFrame({
    "age": [25, 30, 35, None, 45],
    "income": [50000, 60000, 70000, 80000, 90000],
    "gender": ["M", "F", "M", "F", "M"],
    "target": [0, 1, 0, 1, 0],
})
quality = calculate_data_quality_stats(data, columns, task="classification")
print(f"Rows: {quality.rows_count}")
for name, feat_stats in quality.num_features_stats.items():
    print(f"  {name}: mean={feat_stats.mean}, missing={feat_stats.missing_count}")
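The correlation helpers only consider features with more than one unique value. The sketch below approximates that selection step on a similar DataFrame without calling Evidently. The `constant` column and the dtype-based split are assumptions made for illustration, since the module itself selects features from the column definitions.

```python
import pandas as pd

data = pd.DataFrame({
    "age": [25, 30, 35, None, 45],
    "income": [50000, 60000, 70000, 80000, 90000],
    "constant": [1, 1, 1, 1, 1],  # hypothetical column with a single value
    "gender": ["M", "F", "M", "F", "M"],
})

# Assumption: numeric features are picked by dtype, everything else is categorical.
numeric = data.select_dtypes(include="number")
categorical = data.select_dtypes(exclude="number")

# Keep only columns with more than one distinct non-missing value.
num_for_corr = [name for name in numeric.columns if numeric[name].nunique() > 1]
cat_for_corr = [name for name in categorical.columns if categorical[name].nunique() > 1]

pearson = numeric[num_for_corr].corr(method="pearson")
print(num_for_corr, cat_for_corr)
print(pearson)
```

Dropping single-valued columns avoids undefined coefficients, because a constant column has zero variance and cannot be correlated with anything.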
Related Pages
- Environment:Evidentlyai_Evidently_Python_Core_Environment
- Evidentlyai_Evidently_Legacy_Base_Metric -- Base classes used by data quality metrics
- Evidentlyai_Evidently_Legacy_Metric_Results -- ColumnCorrelations, Distribution, and related result models
- Evidentlyai_Evidently_Legacy_Data_Drift_Calculations -- Complementary drift calculations