Heuristic: Evidently Statistical Test Auto-Selection
| Knowledge Sources | |
|---|---|
| Domains | Data_Drift, Statistics |
| Last Updated | 2026-02-14 10:00 GMT |
Overview
Decision framework for automatically selecting the optimal statistical test for data drift detection based on sample size, feature type, and number of unique values.
Description
Evidently uses a sophisticated decision tree to automatically select the most appropriate statistical test when no explicit test is specified by the user. The selection considers three factors: (1) the feature type (Numerical, Categorical, or Text), (2) the sample size (threshold at 1000 observations), and (3) the number of unique values (threshold at 5 for numerical features, threshold at 2 for binary detection). This heuristic encodes domain expertise about which statistical tests perform best under different data characteristics.
Usage
This heuristic applies whenever you run a data drift report or metric without explicitly specifying a statistical test via `DataDriftOptions`. Understanding this selection logic is essential for interpreting drift results and knowing when to override the default test choice.
The Insight (Rule of Thumb)
- Small samples (n <= 1000), Numerical, > 5 unique values: Kolmogorov-Smirnov test (`ks`) — default threshold: 0.05
- Small samples (n <= 1000), Numerical, <= 5 unique values, > 2 unique: Chi-squared test (`chisquare`) — default threshold: 0.05
- Small samples (n <= 1000), Numerical, <= 2 unique values: Z-test (`z`) — default threshold: 0.05
- Small samples (n <= 1000), Categorical, > 2 unique values: Chi-squared test (`chisquare`) — default threshold: 0.05
- Small samples (n <= 1000), Categorical, <= 2 unique values: Z-test (`z`) — default threshold: 0.05
- Large samples (n > 1000), Numerical, > 5 unique values: Wasserstein distance (`wasserstein`) — default threshold: 0.1
- Large samples (n > 1000), Numerical, <= 5 unique values: Jensen-Shannon divergence (`jensenshannon`) — default threshold: 0.1
- Large samples (n > 1000), Categorical: Jensen-Shannon divergence (`jensenshannon`) — default threshold: 0.1
- Text, n <= 1000: Percentage-based text content drift (`perc_text_content_drift`) — default threshold: 0.55
- Text, n > 1000: Absolute text content drift (`abs_text_content_drift`) — default threshold: 0.55
- Trade-off: The 1000-sample threshold balances the strengths of each family: classical hypothesis tests are well-calibrated at small sample sizes, while distance-based measures avoid the over-sensitivity of p-values at large sample sizes and scale better computationally.
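The decision table above can be sketched as a plain function. This is a re-implementation of the rules for illustration only; the function name `select_stattest` and the string labels are mine, not part of the Evidently API:

```python
def select_stattest(feature_type: str, n_samples: int, n_unique: int) -> tuple:
    """Return (test_name, default_threshold) per the auto-selection heuristic."""
    if feature_type == "text":
        if n_samples > 1000:
            return ("abs_text_content_drift", 0.55)
        return ("perc_text_content_drift", 0.55)
    if n_samples <= 1000:
        # small samples: classical hypothesis tests
        if feature_type == "num" and n_unique > 5:
            return ("ks", 0.05)
        # low-cardinality numerical or categorical
        return ("chisquare", 0.05) if n_unique > 2 else ("z", 0.05)
    # large samples: distance-based measures
    if feature_type == "num" and n_unique > 5:
        return ("wasserstein", 0.1)
    return ("jensenshannon", 0.1)

print(select_stattest("num", 500, 100))   # ('ks', 0.05)
print(select_stattest("num", 5000, 100))  # ('wasserstein', 0.1)
```

Note how a continuous numerical feature switches from `ks` to `wasserstein` purely on sample size, which also changes the meaning of the threshold (p-value vs. distance).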
Reasoning
The selection follows established statistical best practices:
Small samples (n <= 1000): Classical hypothesis tests (KS, Chi-squared, Z) are chosen because they have well-understood p-value distributions and Type I error control at small sample sizes. KS is the default for continuous numerical features because it is non-parametric and detects any kind of distributional shift. Chi-squared and Z-tests are used for low-cardinality features because they are designed for discrete distributions.
Large samples (n > 1000): Distance-based measures (Wasserstein, Jensen-Shannon) are preferred because classical hypothesis tests become overly sensitive at large sample sizes — they detect statistically significant but practically insignificant drift. Wasserstein distance measures the "earth mover's distance" which has an intuitive interpretation as the cost of transforming one distribution into another. Jensen-Shannon divergence is a symmetric, bounded measure suitable for discrete distributions.
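The over-sensitivity problem can be seen directly with SciPy (an illustration, not Evidently code). A negligible mean shift at n = 50,000 makes the KS p-value collapse below 0.05, while the Wasserstein distance stays under Evidently's 0.1 default. One caveat: Evidently applies its threshold to a normalized Wasserstein distance; with unit-variance data, the raw SciPy value is comparable.

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.00, scale=1.0, size=50_000)
current = rng.normal(loc=0.05, scale=1.0, size=50_000)  # tiny, practically negligible shift

p_value = ks_2samp(reference, current).pvalue
distance = wasserstein_distance(reference, current)

print(p_value < 0.05)  # True: KS flags "drift" at large n
print(distance < 0.1)  # True: Wasserstein stays below the 0.1 threshold
```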
Text features: Use a domain classifier approach (ROC AUC of a classifier trained to distinguish reference from current data). For small text datasets, percentage-based drift is more stable; for large datasets, absolute measurement is preferred.
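A toy version of the domain-classifier idea, assuming scikit-learn is available: label reference texts 0 and current texts 1, train a classifier, and read drift from the held-out ROC AUC (an AUC near 0.5 means the two datasets are indistinguishable). The corpora and the interleaved split below are contrived for the sketch; Evidently's actual pipeline is shown in the Code Evidence section.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline

# two clearly different topical corpora, standing in for reference/current data
reference = [f"weather forecast sunny rain cloud temperature {i}" for i in range(40)]
current = [f"market stock bond yield inflation earnings {i}" for i in range(40)]

texts = reference + current
labels = [0] * len(reference) + [1] * len(current)
# simple interleaved train/test split, balanced across both domains
X_train, y_train = texts[::2], labels[::2]
X_test, y_test = texts[1::2], labels[1::2]

pipeline = Pipeline([
    ("vectorization", TfidfVectorizer(sublinear_tf=True)),
    ("classification", SGDClassifier(loss="modified_huber", random_state=42)),
])
pipeline.fit(X_train, y_train)
auc = roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1])
print(auc)  # close to 1.0: the corpora are easily distinguishable, i.e. drifted
```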
The unique value count threshold of 5 distinguishes between truly continuous and quasi-categorical numerical features (e.g., a rating from 1-5 should be treated categorically). The binary threshold of 2 identifies features where a simpler proportions test (Z-test) is more appropriate than Chi-squared.
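For the binary case, a standard two-proportion z-test can be sketched with the standard library alone (the function name is mine, not Evidently's; Evidently's own z-test implementation lives in its stattests module):

```python
from math import erf, sqrt

def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> float:
    """Two-sided p-value for H0: the two success proportions are equal."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# 50/100 positives in reference vs 60/100 in current: not significant at 0.05
print(round(two_proportion_z_test(50, 100, 60, 100), 3))  # 0.155
```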
Code Evidence
Default statistical test selection from `src/evidently/legacy/calculations/stattests/registry.py:137-160`:
```python
def _get_default_stattest(reference_data, current_data, feature_type):
    n_values = pd.concat([reference_data, current_data]).nunique()
    if feature_type == ColumnType.Text:
        if reference_data.shape[0] > 1000:
            return stattests.abs_text_content_drift_stat_test
        return stattests.perc_text_content_drift_stat_test
    elif reference_data.shape[0] <= 1000:
        if feature_type == ColumnType.Numerical:
            if n_values <= 5:
                return stattests.chi_stat_test if n_values > 2 else stattests.z_stat_test
            elif n_values > 5:
                return stattests.ks_stat_test
        elif feature_type == ColumnType.Categorical:
            return stattests.chi_stat_test if n_values > 2 else stattests.z_stat_test
    elif reference_data.shape[0] > 1000:
        if feature_type == ColumnType.Numerical:
            if n_values <= 5:
                return stattests.jensenshannon_stat_test
            elif n_values > 5:
                return stattests.wasserstein_stat_test
        elif feature_type == ColumnType.Categorical:
            return stattests.jensenshannon_stat_test
```
Default threshold of 0.05 from `src/evidently/legacy/calculations/stattests/registry.py:38`:
```python
@dataclasses.dataclass
class StatTest:
    name: str
    display_name: str
    allowed_feature_types: List[ColumnType]
    default_threshold: float = 0.05
```
Text drift classifier parameters from `src/evidently/legacy/utils/data_drift_utils.py:105-114`:
```python
def roc_auc_domain_classifier(X_train, X_test, y_train, y_test):
    pipeline = Pipeline([
        ("vectorization", TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words="english")),
        ("classification", SGDClassifier(alpha=0.0001, max_iter=50, penalty="l1",
                                         loss="modified_huber", random_state=42)),
    ])
```