# Heuristic: Imbalanced-learn (scikit-learn-contrib) Sampling-Before-Split Leakage
| Knowledge Sources | |
|---|---|
| Domains | Imbalanced_Classification, Cross_Validation |
| Last Updated | 2026-02-09 03:00 GMT |
## Overview
Resampling before train-test splitting causes data leakage; always use imblearn Pipeline to resample only within training folds.
## Description
A critical and common mistake when working with imbalanced datasets is to resample the entire dataset before splitting it into train and test partitions. This causes data leakage because: (1) the model is tested on artificially balanced data rather than the natural imbalanced distribution, and (2) the resampling algorithm may use information from samples that end up in the test set to generate or select training samples. The correct approach is to use `imblearn.pipeline.Pipeline` which ensures resampling is applied only to the training folds during cross-validation.
## Usage
Apply this heuristic whenever combining resampling with cross-validation or train-test evaluation. If you are calling `fit_resample()` manually before `cross_validate()` or `train_test_split()`, you are likely leaking data. Use `imblearn.pipeline.make_pipeline` to wrap the sampler and estimator together.
## The Insight (Rule of Thumb)
- Action: Never call `sampler.fit_resample(X, y)` on the full dataset before splitting. Instead, wrap the sampler and classifier in an `imblearn.pipeline.Pipeline`.
- Value: Wrong approach: CV = 0.724 vs. left-out = 0.698 (a 2.6 percentage-point gap reveals leakage). Correct approach: CV = 0.732 vs. left-out = 0.727 (consistent results).
- Trade-off: Effectively none. Using Pipeline prevents leakage, and the only overhead is that resampling runs once per fold instead of once overall, which is negligible for typical samplers.
## Reasoning
The documentation (`doc/common_pitfalls.rst`) provides empirical evidence using the adult census dataset with `RandomUnderSampler`:
Wrong pattern (resample then cross-validate):
- Cross-validation balanced accuracy: 0.724 +/- 0.042
- Left-out test set accuracy: 0.698 +/- 0.014
- The 2.6 percentage-point gap between CV and left-out performance reveals over-optimistic CV results
Correct pattern (Pipeline handles resampling per fold):
- Cross-validation balanced accuracy: 0.732 +/- 0.019
- Left-out test set accuracy: 0.727 +/- 0.008
- Results are consistent, with no sign of data leakage
The Pipeline applies resampling only during `fit()`, never on test folds, preserving the natural class distribution for evaluation.
## Code Evidence
Wrong pattern from `doc/common_pitfalls.rst:116-130`:
```python
# WRONG: resampling the entire dataset before cross-validation
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_validate

sampler = RandomUnderSampler(random_state=0)
X_resampled, y_resampled = sampler.fit_resample(X, y)
model = HistGradientBoostingClassifier(random_state=0)
cv_results = cross_validate(
    model, X_resampled, y_resampled, scoring="balanced_accuracy",
    return_train_score=True, return_estimator=True, n_jobs=-1,
)
# Balanced accuracy: 0.724 +/- 0.042 (over-optimistic!)
```
Correct pattern from `doc/common_pitfalls.rst:157-172`:
```python
# CORRECT: Pipeline ensures resampling happens only on training folds
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_validate

model = make_pipeline(
    RandomUnderSampler(random_state=0),
    HistGradientBoostingClassifier(random_state=0),
)
cv_results = cross_validate(
    model, X, y, scoring="balanced_accuracy",
    return_train_score=True, return_estimator=True, n_jobs=-1,
)
# Balanced accuracy: 0.732 +/- 0.019 (consistent with left-out: 0.727)
```