
Heuristic: Scikit-learn-contrib Imbalanced-learn Sampling Before Split Leakage

From Leeroopedia




Knowledge Sources
Domains Imbalanced_Classification, Cross_Validation
Last Updated 2026-02-09 03:00 GMT

Overview

Resampling before the train-test split causes data leakage; always use an `imblearn.pipeline.Pipeline` so resampling is applied only within training folds.

Description

A critical and common mistake when working with imbalanced datasets is to resample the entire dataset before splitting it into train and test partitions. This causes data leakage because: (1) the model is tested on artificially balanced data rather than the natural imbalanced distribution, and (2) the resampling algorithm may use information from samples that end up in the test set to generate or select training samples. The correct approach is to use `imblearn.pipeline.Pipeline` which ensures resampling is applied only to the training folds during cross-validation.
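
To see mechanism (2) concretely, here is a minimal sketch using only NumPy and scikit-learn (the hand-rolled oversampling below is an illustrative stand-in for a random oversampler, not imblearn's implementation): duplicating minority rows before the split puts exact copies of future test rows into the training set.

```python
# Illustrative sketch (not imblearn code): random oversampling duplicates
# minority rows, so resampling BEFORE the split leaks copies of test rows
# into the training set.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([0] * 900 + [1] * 100)          # 9:1 imbalance

# Naive random oversampling of the FULL dataset (the wrong order)
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=800, replace=True)
X_res = np.vstack([X, X[extra]])
y_res = np.concatenate([y, y[extra]])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_res, y_res, test_size=0.25, random_state=0, stratify=y_res
)

# Count test rows that have an exact byte-for-byte duplicate in training
train_rows = {row.tobytes() for row in X_tr}
leaked = sum(row.tobytes() in train_rows for row in X_te)
print(f"test rows duplicated in train: {leaked}")
```

Every leaked row lets the model "memorize" part of its own test set, which is exactly what the Pipeline ordering prevents.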

Usage

Apply this heuristic whenever combining resampling with cross-validation or train-test evaluation. If you are calling `fit_resample()` manually before `cross_validate()` or `train_test_split()`, you are likely leaking data. Use `imblearn.pipeline.make_pipeline` to wrap the sampler and estimator together.

The Insight (Rule of Thumb)

  • Action: Never call `sampler.fit_resample(X, y)` on the full dataset before splitting. Instead, wrap the sampler and classifier in an `imblearn.pipeline.Pipeline`.
  • Value: Wrong approach: CV = 0.724 vs. left-out = 0.698 (a 2.6-point gap revealing leakage). Correct approach: CV = 0.732 vs. left-out = 0.727 (consistent results).
  • Trade-off: None. Using Pipeline is strictly better; it prevents leakage and adds no computational cost.
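
The rule of thumb can be checked on synthetic data. A minimal sketch, assuming only NumPy and scikit-learn are available (the `oversample` helper is a hypothetical stand-in for a random oversampler; a 1-nearest-neighbor classifier is used because exact duplicates make the leak obvious):

```python
# Over-optimism demo: cross-validating ON resampled data vs. scoring on an
# untouched, naturally imbalanced test set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(
    n_samples=2000, weights=[0.9, 0.1], flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

rng = np.random.default_rng(0)

def oversample(X, y):
    """Duplicate minority rows until classes are balanced (illustrative)."""
    minority = np.flatnonzero(y == 1)
    n_extra = (y == 0).sum() - minority.size
    extra = rng.choice(minority, size=n_extra, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

clf = KNeighborsClassifier(n_neighbors=1)

# WRONG: oversample first, then cross-validate on the resampled data;
# duplicated rows straddle CV folds, so 1-NN finds exact copies at test time
X_leak, y_leak = oversample(X_tr, y_tr)
leaky_cv = cross_val_score(clf, X_leak, y_leak, scoring="balanced_accuracy").mean()

# HONEST: fit on resampled training data, score on the untouched test set
honest = balanced_accuracy_score(y_te, clf.fit(X_leak, y_leak).predict(X_te))
print(f"leaky CV: {leaky_cv:.3f}  honest test: {honest:.3f}")
```

The leaky cross-validation score comes out higher than the honest held-out score, mirroring the gap the documentation reports.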

Reasoning

The documentation (`doc/common_pitfalls.rst`) provides empirical evidence using the adult census dataset with `RandomUnderSampler`:

Wrong pattern (resample then cross-validate):

  • Cross-validation balanced accuracy: 0.724 +/- 0.042
  • Left-out test set balanced accuracy: 0.698 +/- 0.014
  • The 2.6-point gap between CV and left-out performance reveals over-optimistic CV results

Correct pattern (Pipeline handles resampling per fold):

  • Cross-validation balanced accuracy: 0.732 +/- 0.019
  • Left-out test set balanced accuracy: 0.727 +/- 0.008
  • Results are consistent, with no sign of data leakage

The Pipeline applies resampling only during `fit()`, never on test folds, preserving the natural class distribution for evaluation.
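
What the Pipeline automates can be sketched by hand: resample only the training indices of each fold and score on the untouched test fold. This is an illustrative pure scikit-learn loop, not imblearn's implementation; the inline undersampling stands in for `RandomUnderSampler`:

```python
# Hand-rolled version of what imblearn's Pipeline automates: resample the
# TRAINING portion of each fold only; test folds keep the natural distribution.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
rng = np.random.default_rng(0)
scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    y_tr = y[train_idx]
    # Random undersampling of the majority class, train fold only
    maj = train_idx[y_tr == 0]
    mino = train_idx[y_tr == 1]
    keep = np.concatenate([rng.choice(maj, size=mino.size, replace=False), mino])
    clf = LogisticRegression(max_iter=1000).fit(X[keep], y[keep])
    # Test fold is untouched: natural imbalance preserved for evaluation
    scores.append(balanced_accuracy_score(y[test_idx], clf.predict(X[test_idx])))
print(f"per-fold balanced accuracy: {np.round(scores, 3)}")
```

Wrapping the sampler and classifier in an `imblearn.pipeline.Pipeline` gives exactly this behavior without the manual bookkeeping.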

Code Evidence

Wrong pattern from `doc/common_pitfalls.rst:116-130`:

# WRONG: resampling entire dataset before cross-validation
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_validate

sampler = RandomUnderSampler(random_state=0)
X_resampled, y_resampled = sampler.fit_resample(X, y)
model = HistGradientBoostingClassifier(random_state=0)
cv_results = cross_validate(
    model, X_resampled, y_resampled, scoring="balanced_accuracy",
    return_train_score=True, return_estimator=True, n_jobs=-1
)
# Balanced accuracy: 0.724 +/- 0.042 (over-optimistic!)

Correct pattern from `doc/common_pitfalls.rst:157-172`:

# CORRECT: Pipeline ensures resampling only on training folds
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_validate

model = make_pipeline(
    RandomUnderSampler(random_state=0),
    HistGradientBoostingClassifier(random_state=0)
)
cv_results = cross_validate(
    model, X, y, scoring="balanced_accuracy",
    return_train_score=True, return_estimator=True, n_jobs=-1
)
# Balanced accuracy: 0.732 +/- 0.019 (consistent with left-out: 0.727)
