Principle:Scikit learn contrib Imbalanced learn Sampler Aware Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Software_Engineering, Data_Pipeline |
| Last Updated | 2026-02-09 03:00 GMT |
Overview
A pipeline abstraction that extends scikit-learn's Pipeline to support resampling steps alongside standard transformers and estimators, so that resampling is applied only to training data and data leakage during cross-validation is avoided.
Description
Standard scikit-learn pipelines only support intermediate steps that implement the fit/transform interface. When resampling is needed (e.g., SMOTE), applying it to the whole dataset before cross-validation causes data leakage: samples duplicated or synthesised from rows that end up in a test fold also appear in the training folds, and the test folds themselves no longer reflect the original class distribution. A sampler-aware pipeline accepts intermediate steps that implement fit_resample, so that resampling occurs only during fitting (on the training data) and is skipped during prediction and evaluation.
This is critical for valid model evaluation with imbalanced data.
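To make the leakage concrete, here is a minimal sketch that uses only the Python standard library. It assumes a toy dataset of (row_id, label) pairs and a hypothetical `random_oversample` helper that duplicates minority rows; this is a stand-in for a real sampler such as SMOTE, which synthesises new points instead of copying them. The helper names `split` and `leaked_ids` are likewise illustrative, not part of any library.

```python
import random

def random_oversample(rows, rng):
    """Hypothetical helper: duplicate minority rows until classes are balanced."""
    minority = [r for r in rows if r[1] == 1]
    majority = [r for r in rows if r[1] == 0]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return rows + extra

def split(rows, rng, test_fraction=0.3):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def leaked_ids(train, test):
    """Row IDs that appear in both the training and the test set."""
    return {r[0] for r in train} & {r[0] for r in test}

rng = random.Random(0)
# Each row is (row_id, label): 90 majority rows and 10 minority rows.
data = [(i, 0) for i in range(90)] + [(i, 1) for i in range(90, 100)]

# Wrong order: resample first, then split. Copies of a row can straddle the split.
train_bad, test_bad = split(random_oversample(data, rng), rng)
leak_bad = leaked_ids(train_bad, test_bad)

# Right order: split first, then resample the training part only.
train_raw, test_ok = split(data, rng)
train_ok = random_oversample(train_raw, rng)
leak_ok = leaked_ids(train_ok, test_ok)

print(len(leak_bad), len(leak_ok))
```

In the second ordering the test rows are never seen by the sampler, which is the behaviour a sampler-aware pipeline enforces automatically inside each cross-validation fold.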
Usage
Use this principle whenever combining resampling steps with preprocessing transformers and a final estimator. Always prefer the imbalanced-learn Pipeline over manually applying samplers, to avoid data leakage in cross-validation.
Theoretical Basis
The pipeline processes steps sequentially:
- For each intermediate step:
  - If the step has fit_resample: call it during fit only (resample the training data) and skip the step during predict
  - Otherwise, if the step has fit_transform: call fit_transform during fit and transform during predict
- The final step: call fit during training and predict during evaluation
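    # Abstract sampler-aware pipeline logic (NOT the real implementation)
    class SamplerAwarePipelineSketch:
        def __init__(self, intermediate_steps, final_estimator):
            self.intermediate_steps = intermediate_steps
            self.final_estimator = final_estimator

        def fit(self, X, y):
            for step in self.intermediate_steps:
                if hasattr(step, "fit_resample"):
                    # Samplers change the number of rows; run them during fit only
                    X, y = step.fit_resample(X, y)
                elif hasattr(step, "fit_transform"):
                    X = step.fit_transform(X, y)
            self.final_estimator.fit(X, y)
            return self

        def predict(self, X):
            for step in self.intermediate_steps:
                # Steps without transform (i.e. samplers) are skipped: no resampling
                if hasattr(step, "transform"):
                    X = step.transform(X)
            return self.final_estimator.predict(X)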
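The logic above can be exercised with toy steps to confirm that the sampler affects fitting only. This is a minimal sketch using plain Python lists; `DuplicateMinority`, `AddOne`, and `RecordingEstimator` are hypothetical classes written for this example, and the free functions `fit` and `predict` restate the abstract logic rather than the library's actual implementation.

```python
class DuplicateMinority:
    """Hypothetical sampler: exposes fit_resample only, no transform."""
    def fit_resample(self, X, y):
        extra = [(x, label) for x, label in zip(X, y) if label == 1]
        return X + [x for x, _ in extra], y + [label for _, label in extra]

class AddOne:
    """Hypothetical transformer: exposes fit_transform and transform."""
    def fit_transform(self, X, y=None):
        return self.transform(X)

    def transform(self, X):
        return [x + 1 for x in X]

class RecordingEstimator:
    """Hypothetical final estimator that records how many rows it sees."""
    def fit(self, X, y):
        self.n_fit_rows = len(X)
        return self

    def predict(self, X):
        self.n_predict_rows = len(X)
        return [0 for _ in X]

def fit(steps, final_estimator, X, y):
    for step in steps:
        if hasattr(step, "fit_resample"):
            X, y = step.fit_resample(X, y)  # resampling happens during fit only
        elif hasattr(step, "fit_transform"):
            X = step.fit_transform(X, y)
    final_estimator.fit(X, y)

def predict(steps, final_estimator, X):
    for step in steps:
        if hasattr(step, "transform"):  # the sampler has no transform, so it is skipped
            X = step.transform(X)
    return final_estimator.predict(X)

steps = [AddOne(), DuplicateMinority()]
estimator = RecordingEstimator()
fit(steps, estimator, X=[1, 2, 3, 4], y=[0, 0, 0, 1])
predictions = predict(steps, estimator, X=[5, 6, 7])
print(estimator.n_fit_rows, estimator.n_predict_rows)  # prints "5 3"
```

The estimator is fitted on five rows (four originals plus one duplicated minority row) but predicts on exactly the three rows it was given, which is the asymmetry the pipeline is designed to preserve.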