Principle:Scikit learn contrib Imbalanced learn Sampler Aware Pipeline

Knowledge Sources	imbalanced-learn Pipeline scikit-learn Pipeline
Domains	Machine_Learning, Software_Engineering, Data_Pipeline
Last Updated	2026-02-09 03:00 GMT

Overview

A pipeline abstraction that extends scikit-learn's Pipeline to support resampling steps alongside standard transformers and estimators, so that resampling is confined to training data and does not leak into held-out folds during cross-validation.

Description

Standard scikit-learn pipelines chain transformers (fit/transform) and end with a final estimator; they have no way to run a step that changes the number of samples. When resampling is needed (e.g., SMOTE), applying it outside the pipeline, before cross-validation splits the data, causes data leakage: held-out folds end up containing resampled points, and synthetic or duplicated training points can be derived from held-out samples. A sampler-aware pipeline adds support for steps with a fit_resample interface, so resampling occurs only during fitting (on training data) and is skipped during prediction and evaluation.

This is critical for valid model evaluation with imbalanced data.
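
The leakage described above can be illustrated with a small standard-library sketch. It is not taken from imbalanced-learn: the helper names oversample_by_duplication and count_leaked, the 90/10 toy dataset, and the 80/20 split are all assumptions made for illustration, and naive duplication stands in for a real sampler such as SMOTE.

```python
import random


def oversample_by_duplication(rows, rng):
    """Hypothetical naive sampler: duplicate minority rows until classes balance."""
    minority = [r for r in rows if r[1] == 1]
    majority = [r for r in rows if r[1] == 0]
    if not minority:
        return list(rows)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(rows) + extra


def count_leaked(train, test):
    """Count test rows whose sample id also appears in the training split."""
    train_ids = {r[0] for r in train}
    return sum(1 for r in test if r[0] in train_ids)


rng = random.Random(0)
# Each row is (sample_id, label): 90 majority rows and 10 minority rows.
data = [(i, 0) for i in range(90)] + [(i, 1) for i in range(90, 100)]

# Wrong order: resample first, then split.
resampled = oversample_by_duplication(data, rng)
rng.shuffle(resampled)
cut = int(0.8 * len(resampled))
leaked_before = count_leaked(resampled[:cut], resampled[cut:])

# Right order: split first, then resample only the training portion.
shuffled = list(data)
rng.shuffle(shuffled)
cut = int(0.8 * len(shuffled))
train, test = shuffled[:cut], shuffled[cut:]
train = oversample_by_duplication(train, rng)
leaked_after = count_leaked(train, test)

print("test rows also present in train (resample, then split):", leaked_before)
print("test rows also present in train (split, then resample):", leaked_after)
```

When the split comes first, no original sample can sit on both sides, so the second count is zero by construction. A sampler-aware pipeline enforces that ordering automatically inside each cross-validation fold.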

Usage

Use this principle whenever combining resampling steps with preprocessing transformers and a final estimator. Always prefer the imbalanced-learn Pipeline over manually applying samplers, to avoid data leakage in cross-validation.

Theoretical Basis

The pipeline processes steps sequentially:

For each intermediate step:
- If the step has fit_resample: call it during fit only (resample training data)
- Otherwise (a transformer): call fit_transform during fit and transform during predict
The final step: call fit during training, predict during evaluation

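# Abstract sampler-aware pipeline logic (a sketch, NOT the real implementation)
# `steps` is a list of step objects; the last one is the final estimator.
def fit(steps, X, y):
    intermediate_steps, final_estimator = steps[:-1], steps[-1]
    for step in intermediate_steps:
        if hasattr(step, "fit_resample"):
            X, y = step.fit_resample(X, y)  # Samplers run only during fit
        else:
            X = step.fit_transform(X, y)  # Transformers are fitted on training data
    final_estimator.fit(X, y)
    return steps

def predict(steps, X):
    intermediate_steps, final_estimator = steps[:-1], steps[-1]
    for step in intermediate_steps:
        if hasattr(step, "fit_resample"):
            continue  # No resampling during predict
        X = step.transform(X)  # Reuse what was learned during fit
    return final_estimator.predict(X)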
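
To make the dispatch logic concrete, here is a self-contained toy version that runs with the standard library only. Every class in it (DuplicateMinority, Center, SignClassifier, TinySamplerAwarePipeline) is hypothetical and written for this example; none of them is part of imbalanced-learn or scikit-learn, and the real Pipeline handles many more cases.

```python
class DuplicateMinority:
    """Hypothetical toy sampler: repeats minority rows until classes balance."""

    def fit_resample(self, X, y):
        self.n_calls = getattr(self, "n_calls", 0) + 1
        counts = {label: y.count(label) for label in set(y)}
        minority = min(counts, key=counts.get)
        deficit = max(counts.values()) - counts[minority]
        pool = [x for x, label in zip(X, y) if label == minority]
        extra = [pool[i % len(pool)] for i in range(deficit)]
        return X + extra, y + [minority] * deficit


class Center:
    """Toy transformer: subtracts the mean learned from the training data."""

    def fit_transform(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self.transform(X)

    def transform(self, X):
        return [x - self.mean_ for x in X]


class SignClassifier:
    """Toy final estimator: predicts 1 for positive inputs, else 0."""

    def fit(self, X, y):
        self.n_train_rows_ = len(X)  # Recorded to show the effect of resampling
        return self

    def predict(self, X):
        return [1 if x > 0 else 0 for x in X]


class TinySamplerAwarePipeline:
    def __init__(self, steps):
        self.intermediate, self.final = list(steps[:-1]), steps[-1]

    def fit(self, X, y):
        for step in self.intermediate:
            if hasattr(step, "fit_resample"):
                X, y = step.fit_resample(X, y)  # Resample training data only
            else:
                X = step.fit_transform(X, y)
        self.final.fit(X, y)
        return self

    def predict(self, X):
        for step in self.intermediate:
            if hasattr(step, "fit_resample"):
                continue  # Samplers are skipped at prediction time
            X = step.transform(X)
        return self.final.predict(X)


X_train = [1.0, 2.0, 3.0, 4.0, 10.0, 12.0]
y_train = [0, 0, 0, 0, 1, 1]

sampler = DuplicateMinority()
pipe = TinySamplerAwarePipeline([sampler, Center(), SignClassifier()])
pipe.fit(X_train, y_train)

preds = pipe.predict([0.0, 20.0, 7.0])
print(pipe.final.n_train_rows_)  # 8: six original rows plus two duplicates
print(preds)                     # [0, 1, 1]: one prediction per input row
print(sampler.n_calls)           # 1: the sampler ran during fit only
```

The final estimator is trained on eight rows, while predict returns exactly three labels for three inputs, which is the asymmetry between fitting and prediction that the principle relies on.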

Related Pages

Implemented By

Implementation:Scikit_learn_contrib_Imbalanced_learn_Pipeline

Uses Heuristic

Heuristic:Scikit_learn_contrib_Imbalanced_learn_Sampling_Before_Split_Leakage
