Principle:Scikit learn contrib Imbalanced learn Sampler Aware Pipeline

From Leeroopedia


Knowledge Sources
Domains: Machine_Learning, Software_Engineering, Data_Pipeline
Last Updated: 2026-02-09 03:00 GMT

Overview

A pipeline abstraction that extends scikit-learn's Pipeline to support resampling steps alongside standard transformers and estimators, so that resampling is applied only to training data and cannot leak into held-out folds during cross-validation.

Description

Standard scikit-learn pipelines only support intermediate steps with fit/transform interfaces. Samplers such as SMOTE change the number of rows and expose fit_resample instead, so they cannot be used as steps in a standard pipeline. Applying the sampler outside the pipeline, to the full dataset before cross-validation, causes data leakage: test-fold data gets resampled along with the training data. A sampler-aware pipeline adds support for steps with a fit_resample interface, ensuring that resampling occurs only during fitting (training) and is skipped during prediction and evaluation.

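This matters for model evaluation on imbalanced data: if held-out folds are resampled, the reported scores no longer reflect performance on the original class distribution.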
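The leakage described above can be illustrated with a small, deterministic sketch. It is not SMOTE: it assumes a simpler oversampler that just duplicates minority rows, and the helper names (oversample, split, shared_ids) and the every-third-row split are hypothetical choices made for the illustration.

```python
# Toy illustration of resampling before vs. after the train/test split.
# Assumption: oversampling is done by duplicating minority rows (not SMOTE).

def oversample(rows):
    """Duplicate minority rows (label 1) until both classes have equal counts."""
    majority = [r for r in rows if r[1] == 0]
    minority = [r for r in rows if r[1] == 1]
    copies = [minority[i % len(minority)] for i in range(len(majority))]
    return majority + copies

def split(rows):
    """Deterministic split: every third row goes to the test set."""
    train = [r for i, r in enumerate(rows) if i % 3 != 2]
    test = [r for i, r in enumerate(rows) if i % 3 == 2]
    return train, test

def shared_ids(train, test):
    """Row ids that appear in both the training and the test set."""
    return {r[0] for r in train} & {r[0] for r in test}

# Each row is (row_id, label): eight majority rows and two minority rows.
rows = [(i, 0) for i in range(8)] + [(8, 1), (9, 1)]

# Leaky order: resample the full dataset, then split.
leaky_train, leaky_test = split(oversample(rows))
leaked = shared_ids(leaky_train, leaky_test)

# Safe order: split first, then resample the training part only.
train, test = split(rows)
clean = shared_ids(oversample(train), test)

print("ids shared when resampling first:", sorted(leaked))
print("ids shared when splitting first:", sorted(clean))
```

In the leaky order, copies of the same minority rows land on both sides of the split, so the test set is no longer independent of the training set. A sampler-aware pipeline enforces the safe order automatically inside each cross-validation fold.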

Usage

Use this principle whenever combining resampling steps with preprocessing transformers and a final estimator. Always prefer the imbalanced-learn Pipeline over manually applying samplers, to avoid data leakage in cross-validation.

Theoretical Basis

The pipeline processes steps sequentially:

  1. For each intermediate step:
    • If the step has fit_resample: call it during fit only (resample the training data)
    • Otherwise, if the step is a transformer: call fit_transform during fit and transform during predict
  2. The final step: call fit during training and predict during evaluation

# Abstract sampler-aware pipeline logic (NOT the real implementation)
def fit(intermediate_steps, final_estimator, X, y):
    for step in intermediate_steps:
        if hasattr(step, "fit_resample"):
            # Samplers may change the number of rows, so X and y are both replaced.
            X, y = step.fit_resample(X, y)  # Only during fit
        elif hasattr(step, "fit_transform"):
            X = step.fit_transform(X, y)
    final_estimator.fit(X, y)
    return final_estimator

def predict(intermediate_steps, final_estimator, X):
    for step in intermediate_steps:
        # Samplers expose fit_resample rather than transform, so they are skipped.
        if hasattr(step, "transform"):
            X = step.transform(X)  # No resampling during predict
    return final_estimator.predict(X)
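To make the logic above concrete, here is a minimal, self-contained sketch that runs on plain Python lists. All four classes (DuplicateMinority, Centre, ThresholdClassifier, MiniPipeline) are hypothetical stand-ins written for this illustration; they are not part of scikit-learn or imbalanced-learn and do not mirror the library's implementation.

```python
class DuplicateMinority:
    """Hypothetical sampler: repeats minority rows until the classes balance."""
    def fit_resample(self, X, y):
        counts = {label: y.count(label) for label in set(y)}
        minority = min(counts, key=counts.get)
        deficit = max(counts.values()) - counts[minority]
        pool = [x for x, label in zip(X, y) if label == minority]
        X_new, y_new = list(X), list(y)
        for i in range(deficit):
            X_new.append(pool[i % len(pool)])
            y_new.append(minority)
        return X_new, y_new

class Centre:
    """Hypothetical transformer: subtracts the mean learned during fit."""
    def fit_transform(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self.transform(X)

    def transform(self, X):
        return [x - self.mean_ for x in X]

class ThresholdClassifier:
    """Hypothetical estimator: cuts halfway between the two class means."""
    def fit(self, X, y):
        self.n_seen_ = len(X)  # Recorded to show how many rows reached fit
        pos = [x for x, label in zip(X, y) if label == 1]
        neg = [x for x, label in zip(X, y) if label == 0]
        self.cut_ = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def predict(self, X):
        return [int(x > self.cut_) for x in X]

class MiniPipeline:
    """Sketch of a sampler-aware pipeline; the last step is the estimator."""
    def __init__(self, steps):
        self.intermediate = steps[:-1]
        self.final = steps[-1]

    def fit(self, X, y):
        for step in self.intermediate:
            if hasattr(step, "fit_resample"):
                X, y = step.fit_resample(X, y)  # Resample training data only
            else:
                X = step.fit_transform(X, y)
        self.final.fit(X, y)
        return self

    def predict(self, X):
        for step in self.intermediate:
            if hasattr(step, "transform"):  # Samplers are skipped here
                X = step.transform(X)
        return self.final.predict(X)

X_train = [0.0, 1.0, 2.0, 3.0, 10.0]
y_train = [0, 0, 0, 0, 1]
pipe = MiniPipeline([DuplicateMinority(), Centre(), ThresholdClassifier()])
pipe.fit(X_train, y_train)
preds = pipe.predict([1.5, 9.0])

print("rows seen by the estimator during fit:", pipe.final.n_seen_)
print("predictions:", preds)
```

The estimator sees eight rows during fit (five original plus three duplicated minority rows), while predict returns exactly one label per input row, which is the asymmetry the principle relies on.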
