
Workflow: scikit-learn-contrib Imbalanced-learn SMOTE Resampling Pipeline

From Leeroopedia


Knowledge Sources
Domains Machine_Learning, Data_Preprocessing, Imbalanced_Learning
Last Updated 2026-02-09 03:00 GMT

Overview

End-to-end process for resampling imbalanced datasets using SMOTE variants within an imblearn Pipeline, training a classifier, and generating predictions on held-out data.

Description

This workflow addresses the most common use case of imbalanced-learn: building a classification pipeline that incorporates resampling as a preprocessing step. The imblearn Pipeline extends scikit-learn's Pipeline to support samplers (objects with a fit_resample method) alongside standard transformers and estimators. The pipeline ensures that resampling is applied only during fit (training), never during predict or score, which prevents data leakage during cross-validation.

The workflow covers dataset preparation, sampler selection from the SMOTE family (SMOTE, SMOTENC for mixed features, SMOTEN for categorical features, BorderlineSMOTE, SVMSMOTE, KMeansSMOTE, ADASYN), optional combination with under-sampling cleaning (SMOTEENN, SMOTETomek), pipeline construction, model training, and prediction.

Usage

Execute this workflow when you have a labeled dataset with significant class imbalance (minority class representing less than 20-30% of samples) and need to train a scikit-learn compatible classifier. The pipeline approach is preferred over manual resampling because it correctly handles cross-validation splits, preventing synthetic samples from leaking into validation folds.

Execution Steps

Step 1: Data Preparation

Load or generate the classification dataset and split it into training and testing sets using stratified sampling. Stratification ensures both splits preserve the original class distribution. Inspect the class distribution using a counter to understand the imbalance ratio.

Key considerations:

  • Always use stratified splits to preserve class proportions
  • Verify the imbalance ratio to inform sampler choice
  • Separate features (X) and target (y) as NumPy arrays or Pandas DataFrames
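A minimal sketch of this preparation step, using a synthetic dataset from make_classification (the specific sizes and weights here are illustrative assumptions, not values prescribed by the workflow):

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate a binary dataset with roughly a 10% minority class.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)

# stratify=y keeps the class proportions identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Inspect the imbalance ratio before choosing a sampler.
print("train:", Counter(y_train))
print("test: ", Counter(y_test))
```

Comparing the two Counter printouts confirms the stratification preserved the distribution.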

Step 2: Sampler Selection

Choose the appropriate SMOTE variant based on the nature of the features in the dataset. For purely numerical features, use standard SMOTE or its border-aware variants (BorderlineSMOTE, SVMSMOTE, KMeansSMOTE). For datasets with mixed numerical and categorical features, use SMOTENC with the categorical feature indices specified. For purely categorical data, use SMOTEN. If synthetic samples introduce noise near class boundaries, consider a combined method (SMOTEENN or SMOTETomek) that applies cleaning after over-sampling.

Selection guide:

  • Numerical features only: SMOTE, BorderlineSMOTE, SVMSMOTE, KMeansSMOTE, or ADASYN
  • Mixed numerical and categorical: SMOTENC (requires specifying categorical_features)
  • Categorical features only: SMOTEN
  • Noisy boundaries: SMOTEENN (aggressive cleaning) or SMOTETomek (mild cleaning)

Step 3: Pipeline Construction

Assemble the processing pipeline using imblearn.pipeline.make_pipeline (not sklearn's). Include optional preprocessing steps (scaling, imputation, PCA), the chosen sampler, and the final classifier. The imblearn Pipeline routes fit_resample calls for sampler steps and fit_transform calls for transformer steps during training.

Key considerations:

  • Must use imblearn's Pipeline or make_pipeline, not sklearn's, to support sampler steps
  • Samplers must appear before the final estimator
  • Multiple samplers and transformers can be combined in any order
  • The pipeline is fully compatible with sklearn's cross_validate and GridSearchCV

Step 4: Model Training

Fit the pipeline on the training data. During fit, each transformer calls fit_transform and each sampler calls fit_resample in sequence. The final estimator receives the transformed and resampled data. The sampler's sampling_strategy parameter controls the target class distribution after resampling (default is to equalize all classes).

Key considerations:

  • Resampling only happens during fit, never during predict
  • The sampling_strategy parameter accepts a float (ratio of minority to majority count), a dict (per-class counts), or a string ('minority', 'not minority', 'not majority', 'all', or the default 'auto')
  • Set random_state on the sampler for reproducibility

Step 5: Prediction and Cross-Validation

Use the trained pipeline to predict on unseen test data. During predict, sampler steps are bypassed entirely, and only transformer steps apply transform. For robust evaluation, use sklearn.model_selection.cross_validate with the imblearn Pipeline directly, which correctly resamples only within each training fold.

Key considerations:

  • Predictions are made on the original (non-resampled) test data
  • Cross-validation with the pipeline prevents data leakage from synthetic samples
  • Use balanced_accuracy or other imbalance-aware metrics for scoring

Execution Diagram

GitHub URL

Workflow Repository