
Workflow:Scikit learn contrib Imbalanced learn Ensemble Imbalanced Classification

From Leeroopedia


Knowledge Sources
Domains Machine_Learning, Ensemble_Methods, Imbalanced_Learning
Last Updated 2026-02-09 03:00 GMT

Overview

End-to-end process for training imbalance-aware ensemble classifiers that internally balance each bootstrap sample or boosting iteration, without requiring a separate resampling step.

Description

This workflow uses imbalanced-learn's ensemble classifiers as drop-in replacements for scikit-learn's ensemble methods. Instead of manually adding a resampling step to the pipeline, these classifiers integrate under-sampling directly into the ensemble learning process. Each base learner in the ensemble is trained on a balanced subset, producing diverse learners that collectively reduce majority-class bias.

Four ensemble classifiers are available: BalancedRandomForestClassifier (per-tree under-sampling in a random forest), BalancedBaggingClassifier (per-bag resampling with any base estimator), EasyEnsembleClassifier (bag of AdaBoost classifiers on balanced subsets), and RUSBoostClassifier (AdaBoost with per-iteration random under-sampling). These classifiers follow the scikit-learn estimator API and work with standard preprocessing pipelines, cross-validation, and grid search.

Usage

Execute this workflow when you need a self-contained classifier that handles class imbalance without explicit resampling steps. This approach is particularly effective for tree-based models and when ensemble diversity from different balanced subsets improves generalization. It is simpler than a full resampling pipeline and often outperforms single-sampler approaches because each base learner sees a different balanced view of the data.

Execution Steps

Step 1: Data Loading and Splitting

Load the imbalanced dataset and split it into training and testing sets with stratified sampling. Inspect the class distribution to understand the severity of the imbalance. For benchmark experiments, use imblearn.datasets.fetch_datasets to load standard imbalanced datasets from the Zenodo repository.

Key considerations:

  • Use stratified train/test split to preserve class proportions
  • Check imbalance ratio to calibrate ensemble parameters
  • fetch_datasets returns datasets in a dict keyed by dataset name
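The loading and splitting step might look like the following sketch. A synthetic dataset stands in for a real one here (`fetch_datasets()` downloads benchmark datasets from Zenodo, so it is omitted to keep the example offline); the sizes and ratios are illustrative:

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in for a real imbalanced dataset; ~5% minority class.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)

# Stratified split preserves the class ratio in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Inspect class distribution and imbalance ratio to calibrate the ensemble.
train_counts, test_counts = Counter(y_train), Counter(y_test)
imbalance_ratio = train_counts[0] / train_counts[1]
print(train_counts, test_counts, round(imbalance_ratio, 1))
```

With `fetch_datasets`, the equivalent would be indexing the returned dict by dataset name and reading its `data` and `target` attributes.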

Step 2: Preprocessing

Apply feature preprocessing appropriate to the chosen ensemble classifier. Tree-based ensembles (BalancedRandomForest, BalancedBagging with tree estimators) require only ordinal encoding for categorical features and simple imputation. For gradient boosting base estimators within BalancedBagging, use HistGradientBoostingClassifier, which handles missing values natively.

Key considerations:

  • Tree-based classifiers do not need feature scaling
  • Use ColumnTransformer to handle mixed feature types
  • Preprocessing goes in a standard sklearn Pipeline wrapping the balanced classifier

Step 3: Classifier Selection and Configuration

Choose the ensemble classifier based on the use case. BalancedRandomForestClassifier provides the best balance of speed and performance for most problems. BalancedBaggingClassifier offers flexibility by accepting any base estimator. EasyEnsembleClassifier and RUSBoostClassifier provide boosting-based alternatives.

Selection guide:

  • General purpose: BalancedRandomForestClassifier with sampling_strategy='all' and replacement=True
  • Custom base estimator needed: BalancedBaggingClassifier wrapping the desired estimator
  • Boosting approach: EasyEnsembleClassifier (bag of AdaBoost) or RUSBoostClassifier (RUS within AdaBoost)
  • Configure n_estimators to control ensemble size and performance-computation tradeoff

Step 4: Training

Fit the ensemble classifier on the training data. Each base learner internally draws a balanced bootstrap sample before training. The sampling_strategy parameter on the classifier controls how balancing is performed. Training is parallelizable via the n_jobs parameter for bagging and forest classifiers.

Key considerations:

  • Set random_state for reproducibility
  • Use n_jobs for parallel training of base learners
  • As of imbalanced-learn 0.13, BalancedRandomForestClassifier defaults to replacement=True and bootstrap=False for balanced sampling; set them explicitly on older versions

Step 5: Evaluation

Evaluate the trained ensemble on the test set using imbalance-aware metrics. Use balanced_accuracy_score from scikit-learn and geometric_mean_score from imblearn. Visualize results with confusion matrix displays. Compare balanced ensemble performance against unbalanced baselines to demonstrate the effect of internal resampling.

Key considerations:

  • Use balanced_accuracy_score rather than plain accuracy for imbalanced evaluation
  • geometric_mean_score captures per-class sensitivity balance
  • ConfusionMatrixDisplay helps visualize per-class prediction quality
  • Compare against sklearn's standard ensemble equivalents as baselines

Execution Diagram

GitHub URL

Workflow Repository