
Workflow:Scikit learn contrib Imbalanced learn Ensemble Imbalanced Classification

From Leeroopedia


Knowledge Sources
Domains Machine_Learning, Ensemble_Methods, Imbalanced_Learning
Last Updated 2026-02-09 03:00 GMT

Overview

End-to-end process for training imbalance-aware ensemble classifiers that internally balance each bootstrap sample or boosting iteration, without requiring a separate resampling step.

Description

This workflow uses imbalanced-learn's ensemble classifiers as drop-in replacements for scikit-learn's ensemble methods. Instead of manually adding a resampling step to the pipeline, these classifiers integrate under-sampling directly into the ensemble learning process. Each base learner in the ensemble is trained on a balanced subset, producing diverse learners that collectively reduce majority-class bias.

Four ensemble classifiers are available: BalancedRandomForestClassifier (per-tree under-sampling in a random forest), BalancedBaggingClassifier (per-bag resampling with any base estimator), EasyEnsembleClassifier (bag of AdaBoost classifiers on balanced subsets), and RUSBoostClassifier (AdaBoost with per-iteration random under-sampling). These classifiers follow the scikit-learn estimator API and work with standard preprocessing pipelines, cross-validation, and grid search.

Usage

Execute this workflow when you need a self-contained classifier that handles class imbalance without explicit resampling steps. This approach is particularly effective for tree-based models and when ensemble diversity from different balanced subsets improves generalization. It is simpler than a full resampling pipeline and often outperforms single-sampler approaches because each base learner sees a different balanced view of the data.

Execution Steps

Step 1: Data Loading and Splitting

Load the imbalanced dataset and split it into training and testing sets with stratified sampling. Inspect the class distribution to understand the severity of the imbalance. For benchmark experiments, use imblearn.datasets.fetch_datasets to load standard imbalanced datasets from the Zenodo repository.

Key considerations:

  • Use stratified train/test split to preserve class proportions
  • Check imbalance ratio to calibrate ensemble parameters
  • fetch_datasets returns datasets in a dict keyed by dataset name
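The loading and splitting step might look like the following sketch. A synthetic dataset stands in for a real one here (`fetch_datasets()` downloads benchmark datasets from Zenodo, so it is omitted to keep the example offline); the sizes and ratios are illustrative:

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in for a real imbalanced dataset; ~5% minority class.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)

# Stratified split preserves the class ratio in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Inspect class distribution and imbalance ratio to calibrate the ensemble.
train_counts, test_counts = Counter(y_train), Counter(y_test)
imbalance_ratio = train_counts[0] / train_counts[1]
print(train_counts, test_counts, round(imbalance_ratio, 1))
```

With `fetch_datasets`, the equivalent would be indexing the returned dict by dataset name and reading its `data` and `target` attributes.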

Step 2: Preprocessing

Apply feature preprocessing appropriate to the chosen ensemble classifier. Tree-based ensembles (BalancedRandomForest, BalancedBagging with tree estimators) require only ordinal encoding for categorical features and simple imputation. For gradient boosting base estimators within BalancedBagging, use HistGradientBoostingClassifier, which handles missing values natively.

Key considerations:

  • Tree-based classifiers do not need feature scaling
  • Use ColumnTransformer to handle mixed feature types
  • Preprocessing goes in a standard sklearn Pipeline wrapping the balanced classifier

Step 3: Classifier Selection and Configuration

Choose the ensemble classifier based on the use case. BalancedRandomForestClassifier provides the best balance of speed and performance for most problems. BalancedBaggingClassifier offers flexibility by accepting any base estimator. EasyEnsembleClassifier and RUSBoostClassifier provide boosting-based alternatives.

Selection guide:

  • General purpose: BalancedRandomForestClassifier with sampling_strategy='all' and replacement=True
  • Custom base estimator needed: BalancedBaggingClassifier wrapping the desired estimator
  • Boosting approach: EasyEnsembleClassifier (bag of AdaBoost) or RUSBoostClassifier (RUS within AdaBoost)
  • Configure n_estimators to control ensemble size and performance-computation tradeoff

Step 4: Training

Fit the ensemble classifier on the training data. Each base learner internally draws a balanced bootstrap sample before training. The sampling_strategy parameter on the classifier controls how balancing is performed. Training is parallelizable via the n_jobs parameter for bagging and forest classifiers.

Key considerations:

  • Set random_state for reproducibility
  • Use n_jobs for parallel training of base learners
  • As of imbalanced-learn 0.13, BalancedRandomForestClassifier defaults to replacement=True and bootstrap=False for balanced sampling; set them explicitly on older versions

Step 5: Evaluation

Evaluate the trained ensemble on the test set using imbalance-aware metrics. Use balanced_accuracy_score from scikit-learn and geometric_mean_score from imblearn. Visualize results with confusion matrix displays. Compare balanced ensemble performance against unbalanced baselines to demonstrate the effect of internal resampling.

Key considerations:

  • Use balanced_accuracy_score rather than plain accuracy for imbalanced evaluation
  • geometric_mean_score captures per-class sensitivity balance
  • ConfusionMatrixDisplay helps visualize per-class prediction quality
  • Compare against sklearn's standard ensemble equivalents as baselines

Execution Diagram

GitHub URL

Workflow Repository