Workflow:Scikit learn contrib Imbalanced learn Balanced Deep Learning Training

From Leeroopedia



Knowledge Sources
Domains Deep_Learning, Imbalanced_Learning, Keras
Last Updated 2026-02-09 03:00 GMT

Overview

End-to-end process for training Keras neural networks on imbalanced datasets using balanced mini-batch generators, which draw every training batch from a class-balanced subset so that batches carry a roughly equal class distribution.

Description

This workflow integrates imbalanced-learn with Keras (TensorFlow) to address class imbalance during deep learning training. Instead of materializing a resampled copy of the dataset upfront, the BalancedBatchGenerator class serves balanced mini-batches on the fly during training. A sampler (RandomUnderSampler by default) under-samples the majority class to match the minority class, and batches are sliced from the shuffled, class-balanced sample indices, so the network sees roughly equal representation of all classes in each gradient update. An alternative balanced_batch_generator function in the TensorFlow submodule provides a lower-level generator interface.
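The idea can be illustrated without Keras. The following is a minimal NumPy sketch of under-sampling followed by batch slicing; the helper name balanced_batches is hypothetical and this is not the library's implementation, only an illustration of the technique under the assumption that every class is reduced to the size of the smallest one.

```python
import numpy as np


def balanced_batches(y, batch_size, seed=0):
    """Yield index arrays sliced from a class-balanced subset of y.

    Hypothetical helper for illustration; not part of imbalanced-learn.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()  # size of the smallest class
    # Under-sample every class down to n_min indices, without replacement.
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)  # mix classes so each slice is roughly balanced
    # Serve full batches only; the incomplete trailing batch is dropped.
    for start in range(0, len(keep) - batch_size + 1, batch_size):
        yield keep[start:start + batch_size]


# Toy labels: 900 majority samples, 100 minority samples.
y = np.array([0] * 900 + [1] * 100)
batches = list(balanced_batches(y, batch_size=32))
```

Only indices are stored, so the original feature matrix is never duplicated. Balance holds exactly over the retained subset and approximately within any single batch.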

The approach can improve training efficiency (fewer gradient steps spent on redundant majority-class samples) and model quality (reduced bias toward predicting the majority class), while remaining compatible with the standard Keras model.fit API.

Usage

Use this workflow when training a Keras/TensorFlow neural network on a binary or multi-class classification task with significant class imbalance. It is preferable to whole-dataset over-sampling for large datasets, where duplicating minority samples would substantially increase memory usage and training time. The balanced batch approach is well suited to tabular data with simple feed-forward architectures.

Execution Steps

Step 1: Data Loading and Preprocessing

Load the imbalanced dataset and apply feature preprocessing. Use scikit-learn's ColumnTransformer to handle numerical scaling and categorical encoding. Fit the preprocessor on the training data only, then transform both the training and validation sets with it. Convert the processed features to dense arrays so that the batches can be fed directly to Keras dense layers.

Key considerations:

  • Fit preprocessing only on training data to prevent leakage
  • Pass dense NumPy arrays where memory allows; BalancedBatchGenerator also accepts sparse input and returns dense batches unless keep_sparse=True
  • Use StratifiedKFold for cross-validation to preserve class distribution in each fold
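The preprocessing step above might look like the following sketch. The toy DataFrame, its column names (age, income, region), and the 10% positive rate are hypothetical stand-ins for a real dataset; the scikit-learn calls are standard.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns, one categorical, ~10% positives.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.lognormal(10, 1, n),
    "region": rng.choice(["north", "south", "east"], n),
})
y = (rng.random(n) < 0.1).astype(int)

# Stratify so both splits keep the original class ratio.
X_train, X_val, y_train, y_val = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=0
)

preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), ["age", "income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_train_t = preprocessor.fit_transform(X_train)  # fit on training data only
X_val_t = preprocessor.transform(X_val)          # reuse the fitted statistics
```

Setting sparse_threshold=0 is one way to obtain a dense result regardless of how sparse the one-hot columns are. For very wide encodings, keeping the matrix sparse may be the better trade-off.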

Step 2: Neural Network Architecture

Define a Keras Sequential model with fully connected layers sized for the feature dimensionality. Include regularization layers (Dropout, BatchNormalization) to limit overfitting, since under-sampling ties the effective training set size to the minority-class count. Compile the model with binary_crossentropy for binary targets or, for multi-class targets, categorical_crossentropy (one-hot labels) or sparse_categorical_crossentropy (integer labels), together with an appropriate optimizer.

Key considerations:

  • Input dimension must match the preprocessed feature count
  • Use Dropout and BatchNormalization for regularization
  • The model architecture is independent of the balancing strategy
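A small binary classifier along these lines could be defined as follows. The layer widths, dropout rate, and the build_model helper are illustrative assumptions, not values taken from the workflow.

```python
from tensorflow import keras


def build_model(n_features):
    """Build a small feed-forward binary classifier (illustrative sizes)."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),  # must match the preprocessed width
        keras.layers.Dense(64, activation="relu"),
        keras.layers.BatchNormalization(),
        keras.layers.Dropout(0.3),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dropout(0.3),
        keras.layers.Dense(1, activation="sigmoid"),  # P(class == 1)
    ])
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=[keras.metrics.AUC(name="auc")],
    )
    return model


model = build_model(n_features=5)
```

For a multi-class task, the last layer would instead have one unit per class with a softmax activation, paired with the matching cross-entropy loss.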

Step 3: Balanced Batch Generator Setup

Instantiate the BalancedBatchGenerator with the preprocessed training features, labels, and desired batch size. The generator implements the Keras PyDataset (or Sequence) interface, so it can be passed directly to model.fit. Internally, it uses RandomUnderSampler by default to select a class-balanced set of sample indices, then serves batches as slices of those indices. A custom sampler can be provided via the sampler parameter, provided it exposes the selected indices through a sample_indices_ attribute.

Key considerations:

  • Set batch_size to control samples per balanced mini-batch
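  • The generator shuffles the balanced sample indices at construction; verify against your installed imbalanced-learn version before relying on a fresh resample every epoch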
  • Set random_state for reproducibility
  • Alternatively, use imblearn.tensorflow.balanced_batch_generator for a raw generator function

Step 4: Model Training

Train the neural network by passing the BalancedBatchGenerator to model.fit as the training data source. Each epoch iterates over the under-sampled, class-balanced subset rather than the full training set, so it is shorter than an epoch over all samples. Choose the number of epochs by monitoring convergence on validation data. Training with balanced batches often needs more epochs than standard training and tends to improve minority-class recall, often at some cost in precision.

Key considerations:

  • Pass the generator as the first argument to model.fit (it takes the place of x; do not pass y)
  • Monitor validation loss to detect overfitting
  • Balanced batches may need more epochs since each epoch sees fewer total samples

Step 5: Evaluation

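Evaluate the trained model on the unbalanced test set using predicted probabilities and the ROC-AUC score. ROC-AUC depends only on how the scores rank the samples, so it is unaffected by the shifted class prior. The scores themselves are not calibrated to the original class frequencies, because a model trained on balanced batches tends to overestimate the minority-class probability; choose the decision threshold on validation data for the desired precision-recall tradeoff, or recalibrate if true probabilities are needed. Compare against a baseline model trained with standard (imbalanced) mini-batches to demonstrate the effect of balanced batch training.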

Key considerations:

  • Evaluate on the original imbalanced test distribution (no resampling at test time)
  • Use ROC-AUC for probability-based evaluation
  • Compare training time and performance against imbalanced baseline
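The evaluation can be sketched with scikit-learn alone. The scores below are synthetic stand-ins for what model.predict(X_test).ravel() would return, and the precision target of 0.5 is an arbitrary example.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_auc_score

# Hypothetical imbalanced test labels (~10% positives).
rng = np.random.default_rng(0)
y_test = (rng.random(2000) < 0.1).astype(int)

# Stand-in for model.predict(X_test).ravel(): noisy scores, higher for positives.
scores = np.clip(0.3 + 0.3 * y_test + rng.normal(0, 0.15, 2000), 0.0, 1.0)

# Rank-based metric, computed on the original (unresampled) distribution.
auc = roc_auc_score(y_test, scores)

# Pick the lowest threshold that reaches the target precision, which keeps
# recall as high as possible among the qualifying thresholds.
precision, recall, thresholds = precision_recall_curve(y_test, scores)
meets_target = precision[:-1] >= 0.5  # precision has one more entry than thresholds
threshold = thresholds[meets_target][0] if meets_target.any() else None
```

Running the same code on scores from the imbalanced baseline gives a like-for-like comparison. In practice the threshold should be chosen on validation data and then applied unchanged to the test set.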
