Principle:Online ml River Imbalanced Learning
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| Machine Learning Imbalanced Data Classification | Online_Learning, Imbalanced_Learning, Classification | 2026-02-08 18:00 GMT |
Overview
Imbalanced learning addresses the challenge of training classifiers when class frequencies are highly skewed. In online learning, sampling strategies must operate on individual instances as they arrive, adjusting the effective class distribution seen by the learner without access to the full dataset.
Description
Class imbalance is pervasive in real-world applications: fraud detection (< 1% fraud), medical diagnosis (rare diseases), anomaly detection, and many others. When one class significantly outnumbers others, standard classifiers tend to be biased toward the majority class because minimizing overall error favors predicting the majority class.
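The majority-class bias described above can be made concrete with a small sketch. The labels are hypothetical (1,000 transactions, 1% fraud), and the "classifier" is a deliberately degenerate one that always predicts the majority class:

```python
# Hypothetical data: 1,000 transactions, 1% of which are fraud (label 1).
labels = [1] * 10 + [0] * 990

# A degenerate "classifier" that always predicts the majority class (0).
predictions = [0] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall on the minority class: fraction of true frauds that were caught.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
minority_recall = true_positives / sum(y == 1 for y in labels)

print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}")
# prints: accuracy=0.99, minority recall=0.00
```

A model can therefore score 99% accuracy while never detecting a single minority instance, which is why minimizing overall error alone favors the majority class.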
In the batch setting, common remedies include oversampling the minority class (for example with SMOTE), undersampling the majority class, or adjusting class weights. In the online learning setting, these strategies must be adapted to work instance by instance:
- Random oversampling/undersampling: For each incoming instance, a sampling strategy decides whether to pass it to the learner, potentially multiple times (oversampling) or not at all (undersampling). The sampling probability depends on the instance's class and the desired target distribution.
- Hard sampling: Instead of random selection, instances are sampled based on their difficulty. Instances that the model currently misclassifies (hard examples) are preferentially included in training, while easy examples may be skipped. This focuses learning capacity on the decision boundary.
- Distribution-aware sampling (Chebyshev): Uses Chebyshev's inequality to determine sampling rates that bring the effective class distribution closer to a target distribution. This provides a principled, distribution-free approach to rebalancing that does not assume any particular form for the class distribution.
Usage
Use imbalanced learning strategies when:
- The class distribution in your data stream is significantly skewed.
- Standard accuracy is a misleading metric due to class imbalance.
- You want to improve minority class recall without retraining from scratch.
- You need an online-compatible approach to class rebalancing.
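To judge whether a stream is skewed enough for these criteria to apply, the class proportions can be tracked incrementally. The following is a minimal sketch; `ClassBalanceMonitor` is a hypothetical helper written for illustration, not part of any library:

```python
from collections import Counter


class ClassBalanceMonitor:
    """Tracks class frequencies one label at a time (hypothetical helper)."""

    def __init__(self):
        self.counts = Counter()

    def update(self, y):
        self.counts[y] += 1

    def proportions(self):
        total = sum(self.counts.values())
        return {c: n / total for c, n in self.counts.items()}

    def imbalance_ratio(self):
        """Majority count divided by minority count (1.0 means balanced)."""
        if not self.counts:
            return 1.0
        return max(self.counts.values()) / min(self.counts.values())


monitor = ClassBalanceMonitor()
for y in [0] * 95 + [1] * 5:  # hypothetical labels from a skewed stream
    monitor.update(y)

print(monitor.proportions())      # {0: 0.95, 1: 0.05}
print(monitor.imbalance_ratio())  # 19.0
```

The monitor only knows about classes it has already seen, so early in a stream the ratio can understate the true imbalance.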
Theoretical Basis
Random Sampling
Given a desired sampling ratio r for class c:
    For each instance (x, y):
        if y == minority_class:
            k = sample_weight(r)      # e.g., k ~ Poisson(r) for oversampling
            Train model on (x, y) with weight k
        else:
            k = sample_weight(1/r)    # undersampling probability
            if k > 0: Train model on (x, y) with weight k
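The pseudocode above can be sketched in plain Python. This is one possible reading, not a reference implementation: it assumes a binary problem, r >= 1, Poisson(r) repetitions for the minority class, and a keep probability of 1/r for the majority class. The function names and the simulated stream are hypothetical, and the learner is replaced by counters so the effect on the class distribution is visible:

```python
import math
import random


def poisson(rate, rng):
    """Draw k ~ Poisson(rate) with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def training_weight(y, minority_class, r, rng):
    """Number of times to train on an instance with label y (assumes r >= 1)."""
    if y == minority_class:
        return poisson(r, rng)  # oversample: r copies on average
    # Undersample: keep a majority instance with probability 1/r.
    return 1 if rng.random() < 1.0 / r else 0


rng = random.Random(42)
raw = {0: 0, 1: 0}        # class counts in the incoming stream
effective = {0: 0, 1: 0}  # class counts as seen by the learner

for _ in range(10_000):
    y = 1 if rng.random() < 0.05 else 0  # hypothetical stream, about 5% minority
    k = training_weight(y, minority_class=1, r=4.0, rng=rng)
    raw[y] += 1
    effective[y] += k  # a real learner would train k times here, or once with weight k

print("raw:", raw)
print("effective:", effective)
```

With r = 4 and roughly 5% minority instances, the expected effective counts are about 2,000 minority versus about 2,375 majority, so the learner sees a nearly balanced stream even though the raw stream is heavily skewed.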
Hard Sampling
For each instance (x, y):
    y_pred = model.predict(x)
    if y_pred != y:
        # Hard example: always include, possibly with higher weight
        Train model on (x, y) with weight w_hard
    else:
        # Easy example: include with probability p_easy
        if random() < p_easy:
            Train model on (x, y)
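A runnable sketch of the hard-sampling loop follows. The `Perceptron` class is a toy stand-in for any incremental model exposing `predict` and `learn`; it, the `hard_sampling_step` function, the parameter values, and the simulated stream are all assumptions made for illustration:

```python
import random


class Perceptron:
    """Toy online linear classifier (hypothetical stand-in for any incremental model)."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        score = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if score > 0 else 0

    def learn(self, x, y, weight=1.0):
        # Perceptron-style update scaled by the instance weight.
        step = self.lr * weight * (y - self.predict(x))
        self.w = [wi + step * xi for wi, xi in zip(self.w, x)]
        self.b += step


def hard_sampling_step(model, x, y, w_hard, p_easy, rng):
    """Apply one hard-sampling step; return True if the instance was used."""
    if model.predict(x) != y:
        model.learn(x, y, weight=w_hard)  # hard example: always train
        return True
    if rng.random() < p_easy:
        model.learn(x, y)                 # easy example: train occasionally
        return True
    return False


rng = random.Random(0)
model = Perceptron(n_features=2)
n, used = 2000, 0
for _ in range(n):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    y = 1 if x[0] + x[1] > 1.2 else 0  # hypothetical rare positive class (about 8%)
    used += hard_sampling_step(model, x, y, w_hard=2.0, p_easy=0.1, rng=rng)

print(f"trained on {used} of {n} instances")
```

One caveat of this toy model: a perceptron makes a zero update on a correctly classified instance, so training on easy examples has no effect here. With a learner that updates on every instance, such as logistic regression, `p_easy` controls how much the easy examples still contribute.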
Chebyshev Sampling
Chebyshev's inequality states that for any distribution with mean mu and finite variance sigma^2, and any k > 0:
P(|X - mu| >= k * sigma) <= 1 / k^2
Because this bound holds for any distribution with finite variance, it can be used to set sampling rates without modeling the data: instances from underrepresented classes are assigned higher sampling probabilities, which grow with the gap between the current class distribution and the target distribution and are scaled by the Chebyshev bound.
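The text does not pin down an exact sampling rule, so the following is only a sketch under stated assumptions. It assumes each instance carries a numeric value v (for example a target or a score), estimates mu and sigma online with Welford's algorithm, and sets the keep probability to 1 - min(1, 1/t^2), where t = |v - mu| / sigma. Values that the bound marks as rare are kept almost always, while common values are mostly dropped. `RunningStats`, `keep_probability`, and `p_floor` are hypothetical names:

```python
import math
import random


class RunningStats:
    """Welford's online algorithm for the mean and sample variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, v):
        self.n += 1
        delta = v - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (v - self.mean)

    @property
    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def keep_probability(v, stats, p_floor=0.05):
    """Probability of training on an instance whose value is v (assumed rule)."""
    if stats.std == 0.0:
        return 1.0  # no spread estimate yet: keep everything
    t = abs(v - stats.mean) / stats.std
    # Chebyshev: P(|X - mu| >= t * sigma) <= 1 / t^2, capped at 1.
    bound = min(1.0, 1.0 / (t * t)) if t > 0 else 1.0
    return max(p_floor, 1.0 - bound)


rng = random.Random(7)
stats = RunningStats()
seen = {"common": 0, "rare": 0}
kept = {"common": 0, "rare": 0}
for _ in range(5000):
    v = rng.gauss(0.0, 1.0)  # hypothetical stream of values
    group = "rare" if abs(v) > 2.0 else "common"
    seen[group] += 1
    if rng.random() < keep_probability(v, stats):
        kept[group] += 1
    stats.update(v)

for group in seen:
    print(group, f"kept {kept[group]} of {seen[group]}")
```

The `p_floor` parameter is a design choice of this sketch: without it, values within one standard deviation of the mean would never be trained on, which could starve the learner of typical examples.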