Principle:Online ml River Imbalanced Learning

From Leeroopedia


Knowledge Sources: Machine Learning Imbalanced Data Classification
Domains: Online_Learning, Imbalanced_Learning, Classification
Last Updated: 2026-02-08 18:00 GMT

Overview

Imbalanced learning addresses the challenge of training classifiers when class frequencies are highly skewed. In online learning, sampling strategies must operate on individual instances as they arrive, adjusting the effective class distribution seen by the learner without access to the full dataset.

Description

Class imbalance is pervasive in real-world applications: fraud detection (< 1% fraud), medical diagnosis (rare diseases), anomaly detection, and many others. When one class significantly outnumbers others, standard classifiers tend to be biased toward the majority class because minimizing overall error favors predicting the majority class.

In the batch setting, common remedies include oversampling the minority class (for example with synthetic methods such as SMOTE), undersampling the majority class, or adjusting class weights. These remedies assume the full dataset is available, so in the online learning setting they must be adapted to work instance by instance:

Random oversampling/undersampling: For each incoming instance, a sampling strategy decides whether to pass it to the learner, potentially multiple times (oversampling) or not at all (undersampling). The sampling probability depends on the instance's class and the desired target distribution.

Hard sampling: Instead of random selection, instances are sampled based on their difficulty. Instances that the model currently misclassifies (hard examples) are preferentially included in training, while easy examples may be skipped. This focuses learning capacity on the decision boundary.

Distribution-aware sampling (Chebyshev): Uses Chebyshev's inequality to determine sampling rates that bring the effective class distribution closer to a target distribution. This provides a principled, distribution-free approach to rebalancing that does not assume any particular form for the class distribution.

Usage

Use imbalanced learning strategies when:

  • The class distribution in your data stream is significantly skewed.
  • Standard accuracy is a misleading metric due to class imbalance.
  • You want to improve minority class recall without retraining from scratch.
  • You need an online-compatible approach to class rebalancing.
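The second point can be made concrete with a small sketch. The data here is hypothetical (990 legitimate and 10 fraudulent instances), and the "classifier" simply predicts the majority class every time; the helper names are illustrative, not part of any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true positives that were predicted as positive."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not predictions_for_positives:
        return 0.0
    hits = sum(p == positive for p in predictions_for_positives)
    return hits / len(predictions_for_positives)


# Hypothetical stream: 990 legitimate (0) and 10 fraudulent (1) instances.
y_true = [0] * 990 + [1] * 10
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 1000

print(accuracy(y_true, y_pred))  # 0.99
print(recall(y_true, y_pred))    # 0.0
```

Accuracy looks excellent while every minority instance is missed, which is why per-class metrics such as recall are the more informative signal on skewed streams.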

Theoretical Basis

Random Sampling

Given a desired sampling ratio r >= 1 for the minority class:
For each instance (x, y):
    if y == minority_class:
        k ~ Poisson(r)              # oversampling: r copies on average
        if k > 0: Train model on (x, y) with weight k
    else:
        keep ~ Bernoulli(1/r)       # undersampling: keep with probability 1/r
        if keep: Train model on (x, y)
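A minimal runnable sketch of the random sampling scheme above, using only the Python standard library. It assumes a binary problem with a known minority label and r >= 1, and it reads the majority branch as "keep with probability 1/r"; the names `poisson` and `resample` are illustrative and do not correspond to any library API.

```python
import math
import random


def poisson(rate, rng):
    """Draw from Poisson(rate) with Knuth's method (adequate for small rates)."""
    threshold = math.exp(-rate)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def resample(stream, minority_class, r, rng):
    """Yield (x, y, weight) triples after per-instance rebalancing."""
    for x, y in stream:
        if y == minority_class:
            k = poisson(r, rng)          # oversample: r copies on average
            if k > 0:
                yield x, y, k
        elif rng.random() < 1.0 / r:     # undersample: keep with probability 1/r
            yield x, y, 1


# Hypothetical stream: 1000 instances, 5% of them in the minority class (label 1).
stream = [([float(i)], 1 if i % 20 == 0 else 0) for i in range(1000)]
rng = random.Random(42)

minority_weight = majority_weight = 0
for x, y, w in resample(stream, minority_class=1, r=3.0, rng=rng):
    if y == 1:
        minority_weight += w
    else:
        majority_weight += w

print(minority_weight, majority_weight)
```

Applying both branches with the same r shifts the effective minority-to-majority ratio by roughly a factor of r squared; in practice one might oversample only, undersample only, or derive separate rates from a target distribution.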

Hard Sampling

For each instance (x, y):
    y_pred = model.predict(x)
    if y_pred != y:
        # Hard example: always include, possibly with higher weight
        Train model on (x, y) with weight w_hard
    else:
        # Easy example: include with probability p_easy
        if random() < p_easy:
            Train model on (x, y)
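The hard sampling loop above can be sketched in runnable form as follows. The `Perceptron` class is a hypothetical stand-in for any online classifier exposing `predict` and `learn`, and `w_hard` and `p_easy` are the free parameters from the pseudocode; none of these names are taken from a specific library.

```python
import random


class Perceptron:
    """Tiny online linear classifier for labels in {0, 1} (illustrative only)."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        score = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if score > 0 else 0

    def learn(self, x, y, weight=1.0):
        # Perceptron rule: the update is zero when the prediction is correct.
        error = y - self.predict(x)
        step = self.lr * weight * error
        self.w = [wi + step * xi for wi, xi in zip(self.w, x)]
        self.b += step


def hard_sampling_step(model, x, y, w_hard, p_easy, rng):
    """Train on one instance; return True if it was passed to the model."""
    if model.predict(x) != y:
        model.learn(x, y, weight=w_hard)   # hard example: always train
        return True
    if rng.random() < p_easy:
        model.learn(x, y)                  # easy example: train occasionally
        return True
    return False


rng = random.Random(0)
model = Perceptron(n_features=1)
used = 0
for i in range(500):
    y = 1 if i % 25 == 0 else 0                   # 4% minority class
    x = [rng.gauss(2.0 if y else -2.0, 1.0)]      # synthetic 1-D feature
    used += hard_sampling_step(model, x, y, w_hard=2.0, p_easy=0.1, rng=rng)

print(f"trained on {used} of 500 instances")
```

With a perceptron the easy-example update happens to be a no-op, since the error term is zero; learners that update on every instance (for example, logistic regression) would still gain from the occasional easy example.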

Chebyshev Sampling

Chebyshev's inequality states that for any random variable X with finite mean μ and variance σ², and any k > 0:

P(|X - μ| ≥ kσ) ≤ 1 / k²

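This bound is distribution-free, so it can drive sampling rates without a parametric model of the data. An observation lying k standard deviations from the running mean has a tail probability of at most 1/k², so a small bound marks the observation as rare. Instances from underrepresented regions are assigned higher sampling probabilities and common ones lower, which moves the effective distribution seen by the learner toward the target distribution.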
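As a rough illustration of how the bound could drive a sampling decision, the sketch below tracks a running mean and variance of a numeric quantity and keeps rare values more often than common ones. This is one possible mapping under stated assumptions, not a description of any library's implementation: the `floor` parameter, the `1 - bound` rule, and all function names are hypothetical.

```python
import math
import random


class RunningStats:
    """Welford's online estimate of mean and sample variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean

    def update(self, value):
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    @property
    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def chebyshev_bound(value, stats):
    """Upper bound 1/k^2 on P(|X - mean| >= k * std), capped at 1."""
    if stats.std == 0.0:
        return 1.0
    k = abs(value - stats.mean) / stats.std
    return 1.0 if k <= 1.0 else 1.0 / (k * k)


def keep_probability(value, stats, floor=0.05):
    """Assumed rule: rare values (small bound) are kept more often."""
    return max(floor, 1.0 - chebyshev_bound(value, stats))


rng = random.Random(7)
stats = RunningStats()
kept = 0
for _ in range(2000):
    value = rng.gauss(0.0, 1.0)
    stats.update(value)
    if rng.random() < keep_probability(value, stats):
        kept += 1

print(f"kept {kept} of 2000 values")
```

Values within one standard deviation of the mean are kept only at the floor rate, while values in the tails are kept with probability approaching 1, so the retained sample over-represents the rare region.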
