Implementation:Online ml River Imblearn HardSampling

Knowledge Sources	Online_ml_River
Domains	Online_Learning, Imbalanced_Learning, Active_Learning, Curriculum_Learning
Last Updated	2026-02-08 16:00 GMT

Overview

Hard Sampling maintains a buffer of difficult-to-predict samples and probabilistically retrains on them, helping models learn from their mistakes.

Description

This method keeps a fixed-size buffer of the hardest samples seen so far, where hardness is measured by prediction loss. When a new sample arrives, its loss is computed. While the buffer is not yet full, the sample is simply added; once the buffer is full, the sample replaces the lowest-loss entry only if its own loss is higher. At each step, with probability p the model trains on a sample drawn at random from the buffer, and that sample's loss is then recomputed; with probability 1-p it trains on the new incoming sample. The buffer is kept sorted by loss, so the easiest entry is always cheap to find and evict. The result is a curriculum in which the model frequently revisits challenging examples. The method applies to both regression (using regression losses such as MAE) and classification (using cross-entropy or log loss).
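The buffer logic described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: MeanRegressor and HardSamplingSketch are hypothetical names, the loss is hard-coded to absolute error, size is assumed to be at least 1, and a sorted list maintained with bisect stands in for whatever sorted container the library uses.

```python
import bisect
import random


class MeanRegressor:
    """Toy online model (hypothetical): predicts the running mean of the targets."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict_one(self, x):
        return self.mean

    def learn_one(self, x, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n


class HardSamplingSketch:
    """Illustrative hard-sampling wrapper using absolute error as the loss."""

    def __init__(self, model, size, p, seed=None):
        self.model = model
        self.size = size  # buffer capacity, assumed >= 1
        self.p = p        # probability of training on a buffered sample
        self.rng = random.Random(seed)
        self.buffer = []  # (loss, tie_breaker, x, y), sorted by ascending loss
        self._counter = 0

    def _loss(self, x, y):
        return abs(y - self.model.predict_one(x))

    def _insert(self, loss, x, y):
        # The unique counter breaks ties so that the dicts are never compared.
        self._counter += 1
        bisect.insort(self.buffer, (loss, self._counter, x, y))

    def learn_one(self, x, y):
        loss = self._loss(x, y)
        if len(self.buffer) < self.size:
            self._insert(loss, x, y)
        elif loss > self.buffer[0][0]:
            self.buffer.pop(0)  # evict the easiest (lowest-loss) sample
            self._insert(loss, x, y)

        if self.rng.random() < self.p:
            # Replay a random buffered sample, then refresh its loss.
            i = self.rng.randrange(len(self.buffer))
            _, _, bx, by = self.buffer.pop(i)
            self.model.learn_one(bx, by)
            self._insert(self._loss(bx, by), bx, by)
        else:
            self.model.learn_one(x, y)


sampler = HardSamplingSketch(MeanRegressor(), size=3, p=0.5, seed=1)
for i, y in enumerate([1.0, 1.2, 0.9, 10.0, 1.1, 1.0, 9.5, 1.05]):
    sampler.learn_one({"t": i}, y)

# The buffer never exceeds its capacity and stays sorted by loss.
print([round(entry[0], 3) for entry in sampler.buffer])
```

Note that the wrapped model receives exactly one update per call to learn_one, whichever branch is taken, so the total training cost per step does not depend on p.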

Usage

Use Hard Sampling when your model struggles with certain kinds of examples and you want it to spend more updates on them. The size parameter sets the buffer capacity: larger buffers remember more hard cases but use more memory. The p parameter sets the trade-off between replaying hard samples and learning from new ones; a higher p means more focus on hard cases. The technique can be particularly useful for imbalanced problems where the model tends to favor the majority class, or when the data contains examples of clearly different difficulty. Choose a loss function that matches the problem type (regression or classification), since the loss defines which samples count as hard.
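To get a feel for the p trade-off, the share of updates drawn from the buffer can be simulated directly. The helper below is a hypothetical illustration (replay_fraction is not part of the library); it only models the per-step coin flip, on the assumption that each step independently replays a buffered sample with probability p.

```python
import random


def replay_fraction(p, n_steps, seed=0):
    """Fraction of steps on which a buffered sample would be replayed."""
    rng = random.Random(seed)
    replays = sum(rng.random() < p for _ in range(n_steps))
    return replays / n_steps


# Over many steps the replay share approaches p, so with p=0.2 roughly
# one update in five revisits a hard sample instead of the new one.
for p in (0.1, 0.2, 0.5):
    print(p, round(replay_fraction(p, n_steps=10_000), 3))
```

Because every replayed step is a step on which the incoming sample is not used for training, a large p trades freshness for repetition; the incoming sample can still enter the buffer and be learned from later.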

Code Reference

Source Location

Repository: Online_ml_River
File: river/imblearn/hard_sampling.py

Signature

class HardSamplingRegressor(
    regressor: base.Regressor,
    size: int,
    p: float,
    loss: optim.losses.RegressionLoss | None = None,
    seed: int | None = None,
)

class HardSamplingClassifier(
    classifier: base.Classifier,
    size: int,
    p: float,
    loss: optim.losses.BinaryLoss | optim.losses.MultiClassLoss | None = None,
    seed: int | None = None,
)

Import

from river import imblearn

I/O Contract

Input
Parameter	Type	Description
x	dict	Feature dictionary
y	Any	Target value or class label

Output
Method	Return Type	Description
predict_one(x)	Any	Delegates to wrapped model
predict_proba_one(x)	dict	Class probabilities (classifier only)
learn_one(x, y)	None	Trains on buffer sample with prob p, new sample with prob 1-p

Usage Examples

from river import datasets
from river import evaluate
from river import imblearn
from river import linear_model
from river import metrics
from river import preprocessing

# Regression example
model = (
    preprocessing.StandardScaler() |
    imblearn.HardSamplingRegressor(
        regressor=linear_model.LinearRegression(),
        p=0.2,
        size=30,
        seed=42,
    )
)

result = evaluate.progressive_val_score(
    datasets.TrumpApproval(),
    model,
    metrics.MAE(),
    print_every=500
)
print(result)  # MAE: 1.391246

# Classification example
model = (
    preprocessing.StandardScaler() |
    imblearn.HardSamplingClassifier(
        classifier=linear_model.LogisticRegression(),
        p=0.1,
        size=40,
        seed=42,
    )
)

result = evaluate.progressive_val_score(
    dataset=datasets.Phishing(),
    model=model,
    metric=metrics.ROCAUC(),
    print_every=500,
)
print(result)  # ROCAUC: 95.06%
