Implementation:Online ml River Imblearn HardSampling
| Knowledge Sources | |
|---|---|
| Domains | Online_Learning, Imbalanced_Learning, Active_Learning, Curriculum_Learning |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
Hard Sampling maintains a buffer of difficult-to-predict samples and probabilistically retrains on them, helping models learn from their mistakes.
Description
This method keeps a fixed-size buffer of the hardest samples seen so far, where hardness is measured by the prediction loss. When a new sample arrives, its loss is computed with the current model. If the buffer is not yet full, the sample is added; once the buffer is full, the sample replaces the buffered sample with the smallest loss, but only if its own loss is larger. At each step, with probability p the model trains on a sample drawn at random from the buffer, after which that sample's loss is recomputed; with probability 1-p it trains on the new incoming sample instead. The buffer is kept sorted by loss, so the smallest-loss entry (the eviction candidate) is cheap to find. The result is a curriculum in which the model frequently revisits challenging examples. The approach works for regression (with a regression loss such as MAE) and for classification (with cross-entropy or log loss).
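The steps described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not River's actual implementation: `HardSamplingSketch` and the toy `RunningMean` regressor are hypothetical names, the loss is fixed to absolute error, and `size` is assumed to be at least 1.

```python
import bisect
import random


class RunningMean:
    """Toy regressor (hypothetical): predicts the running mean of seen targets."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict_one(self, x):
        return self.mean

    def learn_one(self, x, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n


class HardSamplingSketch:
    """Simplified hard-sampling wrapper; an illustration, not River's code."""

    def __init__(self, model, size, p, seed=None):
        self.model = model
        self.size = size  # buffer capacity, assumed >= 1
        self.p = p        # probability of training on a buffered sample
        self.buffer = []  # (loss, tie_breaker, x, y), sorted by ascending loss
        self._rng = random.Random(seed)
        self._count = 0   # unique tie-breaker so the feature dicts are never compared

    def _loss(self, x, y):
        # Absolute error of the current model on (x, y)
        return abs(y - self.model.predict_one(x))

    def _add(self, x, y):
        self._count += 1
        bisect.insort(self.buffer, (self._loss(x, y), self._count, x, y))

    def learn_one(self, x, y):
        loss = self._loss(x, y)
        if len(self.buffer) < self.size:
            self._add(x, y)              # buffer not full yet: always store
        elif loss > self.buffer[0][0]:
            self.buffer.pop(0)           # evict the easiest stored sample
            self._add(x, y)

        if self._rng.random() < self.p:
            # Replay: train on a random buffered sample, then refresh its loss
            _, _, bx, by = self.buffer.pop(self._rng.randrange(len(self.buffer)))
            self.model.learn_one(bx, by)
            self._add(bx, by)
        else:
            # Otherwise train on the incoming sample
            self.model.learn_one(x, y)


model = HardSamplingSketch(RunningMean(), size=3, p=0.5, seed=42)
for i in range(20):
    model.learn_one({"i": i}, float(i % 5))

print(len(model.buffer), model.model.n)
```

Each call to `learn_one` performs exactly one model update, so the wrapper changes which sample is used for training, not how many updates are made.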
Usage
Use Hard Sampling when your model struggles with certain kinds of examples and you want it to spend more updates on them. The size parameter sets the buffer capacity: a larger buffer remembers more hard cases but uses more memory. The p parameter is the probability of training on a buffered sample rather than the new one, so a higher p puts more weight on hard cases and less on fresh data. The technique is particularly useful for imbalanced problems, where the model tends to favor the majority class, and for data with distinct difficulty levels. Choose a loss that matches the task (a regression loss for HardSamplingRegressor, a binary or multiclass loss for HardSamplingClassifier); the loss defines which samples count as hard.
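To get a feel for p, the following sketch simulates how often a buffered sample would be chosen over the incoming one. It assumes each step makes an independent draw with probability p, as described above; `replay_fraction` is a hypothetical helper and not part of River.

```python
import random


def replay_fraction(p, n_steps, seed=None):
    """Fraction of steps on which a buffered sample is used instead of the new one."""
    rng = random.Random(seed)
    # rng.random() lies in [0, 1), so p=0 never replays and p=1 always does
    replays = sum(rng.random() < p for _ in range(n_steps))
    return replays / n_steps


for p in (0.1, 0.2, 0.5):
    print(p, round(replay_fraction(p, n_steps=10_000, seed=42), 3))
```

Over a long stream the observed fraction stays close to p, so with p=0.2 roughly one update in five revisits a hard sample and the remaining four use fresh data.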
Code Reference
Source Location
- Repository: Online_ml_River
- File: river/imblearn/hard_sampling.py
Signature
class HardSamplingRegressor(
    regressor: base.Regressor,
    size: int,
    p: float,
    loss: optim.losses.RegressionLoss | None = None,
    seed: int | None = None,
)

class HardSamplingClassifier(
    classifier: base.Classifier,
    size: int,
    p: float,
    loss: optim.losses.BinaryLoss | optim.losses.MultiClassLoss | None = None,
    seed: int | None = None,
)
Import
from river import imblearn
I/O Contract
| Parameter | Type | Description |
|---|---|---|
| x | dict | Feature dictionary |
| y | Any | Target value or class label |

| Method | Return Type | Description |
|---|---|---|
| predict_one(x) | Any | Delegates to wrapped model |
| predict_proba_one(x) | dict | Class probabilities (classifier only) |
| learn_one(x, y) | None | Trains on buffer sample with prob p, new sample with prob 1-p |
Usage Examples
from river import datasets
from river import evaluate
from river import imblearn
from river import linear_model
from river import metrics
from river import preprocessing
# Regression example
model = (
    preprocessing.StandardScaler() |
    imblearn.HardSamplingRegressor(
        regressor=linear_model.LinearRegression(),
        p=0.2,    # train on a buffered sample 20% of the time
        size=30,  # keep the 30 hardest samples
        seed=42,
    )
)

result = evaluate.progressive_val_score(
    datasets.TrumpApproval(),
    model,
    metrics.MAE(),
    print_every=500,
)
print(result)  # MAE: 1.391246
# Classification example
model = (
    preprocessing.StandardScaler() |
    imblearn.HardSamplingClassifier(
        classifier=linear_model.LogisticRegression(),
        p=0.1,    # train on a buffered sample 10% of the time
        size=40,  # keep the 40 hardest samples
        seed=42,
    )
)

result = evaluate.progressive_val_score(
    dataset=datasets.Phishing(),
    model=model,
    metric=metrics.ROCAUC(),
    print_every=500,
)
print(result)  # ROCAUC: 95.06%