Implementation:Online ml River Sketch Histogram
| Knowledge Sources | |
|---|---|
| Domains | Online_Learning, Streaming_Algorithms, Statistics |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
A streaming histogram that uses adaptive binning and bounded memory to estimate a probability distribution online.
Description
The Histogram class implements a streaming histogram that approximates a distribution using at most max_bins bins. When adding a value would exceed that capacity, the two closest bins are merged, so memory stays bounded while CDF and probability estimates remain available at any point in the stream. It is based on the algorithm of Ben-Haim and Tom-Tov and is intended for unbounded data streams; the estimates are approximate, and a larger max_bins gives a finer approximation at the cost of more memory.
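The merge-on-overflow idea can be illustrated with a minimal sketch. This is not River's implementation: `ToyHistogram` is a hypothetical class written for illustration, and it assumes centroid-style bins (a center and a count) as in the Ben-Haim and Tom-Tov paper, whereas River's own bin representation may differ.

```python
import bisect


class ToyHistogram:
    """Hypothetical toy class for illustration only (not River's Histogram)."""

    def __init__(self, max_bins=4):
        self.max_bins = max_bins
        self.bins = []  # sorted list of [center, count] pairs

    def update(self, x):
        # Locate the insertion point that keeps the bins sorted by center.
        centers = [center for center, _ in self.bins]
        i = bisect.bisect_left(centers, x)

        # An exact match only increments the existing bin's count.
        if i < len(self.bins) and self.bins[i][0] == x:
            self.bins[i][1] += 1
            return

        self.bins.insert(i, [x, 1])

        # Over capacity: merge the adjacent pair with the smallest gap.
        if len(self.bins) > self.max_bins:
            j = min(
                range(len(self.bins) - 1),
                key=lambda k: self.bins[k + 1][0] - self.bins[k][0],
            )
            (c1, n1), (c2, n2) = self.bins[j], self.bins[j + 1]
            merged = [(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]
            self.bins[j:j + 2] = [merged]


toy = ToyHistogram(max_bins=3)
for value in [1, 2, 10, 11, 20]:
    toy.update(value)
print(toy.bins)
```

The merged bin takes the count-weighted mean of the two centers, so the total count is preserved and the number of bins never exceeds `max_bins`.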
Usage
Use this for online quantile estimation, anomaly detection, or understanding data distributions in streaming settings where storing all values is impractical. It is particularly useful for monitoring system metrics, detecting distribution shifts, or adaptive binning in decision trees.
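For quantile estimation, one possible approach is to invert the CDF numerically. The sketch below is an assumption about how this could be done, not a method provided by the class: `approx_quantile` is a hypothetical helper that bisects on any non-decreasing CDF callable, and `uniform_cdf` is a stand-in used only to keep the example self-contained.

```python
def approx_quantile(cdf, q, lo, hi, tol=1e-6, max_iter=100):
    """Hypothetical helper: find x in [lo, hi] with cdf(x) close to q.

    Assumes cdf is non-decreasing and that [lo, hi] brackets the quantile.
    """
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        if cdf(mid) < q:
            lo = mid  # the quantile lies to the right of mid
        else:
            hi = mid  # the quantile lies at or to the left of mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2


def uniform_cdf(x):
    """Stand-in CDF of a uniform distribution on [0, 10]."""
    return min(max(x / 10, 0.0), 1.0)


median = approx_quantile(uniform_cdf, 0.5, lo=0.0, hi=10.0)
print(f"approximate median: {median:.3f}")
```

With a fitted histogram, the same helper could presumably be called as `approx_quantile(hist.cdf, 0.5, lo, hi)`, with `lo` and `hi` chosen to bracket the observed values. The result is only as precise as the histogram's CDF approximation.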
Code Reference
Source Location
- Repository: Online_ml_River
- File: river/sketch/histogram.py
Signature
class Histogram(collections.UserList, base.Base):
    def __init__(self, max_bins=256):
        ...
    def update(self, x):
        ...
    def cdf(self, x):
        ...
    def iter_cdf(self, X, verbose=False):
        ...
Import
from river import sketch
I/O Contract
| Parameter | Type | Description |
|---|---|---|
| max_bins | int | Maximum number of bins (default: 256) |
Methods:
| Method | Returns | Description |
|---|---|---|
| update(x) | None | Add a value to the histogram |
| cdf(x) | float | Cumulative distribution function at x |
| iter_cdf(X) | Iterator[float] | CDF values for sorted iterable X |
Usage Examples
from river import sketch
import numpy as np
np.random.seed(42)
# Create mixture of two normal distributions
values = np.hstack((
    np.random.normal(-3, 1, 1000),
    np.random.normal(3, 1, 1000),
))
# Build streaming histogram
hist = sketch.Histogram(max_bins=15)
for x in values:
    hist.update(x)
# View bins
print("Histogram bins:")
for bin_ in hist:  # trailing underscore avoids shadowing the built-in bin()
    print(bin_)
# Compute CDF values
print(f"\nCDF(-3) = {hist.cdf(-3):.4f}")
print(f"CDF(0) = {hist.cdf(0):.4f}")
print(f"CDF(3) = {hist.cdf(3):.4f}")
# Efficient CDF computation for multiple values (X must be sorted)
X = [-6, -3, 0, 3, 6]
print("\nCDF for multiple values:")
for x, cdf_val in zip(X, hist.iter_cdf(X)):
    print(f"CDF({x:2d}) = {cdf_val:.4f}")