Implementation:Online ml River Sketch Histogram

Knowledge Sources	Online_ml_River
Domains	Online_Learning, Streaming_Algorithms, Statistics
Last Updated	2026-02-08 16:00 GMT

Overview

A streaming histogram that uses adaptive binning to estimate a probability distribution online within a fixed memory budget.

Description

The Histogram class implements a streaming histogram that approximates a distribution with at most max_bins bins. When a new value would push the bin count past that limit, the two closest bins are merged, so memory stays bounded while CDF and probability estimates remain available at any point in the stream. The approach is based on Ben-Haim and Tom-Tov's streaming histogram algorithm and trades exact counts for an approximation that can keep up with unbounded data streams.
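The merge step described above can be illustrated with a minimal, standalone sketch. It is not River's code: it assumes a simplified representation in which each bin is a [centroid, count] pair, and the helper name add_value is hypothetical.

```python
import bisect

def add_value(bins, x, max_bins):
    """Insert x into a sorted list of [centroid, count] bins.

    If the list then exceeds max_bins, merge the two adjacent bins
    whose centroids are closest together.
    """
    centroids = [c for c, _ in bins]
    i = bisect.bisect_left(centroids, x)
    if i < len(bins) and bins[i][0] == x:
        bins[i][1] += 1            # exact match: bump the count
    else:
        bins.insert(i, [x, 1])     # otherwise open a new single-point bin

    if len(bins) > max_bins:
        # Index of the adjacent pair with the smallest centroid gap
        j = min(range(len(bins) - 1), key=lambda k: bins[k + 1][0] - bins[k][0])
        (c1, n1), (c2, n2) = bins[j], bins[j + 1]
        # Replace the pair with one bin at the count-weighted mean
        bins[j:j + 2] = [[(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]]
    return bins

bins = []
for x in [1.0, 2.0, 10.0, 2.1, 9.5, 1.2]:
    add_value(bins, x, max_bins=3)
```

However many values arrive, the list never holds more than max_bins entries, and the counts still sum to the number of values seen.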

Usage

Use this for online quantile estimation, anomaly detection, or understanding data distributions in streaming settings where storing all values is impractical. It is particularly useful for monitoring system metrics, detecting distribution shifts, and adaptive binning in decision trees.
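Quantile estimation can be built on top of any CDF estimate by inverting it numerically. The sketch below assumes only a non-decreasing callable cdf and a bracketing interval; the helper quantile_from_cdf and the stand-in uniform_cdf are hypothetical names introduced for illustration and are not part of River.

```python
def quantile_from_cdf(cdf, q, lo, hi, tol=1e-6):
    """Approximate the q-quantile by bisecting a non-decreasing CDF on [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < q:
            lo = mid   # quantile lies to the right of mid
        else:
            hi = mid   # quantile lies at or to the left of mid
    return (lo + hi) / 2

# Stand-in CDF of a uniform distribution on [0, 10], used only for the demo
def uniform_cdf(x):
    return min(max(x / 10, 0.0), 1.0)

median = quantile_from_cdf(uniform_cdf, 0.5, lo=0.0, hi=10.0)
print(f"median ~ {median:.3f}")
```

With a fitted histogram, hist.cdf could presumably be passed in place of uniform_cdf, with lo and hi chosen to bracket the observed data.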

Code Reference

Source Location

Repository: Online_ml_River
File: river/sketch/histogram.py

Signature

class Histogram(collections.UserList, base.Base):
    def __init__(self, max_bins=256):
        ...

    def update(self, x):
        ...

    def cdf(self, x):
        ...

    def iter_cdf(self, X, verbose=False):
        ...

Import

from river import sketch

I/O Contract

Parameter	Type	Description
max_bins	int	Maximum number of bins (default: 256)

Methods:

Method	Returns	Description
update(x)	None	Add a value to the histogram
cdf(x)	float	Cumulative distribution function at x
iter_cdf(X)	Iterator[float]	Yields the CDF value for each element of X; X must be sorted in ascending order
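The sorted-input requirement makes sense if the CDF is computed in a single left-to-right pass over the bins. The following is a sketch of that idea under stated assumptions, not River's implementation: bins are taken to be non-overlapping (left, right, count) tuples in ascending order, mass is interpolated linearly inside a bin, and iter_cdf_sketch is a hypothetical name.

```python
def iter_cdf_sketch(bins, queries):
    """Yield an approximate CDF value for each query, given in ascending order.

    bins: list of (left, right, count) tuples, sorted and non-overlapping.
    """
    total = sum(count for _, _, count in bins)
    seen = 0   # mass of the bins lying entirely at or below the current query
    i = 0      # index of the first bin not yet fully passed
    for x in queries:
        # Skip bins that end at or before x; this never rewinds,
        # which is why the queries have to be sorted.
        while i < len(bins) and bins[i][1] <= x:
            seen += bins[i][2]
            i += 1
        if i == len(bins) or x < bins[i][0]:
            # x is past the last bin, before the first, or in a gap
            yield seen / total
        else:
            left, right, count = bins[i]
            # Linear interpolation of the mass inside the current bin
            yield (seen + count * (x - left) / (right - left)) / total

demo_bins = [(0.0, 1.0, 2), (2.0, 4.0, 2)]
demo_cdf = list(iter_cdf_sketch(demo_bins, [-1.0, 0.5, 1.5, 3.0, 5.0]))
```

Each bin is visited at most once across all queries, so evaluating many sorted points costs one pass rather than one pass per point.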

Usage Examples

from river import sketch
import numpy as np

np.random.seed(42)

# Create mixture of two normal distributions
values = np.hstack((
    np.random.normal(-3, 1, 1000),
    np.random.normal(3, 1, 1000),
))

# Build streaming histogram
hist = sketch.Histogram(max_bins=15)

for x in values:
    hist.update(x)

# View bins
print("Histogram bins:")
for b in hist:  # `b` avoids shadowing the built-in bin()
    print(b)

# Compute CDF values
print(f"\nCDF(-3) = {hist.cdf(-3):.4f}")
print(f"CDF(0) = {hist.cdf(0):.4f}")
print(f"CDF(3) = {hist.cdf(3):.4f}")

# Efficient CDF computation for multiple values (X must be sorted)
X = [-6, -3, 0, 3, 6]
print("\nCDF for multiple values:")
for x, cdf_val in zip(X, hist.iter_cdf(X)):
    print(f"CDF({x:2d}) = {cdf_val:.4f}")
