Implementation:Online ml River Sketch Histogram

From Leeroopedia
Revision as of 16:10, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Online_ml_River_Sketch_Histogram.md)


Knowledge Sources
Domains: Online_Learning, Streaming_Algorithms, Statistics
Last Updated: 2026-02-08 16:00 GMT

Overview

Streaming histogram using adaptive binning with fixed memory for online probability distribution estimation.

Description

The Histogram class implements a streaming histogram that approximates a distribution with at most max_bins bins. When a new value pushes the histogram past that capacity, the two closest bins are merged, so memory stays bounded however many values are seen. The bins can then be queried for CDF and probability estimates. The approach is based on Ben-Haim and Tom-Tov's streaming histogram algorithm. Estimates are approximate: a larger max_bins gives finer resolution at the cost of more memory.
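The merge step described above can be illustrated with a small pure-Python sketch. This is not River's code: it is a simplified centroid-based variant in the spirit of Ben-Haim and Tom-Tov, and the names ToyHistogram and _merge_closest are hypothetical, introduced only for illustration.

```python
import bisect


class ToyHistogram:
    """Simplified centroid-based streaming histogram (illustrative only)."""

    def __init__(self, max_bins=4):
        self.max_bins = max_bins
        self.bins = []  # sorted list of [centroid, count]

    def update(self, x):
        # Locate where x belongs among the sorted centroids.
        centroids = [c for c, _ in self.bins]
        i = bisect.bisect_left(centroids, x)
        # An exact match only needs its count incremented.
        if i < len(self.bins) and self.bins[i][0] == x:
            self.bins[i][1] += 1
            return
        # Otherwise the value becomes its own bin.
        self.bins.insert(i, [x, 1])
        # Over capacity: merge the two closest neighbours.
        if len(self.bins) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self):
        # Index of the adjacent pair with the smallest centroid gap.
        j = min(
            range(len(self.bins) - 1),
            key=lambda k: self.bins[k + 1][0] - self.bins[k][0],
        )
        (c1, n1), (c2, n2) = self.bins[j], self.bins[j + 1]
        # The merged centroid is the count-weighted mean of the pair.
        merged = [(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]
        self.bins[j:j + 2] = [merged]


toy = ToyHistogram(max_bins=3)
for value in [1.0, 2.0, 10.0, 11.0, 2.5]:
    toy.update(value)

for centroid, count in toy.bins:
    print(f"centroid={centroid:.3f} count={count}")
```

The number of bins never exceeds max_bins, and the total count is preserved across merges, which is what keeps CDF estimates meaningful after any number of updates.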

Usage

Use this for online quantile estimation, anomaly detection, or understanding data distributions in streaming settings where storing all values is impractical. It is particularly useful for monitoring system metrics, detecting distribution shifts, and adaptive binning in decision trees.
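For the quantile estimation use case, one option is to invert a CDF numerically. The sketch below is a minimal example, assuming only that the CDF is a non-decreasing callable. The helper approx_quantile and the stand-in uniform_cdf are hypothetical names, not part of River.

```python
def approx_quantile(cdf, q, lo, hi, tol=1e-6):
    """Find x in [lo, hi] with cdf(x) close to q by bisection.

    Assumes cdf is non-decreasing on [lo, hi].
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < q:
            lo = mid  # the quantile lies to the right of mid
        else:
            hi = mid  # the quantile lies at or to the left of mid
    return (lo + hi) / 2


def uniform_cdf(x):
    # Stand-in CDF for demonstration: uniform distribution on [0, 10].
    return min(max(x / 10.0, 0.0), 1.0)


median = approx_quantile(uniform_cdf, 0.5, lo=0.0, hi=10.0)
p90 = approx_quantile(uniform_cdf, 0.9, lo=0.0, hi=10.0)
print(f"median ~ {median:.3f}, p90 ~ {p90:.3f}")
```

With a populated histogram, the bound method hist.cdf could presumably be passed as the cdf argument, with lo and hi set to the smallest and largest values seen so far (tracked separately). The result inherits the approximation error of the histogram.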

Code Reference

Source Location

Signature

class Histogram(collections.UserList, base.Base):
    def __init__(self, max_bins=256):
        ...

    def update(self, x):
        ...

    def cdf(self, x):
        ...

    def iter_cdf(self, X, verbose=False):
        ...

Import

from river import sketch

I/O Contract

Parameter   Type   Description
max_bins    int    Maximum number of bins (default: 256)

Methods:

Method                      Returns           Description
update(x)                   None              Add a value to the histogram
cdf(x)                      float             Cumulative distribution function evaluated at x
iter_cdf(X, verbose=False)  Iterator[float]   Yields the CDF value for each element of a sorted iterable X
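The requirement that X be sorted suggests why a batch query can be cheaper than repeated cdf calls: with sorted queries, the bins only need to be scanned once. The sketch below illustrates that idea with a step-function CDF over (centroid, count) pairs. It is an assumption-laden illustration, not River's implementation, and iter_step_cdf is a hypothetical name.

```python
def iter_step_cdf(bins, queries):
    """Yield the fraction of mass whose centroid is <= each query.

    bins: list of (centroid, count) pairs sorted by centroid.
    queries: iterable of values sorted in ascending order.
    """
    total = sum(count for _, count in bins)
    seen = 0  # mass accumulated so far
    i = 0     # index of the next bin not yet counted
    for q in queries:
        # Both inputs are sorted, so the bin pointer never moves backwards.
        while i < len(bins) and bins[i][0] <= q:
            seen += bins[i][1]
            i += 1
        yield seen / total


bins = [(1.0, 2), (5.0, 1), (9.0, 1)]
print(list(iter_step_cdf(bins, [0, 1, 6, 10])))  # [0.0, 0.5, 0.75, 1.0]
```

Each bin is visited at most once, so the cost is linear in the number of bins plus the number of queries rather than their product.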

Usage Examples

from river import sketch
import numpy as np

np.random.seed(42)

# Create mixture of two normal distributions
values = np.hstack((
    np.random.normal(-3, 1, 1000),
    np.random.normal(3, 1, 1000),
))

# Build streaming histogram
hist = sketch.Histogram(max_bins=15)

for x in values:
    hist.update(x)

# View bins (the histogram behaves like a list of bins)
print("Histogram bins:")
for hist_bin in hist:  # avoid shadowing the built-in bin()
    print(hist_bin)

# Compute CDF values
print(f"\nCDF(-3) = {hist.cdf(-3):.4f}")
print(f"CDF(0) = {hist.cdf(0):.4f}")
print(f"CDF(3) = {hist.cdf(3):.4f}")

# Efficient CDF computation for multiple values (X must be sorted)
X = [-6, -3, 0, 3, 6]
print("\nCDF for multiple values:")
for x, cdf_val in zip(X, hist.iter_cdf(X)):
    print(f"CDF({x:2d}) = {cdf_val:.4f}")
