Principle:Online ml River Probabilistic Data Structures

Knowledge Sources	Streaming Algorithms Approximate Frequency Counts over Data Streams
Domains	Online_Learning, Streaming_Algorithms
Last Updated	2026-02-08 18:00 GMT

Overview

Probabilistic data structures (also called sketches) are compact, fixed-memory summaries that answer queries about a data stream with bounded error guarantees. They trade exact answers for massive reductions in memory, making them indispensable when processing unbounded data streams where storing all elements is infeasible.

These structures underpin many streaming analytics tasks: frequency estimation, cardinality counting, quantile approximation, and heavy hitter detection.

Theoretical Basis

Frequency Estimation (Count-Min Sketch)

The Count-Min Sketch estimates the frequency of any element in a stream using a two-dimensional array of counters with d hash functions and w buckets per row:

Update: for each hash function h_j, increment table[j][h_j(x)]
Query:  f_hat(x) = min_j table[j][h_j(x)]

The estimate is always an overcount. With d = ln(1/delta) and w = e/epsilon, the error is at most epsilon * N with probability 1 - delta, where N is the total count.

Heavy Hitter Detection

Heavy hitters are elements whose frequency exceeds a threshold fraction of the stream. Algorithms like Space-Saving and Misra-Gries maintain a small set of candidate elements with approximate counts, evicting the least frequent when capacity is reached. They guarantee finding all true heavy hitters with bounded false positives.

Streaming Histograms

Online histograms approximate the distribution of a numerical stream using a fixed number of bins. As new values arrive, bins are merged or split to maintain the approximation quality. This enables approximate quantile, rank, and density queries with bounded memory.

Set Membership and Cardinality

Bloom filters: Test whether an element has been seen before with no false negatives and a controllable false positive rate.
HyperLogLog: Estimates the number of distinct elements (cardinality) using O(log log N) memory per register.

Online set structures track the unique elements seen so far, supporting membership queries and approximate cardinality estimation.

Error Guarantees

Structure	Query	Space	Error Bound
Count-Min Sketch	Frequency	O(1/epsilon * log(1/delta))	epsilon * N
Space-Saving	Heavy hitters	O(1/epsilon)	epsilon * N
Streaming Histogram	Quantiles	O(B) bins	Depends on merge policy
HyperLogLog	Cardinality	O(m) registers	~1.04 / sqrt(m)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment