Principle:Online ml River Probabilistic Data Structures
| Knowledge Sources | |
|---|---|
| Domains | Online_Learning, Streaming_Algorithms |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
Probabilistic data structures (also called sketches) are compact, fixed-memory summaries that answer queries about a data stream with bounded error guarantees. They trade exact answers for massive reductions in memory, making them indispensable when processing unbounded data streams where storing all elements is infeasible.
These structures underpin many streaming analytics tasks: frequency estimation, cardinality counting, quantile approximation, and heavy hitter detection.
Theoretical Basis
Frequency Estimation (Count-Min Sketch)
The Count-Min Sketch estimates the frequency of any element in a stream using a two-dimensional array of counters with d hash functions and w buckets per row:
Update: for each hash function h_j, increment table[j][h_j(x)]
Query: f_hat(x) = min_j table[j][h_j(x)]
The estimate is always an overcount. With d = ln(1/delta) and w = e/epsilon, the error is at most epsilon * N with probability 1 - delta, where N is the total count.
Heavy Hitter Detection
Heavy hitters are elements whose frequency exceeds a threshold fraction of the stream. Algorithms like Space-Saving and Misra-Gries maintain a small set of candidate elements with approximate counts, evicting the least frequent when capacity is reached. They guarantee finding all true heavy hitters with bounded false positives.
Streaming Histograms
Online histograms approximate the distribution of a numerical stream using a fixed number of bins. As new values arrive, bins are merged or split to maintain the approximation quality. This enables approximate quantile, rank, and density queries with bounded memory.
Set Membership and Cardinality
- Bloom filters: Test whether an element has been seen before with no false negatives and a controllable false positive rate.
- HyperLogLog: Estimates the number of distinct elements (cardinality) using O(log log N) memory per register.
Online set structures track the unique elements seen so far, supporting membership queries and approximate cardinality estimation.
Error Guarantees
| Structure | Query | Space | Error Bound |
|---|---|---|---|
| Count-Min Sketch | Frequency | O(1/epsilon * log(1/delta)) | epsilon * N |
| Space-Saving | Heavy hitters | O(1/epsilon) | epsilon * N |
| Streaming Histogram | Quantiles | O(B) bins | Depends on merge policy |
| HyperLogLog | Cardinality | O(m) registers | ~1.04 / sqrt(m) |