Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Online ml River Probabilistic Data Structures

From Leeroopedia
Revision as of 17:49, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Online_ml_River_Probabilistic_Data_Structures.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)


Knowledge Sources
Domains Online_Learning, Streaming_Algorithms
Last Updated 2026-02-08 18:00 GMT

Overview

Probabilistic data structures (also called sketches) are compact, fixed-memory summaries that answer queries about a data stream with bounded error guarantees. They trade exact answers for massive reductions in memory, making them indispensable when processing unbounded data streams where storing all elements is infeasible.

These structures underpin many streaming analytics tasks: frequency estimation, cardinality counting, quantile approximation, and heavy hitter detection.

Theoretical Basis

Frequency Estimation (Count-Min Sketch)

The Count-Min Sketch estimates the frequency of any element in a stream using a two-dimensional array of counters with d hash functions and w buckets per row:

Update: for each hash function h_j, increment table[j][h_j(x)]
Query:  f_hat(x) = min_j table[j][h_j(x)]

The estimate is always an overcount. With d = ln(1/delta) and w = e/epsilon, the error is at most epsilon * N with probability 1 - delta, where N is the total count.

Heavy Hitter Detection

Heavy hitters are elements whose frequency exceeds a threshold fraction of the stream. Algorithms like Space-Saving and Misra-Gries maintain a small set of candidate elements with approximate counts, evicting the least frequent when capacity is reached. They guarantee finding all true heavy hitters with bounded false positives.

Streaming Histograms

Online histograms approximate the distribution of a numerical stream using a fixed number of bins. As new values arrive, bins are merged or split to maintain the approximation quality. This enables approximate quantile, rank, and density queries with bounded memory.

Set Membership and Cardinality

  • Bloom filters: Test whether an element has been seen before with no false negatives and a controllable false positive rate.
  • HyperLogLog: Estimates the number of distinct elements (cardinality) using O(log log N) memory per register.

Online set structures track the unique elements seen so far, supporting membership queries and approximate cardinality estimation.

Error Guarantees

Structure Query Space Error Bound
Count-Min Sketch Frequency O(1/epsilon * log(1/delta)) epsilon * N
Space-Saving Heavy hitters O(1/epsilon) epsilon * N
Streaming Histogram Quantiles O(B) bins Depends on merge policy
HyperLogLog Cardinality O(m) registers ~1.04 / sqrt(m)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment