Principle:Duckdb Duckdb Approximate Distinct Counting
| Knowledge Sources | |
|---|---|
| Domains | Probabilistic_Data_Structure, Cardinality_Estimation |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
A probabilistic algorithm for estimating the number of distinct elements in a multiset using sub-linear space, providing approximate cardinality counts with controllable accuracy.
Description
HyperLogLog (HLL) is a probabilistic cardinality estimation algorithm that can count the number of distinct elements in a dataset using a fixed, small amount of memory. Invented by Philippe Flajolet et al. in 2007, HyperLogLog builds on the observation that the cardinality of a set can be estimated from the statistical properties of hash values computed from its elements.
The key insight is that if you hash each element to a uniformly distributed binary string, the maximum number of leading zeros observed across all hash values is a probabilistic indicator of the set size. If the longest run of leading zeros is k, then the estimated cardinality is approximately 2^k. HyperLogLog refines this idea using stochastic averaging: it divides elements into m substreams (registers) based on the first few bits of their hash, tracks the maximum leading-zero count in each register, and combines the per-register estimates using a harmonic mean.
With m registers (each storing a small integer), HyperLogLog uses only m * 6 bits of memory while providing cardinality estimates with a standard error of approximately 1.04 / sqrt(m). Using 2^14 = 16,384 registers (12 KB of memory), it achieves approximately 0.8% error, which is sufficient for most analytical purposes. The algorithm also supports merging of sketches, enabling distributed and parallel cardinality estimation.
Usage
HyperLogLog is used in DuckDB to implement the `APPROX_COUNT_DISTINCT` aggregate function. This function provides fast approximate distinct counts over large datasets where exact counting would be too memory-intensive. It is particularly useful in exploratory data analysis, dashboard queries, and scenarios where a close approximation is acceptable in exchange for significantly reduced memory and computation.
Theoretical Basis
Core Algorithm: The HyperLogLog estimation process:
// Initialization
m = 2^p // number of registers (p = precision bits)
registers = array[m] of 0 // each stores max leading zeros
// Add element
function add(element):
h = hash(element) // 64-bit hash
index = h >> (64 - p) // first p bits -> register
w = h << p // remaining bits
registers[index] = max(registers[index], leading_zeros(w) + 1)
// Estimate cardinality
function count():
Z = 1.0 / sum(2^(-registers[j]) for j in 0..m-1) // harmonic mean
E = alpha_m * m * m * Z // raw estimate
// alpha_m = correction constant depending on m
return apply_corrections(E, m, registers)
Bias Corrections: Small and large range corrections:
// Small range correction (linear counting)
if E <= 2.5 * m:
V = count registers equal to 0
if V > 0:
E = m * ln(m / V) // linear counting estimate
// Large range correction (hash collision)
if E > 2^32 / 30:
E = -2^32 * ln(1 - E / 2^32)
Merge Operation: Combining two HLL sketches:
// Merge sketch B into sketch A
function merge(A, B):
for j in 0..m-1:
A.registers[j] = max(A.registers[j], B.registers[j])
// The merged sketch estimates |A union B|
// This enables distributed counting across partitions
Error Bounds:
Standard error = 1.04 / sqrt(m)
Precision | Registers | Memory | Error
10 | 1024 | 768 B | 3.25%
12 | 4096 | 3 KB | 1.63%
14 | 16384 | 12 KB | 0.81%
16 | 65536 | 48 KB | 0.41%