Principle:Duckdb Duckdb Approximate Distinct Counting

Knowledge Sources	HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm HyperLogLog in Practice
Domains	Probabilistic_Data_Structure, Cardinality_Estimation
Last Updated	2026-02-07 12:00 GMT

Overview

A probabilistic algorithm for estimating the number of distinct elements in a multiset using sub-linear space, providing approximate cardinality counts with controllable accuracy.

Description

HyperLogLog (HLL) is a probabilistic cardinality estimation algorithm that can count the number of distinct elements in a dataset using a fixed, small amount of memory. Invented by Philippe Flajolet et al. in 2007, HyperLogLog builds on the observation that the cardinality of a set can be estimated from the statistical properties of hash values computed from its elements.

The key insight is that if you hash each element to a uniformly distributed binary string, the maximum number of leading zeros observed across all hash values is a probabilistic indicator of the set size. If the longest run of leading zeros is k, then the estimated cardinality is approximately 2^k. HyperLogLog refines this idea using stochastic averaging: it divides elements into m substreams (registers) based on the first few bits of their hash, tracks the maximum leading-zero count in each register, and combines the per-register estimates using a harmonic mean.

With m registers (each storing a small integer), HyperLogLog uses only m * 6 bits of memory while providing cardinality estimates with a standard error of approximately 1.04 / sqrt(m). Using 2^14 = 16,384 registers (12 KB of memory), it achieves approximately 0.8% error, which is sufficient for most analytical purposes. The algorithm also supports merging of sketches, enabling distributed and parallel cardinality estimation.

Usage

HyperLogLog is used in DuckDB to implement the `APPROX_COUNT_DISTINCT` aggregate function. This function provides fast approximate distinct counts over large datasets where exact counting would be too memory-intensive. It is particularly useful in exploratory data analysis, dashboard queries, and scenarios where a close approximation is acceptable in exchange for significantly reduced memory and computation.

Theoretical Basis

Core Algorithm: The HyperLogLog estimation process:

// Initialization
m = 2^p                    // number of registers (p = precision bits)
registers = array[m] of 0  // each stores max leading zeros

// Add element
function add(element):
    h = hash(element)                    // 64-bit hash
    index = h >> (64 - p)               // first p bits -> register
    w = h << p                          // remaining bits
    registers[index] = max(registers[index], leading_zeros(w) + 1)

// Estimate cardinality
function count():
    Z = 1.0 / sum(2^(-registers[j]) for j in 0..m-1)  // harmonic mean
    E = alpha_m * m * m * Z     // raw estimate
    // alpha_m = correction constant depending on m
    return apply_corrections(E, m, registers)

Bias Corrections: Small and large range corrections:

// Small range correction (linear counting)
if E <= 2.5 * m:
    V = count registers equal to 0
    if V > 0:
        E = m * ln(m / V)    // linear counting estimate

// Large range correction (hash collision)
if E > 2^32 / 30:
    E = -2^32 * ln(1 - E / 2^32)

Merge Operation: Combining two HLL sketches:

// Merge sketch B into sketch A
function merge(A, B):
    for j in 0..m-1:
        A.registers[j] = max(A.registers[j], B.registers[j])
// The merged sketch estimates |A union B|
// This enables distributed counting across partitions

Error Bounds:

Standard error = 1.04 / sqrt(m)

Precision | Registers | Memory  | Error
   10     |   1024    | 768 B   | 3.25%
   12     |   4096    |  3 KB   | 1.63%
   14     |  16384    | 12 KB   | 0.81%
   16     |  65536    | 48 KB   | 0.41%

Related Pages

Implementation:Duckdb_Duckdb_HyperLogLog

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment