Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Duckdb Duckdb Quantile Estimation

From Leeroopedia


Knowledge Sources
Domains Probabilistic_Data_Structure, Statistics
Last Updated 2026-02-07 12:00 GMT

Overview

A streaming algorithm for estimating quantiles and percentiles of a data distribution using a compact, mergeable digest structure that provides higher accuracy at the tails of the distribution.

Description

The T-Digest is a data structure and algorithm for computing approximate quantiles over streaming or distributed data, designed by Ted Dunning. It addresses a fundamental challenge in data analytics: computing exact quantiles requires storing all data points (or at least a sorted representation), which is infeasible for large datasets. T-Digest provides a compact summary that typically uses only a few kilobytes of memory regardless of the input size.

T-Digest works by maintaining a sorted collection of centroids, where each centroid represents a cluster of nearby data values with a count of how many values it represents. The key innovation is its scale function, which controls how centroids are allowed to merge. The scale function permits larger centroids in the middle of the distribution but forces smaller centroids near the tails (q near 0 or 1). This non-uniform resolution ensures that extreme quantiles (like p99 or p99.9) are estimated with higher accuracy than middle quantiles, which is exactly what most monitoring and analytics applications need.

The algorithm supports three key operations: adding individual values (or weighted values), merging two T-Digests into one, and querying for a quantile value. The merge operation makes T-Digest naturally suitable for distributed computation, where each partition computes a local digest that is later merged into a global result.

Usage

T-Digest is used in DuckDB to implement the `APPROX_QUANTILE` aggregate function. This provides approximate percentile calculations over large datasets without requiring the full dataset to be sorted or stored in memory. It is valuable for analytical queries computing p50, p95, p99 latency percentiles, distribution analysis, and exploratory statistics on columns too large for exact quantile computation.

Theoretical Basis

Centroid Representation: The digest is a sorted list of centroids:

Centroid = (mean, count)
T-Digest = sorted list of centroids [c1, c2, ..., cn]
  where c1.mean <= c2.mean <= ... <= cn.mean
Total count N = sum of all centroid counts

Scale Function: Controls centroid size by quantile position:

// Scale function k(q, delta) maps quantile q in [0,1] to scale space
// k1(q) = (delta / 2pi) * arcsin(2q - 1)   (default)
// This allows smaller centroids at tails, larger in middle

// Maximum centroid size at quantile q:
max_count(q, delta, N) proportional to:
    (4 * N / delta) * q * (1 - q)
// At q=0.5 (median): largest centroids
// At q=0 or q=1 (tails): smallest centroids -> highest accuracy

Adding a Value:

function add(digest, x, w=1):
    // Find nearest centroid to x
    nearest = find_nearest_centroid(digest, x)
    q = quantile_of(nearest, digest)

    // Check if merging would exceed size limit
    if nearest.count + w <= max_count(q, delta, N):
        // Merge into existing centroid
        nearest.mean = weighted_average(nearest.mean, nearest.count, x, w)
        nearest.count += w
    else:
        // Create new centroid
        insert_centroid(digest, Centroid(x, w))

    // Compress if too many centroids
    if len(digest.centroids) > threshold:
        compress(digest)

Quantile Query:

function quantile(digest, q):
    target = q * digest.total_count
    cumulative = 0
    for each centroid c in digest:
        if cumulative + c.count > target:
            // Interpolate within this centroid
            inner_q = (target - cumulative) / c.count
            // Interpolate between adjacent centroid means
            return interpolate(prev_centroid, c, next_centroid, inner_q)
        cumulative += c.count

Merge Operation:

function merge(digest_a, digest_b):
    // Combine all centroids and re-compress
    all_centroids = merge_sorted(digest_a.centroids, digest_b.centroids)
    result = new T-Digest(delta)
    // Process in random order to avoid bias
    shuffle(all_centroids)
    for each centroid c in all_centroids:
        result.add(c.mean, c.count)
    return result

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment