Principle:Duckdb Duckdb Quantile Estimation
| Knowledge Sources | |
|---|---|
| Domains | Probabilistic_Data_Structure, Statistics |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
A streaming algorithm for estimating quantiles and percentiles of a data distribution using a compact, mergeable digest structure that provides higher accuracy at the tails of the distribution.
Description
The T-Digest is a data structure and algorithm for computing approximate quantiles over streaming or distributed data, designed by Ted Dunning. It addresses a fundamental challenge in data analytics: computing exact quantiles requires storing all data points (or at least a sorted representation), which is infeasible for large datasets. T-Digest provides a compact summary that typically uses only a few kilobytes of memory regardless of the input size.
T-Digest works by maintaining a sorted collection of centroids, where each centroid represents a cluster of nearby data values with a count of how many values it represents. The key innovation is its scale function, which controls how centroids are allowed to merge. The scale function permits larger centroids in the middle of the distribution but forces smaller centroids near the tails (q near 0 or 1). This non-uniform resolution ensures that extreme quantiles (like p99 or p99.9) are estimated with higher accuracy than middle quantiles, which is exactly what most monitoring and analytics applications need.
The algorithm supports three key operations: adding individual values (or weighted values), merging two T-Digests into one, and querying for a quantile value. The merge operation makes T-Digest naturally suitable for distributed computation, where each partition computes a local digest that is later merged into a global result.
Usage
T-Digest is used in DuckDB to implement the `APPROX_QUANTILE` aggregate function. This provides approximate percentile calculations over large datasets without requiring the full dataset to be sorted or stored in memory. It is valuable for analytical queries computing p50, p95, p99 latency percentiles, distribution analysis, and exploratory statistics on columns too large for exact quantile computation.
Theoretical Basis
Centroid Representation: The digest is a sorted list of centroids:
Centroid = (mean, count)
T-Digest = sorted list of centroids [c1, c2, ..., cn]
where c1.mean <= c2.mean <= ... <= cn.mean
Total count N = sum of all centroid counts
Scale Function: Controls centroid size by quantile position:
// Scale function k(q, delta) maps quantile q in [0,1] to scale space
// k1(q) = (delta / 2pi) * arcsin(2q - 1) (default)
// This allows smaller centroids at tails, larger in middle
// Maximum centroid size at quantile q:
max_count(q, delta, N) proportional to:
(4 * N / delta) * q * (1 - q)
// At q=0.5 (median): largest centroids
// At q=0 or q=1 (tails): smallest centroids -> highest accuracy
Adding a Value:
function add(digest, x, w=1):
// Find nearest centroid to x
nearest = find_nearest_centroid(digest, x)
q = quantile_of(nearest, digest)
// Check if merging would exceed size limit
if nearest.count + w <= max_count(q, delta, N):
// Merge into existing centroid
nearest.mean = weighted_average(nearest.mean, nearest.count, x, w)
nearest.count += w
else:
// Create new centroid
insert_centroid(digest, Centroid(x, w))
// Compress if too many centroids
if len(digest.centroids) > threshold:
compress(digest)
Quantile Query:
function quantile(digest, q):
target = q * digest.total_count
cumulative = 0
for each centroid c in digest:
if cumulative + c.count > target:
// Interpolate within this centroid
inner_q = (target - cumulative) / c.count
// Interpolate between adjacent centroid means
return interpolate(prev_centroid, c, next_centroid, inner_q)
cumulative += c.count
Merge Operation:
function merge(digest_a, digest_b):
// Combine all centroids and re-compress
all_centroids = merge_sorted(digest_a.centroids, digest_b.centroids)
result = new T-Digest(delta)
// Process in random order to avoid bias
shuffle(all_centroids)
for each centroid c in all_centroids:
result.add(c.mean, c.count)
return result