Implementation:Duckdb Duckdb TDigest
| Knowledge Sources | |
|---|---|
| Domains | Approximate_Query_Processing, Third_Party |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
TDigest is a data structure for accurate estimation of quantiles and cumulative distribution functions (CDFs) over streaming data, implemented as a C++ header-only library in the duckdb_tdigest namespace, enabling DuckDB to compute approximate percentiles with bounded memory.
Description
The TDigest algorithm maintains a sorted collection of weighted centroids that summarize the distribution of observed values. The implementation consists of two main classes:
- Centroid -- Represents a cluster of nearby values by their weighted mean and total weight. Centroids are merged together using an online weighted mean update:
mean_ += c.weight_ * (c.mean_ - mean_) / weight_.
- TDigest -- The main data structure, parameterized by a compression factor (default: 1000) that controls the trade-off between accuracy and memory. Key characteristics:
- Two-phase processing: New values are buffered in an unprocessed vector. When the buffer reaches capacity (
maxUnprocessed_ = 8 * ceil(compression)), they are merged into the processed centroid list (maxProcessed_ = 2 * ceil(compression)). - High-water mark: A batch merge threshold (
kHighWater = 40000) controls when multiple TDigests are merged together. - Quantile estimation: The
quantile(q)method returns the estimated value at a given quantile (0.0 to 1.0) using linear interpolation between centroids, with special handling for the boundary centroids near min and max. - CDF computation: The
cdf(x)method returns the estimated fraction of values less than or equal to x. - Mergeability: Multiple TDigests can be merged together via
merge()or batchadd(), enabling parallel and distributed quantile estimation.
- Two-phase processing: New values are buffered in an unprocessed vector. When the buffer reaches capacity (
The implementation maintains cumulative weight sums for efficient interpolation and tracks the global min and max values for boundary accuracy.
Usage
DuckDB uses TDigest to implement approximate quantile functions such as APPROX_QUANTILE and related aggregate/window functions. The TDigest provides accurate estimates especially at the tails of distributions (near 0th and 100th percentiles) while using bounded memory proportional to the compression parameter. This makes it suitable for computing percentiles over large datasets without materializing the full sorted sequence.
Code Reference
Source Location
- Repository: Duckdb_Duckdb
- Files:
Signature
namespace duckdb_tdigest {
using Value = double;
using Weight = double;
using Index = size_t;
class Centroid {
public:
Centroid();
Centroid(Value mean, Weight weight);
Value mean() const noexcept;
Weight weight() const noexcept;
void add(const Centroid &c);
};
class TDigest {
public:
TDigest(); // compression = 1000
explicit TDigest(Value compression);
TDigest(Value compression, Index bufferSize);
TDigest(Value compression, Index unmergedSize, Index mergedSize);
TDigest(std::vector<Centroid> &&processed,
std::vector<Centroid> &&unprocessed,
Value compression, Index unmergedSize, Index mergedSize);
// Add a single value (with optional weight)
void add(Value x);
bool add(Value x, Weight w);
// Add centroids from iterators
void add(std::vector<Centroid>::const_iterator iter,
std::vector<Centroid>::const_iterator end);
// Merge another TDigest
void merge(const TDigest *other);
void add(std::vector<const TDigest *> digests);
// Query
Value quantile(Value q); // q in [0.0, 1.0]
Value cdf(Value x); // returns fraction <= x
// Force processing of buffered values
void compress();
// Inspection
const std::vector<Centroid> &processed() const;
const std::vector<Centroid> &unprocessed() const;
Weight processedWeight() const;
Weight unprocessedWeight() const;
size_t totalSize() const;
long totalWeight() const;
Value compression() const;
bool haveUnprocessed() const;
bool isDirty();
};
} // namespace duckdb_tdigest
Import
#include "tdigest/t_digest.hpp"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| x | Value (double) | Yes (add) | The value to add to the digest |
| w | Weight (double) | No | Weight of the value (default: 1.0) |
| q | Value (double) | Yes (quantile) | Quantile to estimate, in [0.0, 1.0] |
| compression | Value (double) | No | Controls accuracy vs. memory trade-off (default: 1000); higher = more accurate |
| other | const TDigest* | Yes (merge) | Another TDigest to merge into this one |
Outputs
| Name | Type | Description |
|---|---|---|
| return (quantile) | Value (double) | Estimated value at the given quantile; NaN if q is out of range or digest is empty |
| return (cdf) | Value (double) | Estimated cumulative probability in [0.0, 1.0] for the given value |
| return (add) | bool | True if the value was added (false if NaN was passed) |
Usage Examples
#include "tdigest/t_digest.hpp"
// Create a TDigest with default compression (1000)
duckdb_tdigest::TDigest td;
// Add values
for (int i = 0; i < 10000; i++) {
td.add(static_cast<double>(i));
}
// Query quantiles
double median = td.quantile(0.5); // ~5000
double p95 = td.quantile(0.95); // ~9500
double p99 = td.quantile(0.99); // ~9900
// Query CDF
double cdf_val = td.cdf(5000.0); // ~0.5
// Merge TDigests from parallel workers
duckdb_tdigest::TDigest td1, td2;
// ... populate td1 and td2 ...
td1.merge(&td2);
// Lower compression for less memory (less accurate)
duckdb_tdigest::TDigest td_small(100);