Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Duckdb Duckdb TDigest

From Leeroopedia


Knowledge Sources
Domains Approximate_Query_Processing, Third_Party
Last Updated 2026-02-07 12:00 GMT

Overview

TDigest is a data structure for accurate estimation of quantiles and cumulative distribution functions (CDFs) over streaming data, implemented as a C++ header-only library in the duckdb_tdigest namespace, enabling DuckDB to compute approximate percentiles with bounded memory.

Description

The TDigest algorithm maintains a sorted collection of weighted centroids that summarize the distribution of observed values. The implementation consists of two main classes:

  • Centroid -- Represents a cluster of nearby values by their weighted mean and total weight. Centroids are merged together using an online weighted mean update: mean_ += c.weight_ * (c.mean_ - mean_) / weight_.
  • TDigest -- The main data structure, parameterized by a compression factor (default: 1000) that controls the trade-off between accuracy and memory. Key characteristics:
    • Two-phase processing: New values are buffered in an unprocessed vector. When the buffer reaches capacity (maxUnprocessed_ = 8 * ceil(compression)), they are merged into the processed centroid list (maxProcessed_ = 2 * ceil(compression)).
    • High-water mark: A batch merge threshold (kHighWater = 40000) controls when multiple TDigests are merged together.
    • Quantile estimation: The quantile(q) method returns the estimated value at a given quantile (0.0 to 1.0) using linear interpolation between centroids, with special handling for the boundary centroids near min and max.
    • CDF computation: The cdf(x) method returns the estimated fraction of values less than or equal to x.
    • Mergeability: Multiple TDigests can be merged together via merge() or batch add(), enabling parallel and distributed quantile estimation.

The implementation maintains cumulative weight sums for efficient interpolation and tracks the global min and max values for boundary accuracy.

Usage

DuckDB uses TDigest to implement approximate quantile functions such as APPROX_QUANTILE and related aggregate/window functions. The TDigest provides accurate estimates especially at the tails of distributions (near 0th and 100th percentiles) while using bounded memory proportional to the compression parameter. This makes it suitable for computing percentiles over large datasets without materializing the full sorted sequence.

Code Reference

Source Location

Signature

namespace duckdb_tdigest {

using Value  = double;
using Weight = double;
using Index  = size_t;

class Centroid {
public:
    Centroid();
    Centroid(Value mean, Weight weight);
    Value  mean()   const noexcept;
    Weight weight() const noexcept;
    void   add(const Centroid &c);
};

class TDigest {
public:
    TDigest();                                            // compression = 1000
    explicit TDigest(Value compression);
    TDigest(Value compression, Index bufferSize);
    TDigest(Value compression, Index unmergedSize, Index mergedSize);
    TDigest(std::vector<Centroid> &&processed,
            std::vector<Centroid> &&unprocessed,
            Value compression, Index unmergedSize, Index mergedSize);

    // Add a single value (with optional weight)
    void add(Value x);
    bool add(Value x, Weight w);

    // Add centroids from iterators
    void add(std::vector<Centroid>::const_iterator iter,
             std::vector<Centroid>::const_iterator end);

    // Merge another TDigest
    void merge(const TDigest *other);
    void add(std::vector<const TDigest *> digests);

    // Query
    Value quantile(Value q);          // q in [0.0, 1.0]
    Value cdf(Value x);               // returns fraction <= x

    // Force processing of buffered values
    void compress();

    // Inspection
    const std::vector<Centroid> &processed()   const;
    const std::vector<Centroid> &unprocessed() const;
    Weight processedWeight()    const;
    Weight unprocessedWeight()  const;
    size_t totalSize()          const;
    long   totalWeight()        const;
    Value  compression()        const;
    bool   haveUnprocessed()    const;
    bool   isDirty();
};

} // namespace duckdb_tdigest

Import

#include "tdigest/t_digest.hpp"

I/O Contract

Inputs

Name Type Required Description
x Value (double) Yes (add) The value to add to the digest
w Weight (double) No Weight of the value (default: 1.0)
q Value (double) Yes (quantile) Quantile to estimate, in [0.0, 1.0]
compression Value (double) No Controls accuracy vs. memory trade-off (default: 1000); higher = more accurate
other const TDigest* Yes (merge) Another TDigest to merge into this one

Outputs

Name Type Description
return (quantile) Value (double) Estimated value at the given quantile; NaN if q is out of range or digest is empty
return (cdf) Value (double) Estimated cumulative probability in [0.0, 1.0] for the given value
return (add) bool True if the value was added (false if NaN was passed)

Usage Examples

#include "tdigest/t_digest.hpp"

// Create a TDigest with default compression (1000)
duckdb_tdigest::TDigest td;

// Add values
for (int i = 0; i < 10000; i++) {
    td.add(static_cast<double>(i));
}

// Query quantiles
double median = td.quantile(0.5);      // ~5000
double p95    = td.quantile(0.95);     // ~9500
double p99    = td.quantile(0.99);     // ~9900

// Query CDF
double cdf_val = td.cdf(5000.0);      // ~0.5

// Merge TDigests from parallel workers
duckdb_tdigest::TDigest td1, td2;
// ... populate td1 and td2 ...
td1.merge(&td2);

// Lower compression for less memory (less accurate)
duckdb_tdigest::TDigest td_small(100);

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment