Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Duckdb Duckdb HyperLogLog

From Leeroopedia
Revision as of 14:50, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Duckdb_Duckdb_HyperLogLog.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)


Knowledge Sources
Domains Approximate_Query_Processing, Third_Party
Last Updated 2026-02-07 12:00 GMT

Overview

HyperLogLog is a probabilistic cardinality estimation algorithm, adapted from the Redis implementation by Salvatore Sanfilippo, that allows DuckDB to efficiently approximate the number of distinct elements in a dataset using fixed memory (approximately 12 KB per counter).

Description

The implementation resides in the duckdb_hll namespace and provides a C-style API operating on opaque robj structs (borrowed from Redis's object model). Key characteristics:

  • Dual representation: The HLL uses a sparse representation for counters with many zero registers (memory-efficient run-length encoding) and automatically converts to a dense representation (16,384 six-bit registers, 12 KB total) once the sparse representation exceeds HLL_SPARSE_MAX_BYTES (3000 bytes).
  • 64-bit hashing: Uses a 64-bit hash function as proposed by Heule, Nunkesser, and Hall, avoiding the need for bias correction near 2^32 that plagues 32-bit implementations.
  • 16,384 registers: Each register stores the maximum number of leading zeros observed for elements hashing to that register's bucket, yielding a standard error of approximately 0.81%.
  • SDS (Simple Dynamic Strings): The implementation uses Redis's SDS library (sds.cpp) internally for managing the variable-length binary data that backs the HLL structure.

The duckdb namespace provides two additional helper functions (AddToLogsInternal and AddToSingleLogInternal) that integrate the HLL with DuckDB's vectorized execution model, processing vectors of values in bulk.

Usage

DuckDB uses HyperLogLog to implement the APPROX_COUNT_DISTINCT aggregate function, which provides fast approximate distinct counts with bounded memory usage. This is particularly useful for large-scale analytics where an exact distinct count would be prohibitively expensive. The HLL structures are also mergeable, enabling parallel and distributed aggregation.

Code Reference

Source Location

Signature

namespace duckdb_hll {

struct robj {
    void *ptr;
};

// Lifecycle
robj *hll_create(void);
void  hll_destroy(robj *obj);

// Operations
int   hll_add(robj *o, unsigned char *ele, size_t elesize);
int   hll_count(robj *o, size_t *result);
robj *hll_merge(robj **hlls, size_t hll_count);

// Representation
int   hllSparseToDense(robj *o);

// Inspection
uint64_t get_size();
uint64_t num_registers();
uint8_t  maximum_zeros();
uint8_t  get_register(robj *o, size_t index);
void     set_register(robj *o, size_t index, uint8_t count);

} // namespace duckdb_hll

namespace duckdb {

// Vectorized helpers for DuckDB's execution engine
void AddToLogsInternal(UnifiedVectorFormat &vdata, idx_t count,
                       uint64_t indices[], uint8_t counts[],
                       void ***logs[], const SelectionVector *log_sel);

void AddToSingleLogInternal(UnifiedVectorFormat &vdata, idx_t count,
                            uint64_t indices[], uint8_t counts[], void *log);

} // namespace duckdb

Import

#include "hyperloglog/hyperloglog.hpp"

I/O Contract

Inputs

Name Type Required Description
o robj* Yes Pointer to an HLL object created by hll_create()
ele unsigned char* Yes (add) Pointer to the element bytes to add to the HLL
elesize size_t Yes (add) Size in bytes of the element to add
hlls robj** Yes (merge) Array of HLL object pointers to merge
hll_count size_t Yes (merge) Number of HLL objects in the merge array
index size_t Yes (register ops) Register index (0 to 16383)

Outputs

Name Type Description
return (hll_create) robj* Newly allocated empty HLL object
return (hll_add) int 1 if cardinality changed, 0 if unchanged, HLL_C_ERR (-1) on failure
result (hll_count) size_t* Estimated distinct count written to this pointer; function returns HLL_C_OK or HLL_C_ERR
return (hll_merge) robj* New merged HLL object, or NULL on failure

Usage Examples

#include "hyperloglog/hyperloglog.hpp"

// Create a new HyperLogLog counter
duckdb_hll::robj *hll = duckdb_hll::hll_create();

// Add elements to the HLL
const char *value = "hello";
int changed = duckdb_hll::hll_add(hll, (unsigned char *)value, strlen(value));

// Get the estimated distinct count
size_t count;
int rc = duckdb_hll::hll_count(hll, &count);

// Merge multiple HLLs (e.g., from parallel workers)
duckdb_hll::robj *hll2 = duckdb_hll::hll_create();
duckdb_hll::robj *hlls[] = {hll, hll2};
duckdb_hll::robj *merged = duckdb_hll::hll_merge(hlls, 2);

// Cleanup
duckdb_hll::hll_destroy(hll);
duckdb_hll::hll_destroy(hll2);
duckdb_hll::hll_destroy(merged);

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment