Implementation:Duckdb Duckdb Brotli Encoder
| Knowledge Sources | |
|---|---|
| Domains | Compression, Third_Party |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
The Brotli Encoder module implements the complete Brotli compression algorithm, providing both one-shot and streaming compression interfaces with configurable quality levels (0-11), sliding window sizes, and multiple hash table strategies used by DuckDB to write Brotli-compressed Parquet data.
Description
This module consists of 19 source files that together form the full Brotli encoding pipeline. The major components are organized into the following functional areas:
Main Encoder (encode.cpp, 1990 lines): The central orchestrator of the compression process. It implements:
BrotliEncoderCreateInstance/BrotliEncoderDestroyInstance-- lifecycle management with custom allocator support.BrotliEncoderSetParameter-- configures quality (0-11), window size (10-24 bits, up to 30 in large window mode), input block size, compression mode (GENERIC, TEXT, FONT), literal context modeling, size hints, and distance coding parameters (NPOSTFIX, NDIRECT, STREAM_OFFSET).BrotliEncoderCompress-- one-shot memory-to-memory compression.BrotliEncoderCompressStream-- streaming compression with PROCESS, FLUSH, FINISH, and EMIT_METADATA operations.BrotliEncoderMaxCompressedSize-- calculates worst-case output size bound.BrotliEncoderIsFinished,BrotliEncoderHasMoreOutput,BrotliEncoderTakeOutput-- state queries and zero-copy output access.
Backward References (backward_references.cpp, 3775 lines; backward_references_hq.cpp, 935 lines):
Finds LZ77 matches in the input data. The standard module uses template-expanded hash functions (H2 through H65) for different quality levels and input characteristics. The high-quality module (_hq) implements the Zopfli-like optimal parsing algorithm used at quality levels 10-11, which considers the cost of encoding each match to find the globally optimal set of backward references.
Hash Tables (brotli_hash.h, 4352 lines):
Defines the HasherCommon struct and multiple hash table implementations:
- Hash tables are parameterized by bucket count, block size, and hash length.
HasherCommontracks setup state, dictionary lookup statistics, and hasher parameters.- Different hashers (H2, H3, H4, H5, H6, H40, H41, H42, H54, H55, H65, etc.) trade off memory usage and speed against match quality.
Block Splitter (block_splitter.cpp, 1653 lines):
Splits the input into block types for literals, insert-and-copy commands, and distances. Uses histogram-based clustering with configurable block switch costs (kLiteralBlockSwitchCost = 28.1, kCommandBlockSwitchCost = 13.5, kDistanceBlockSwitchCost = 14.6). Each block type uses its own Huffman tree, improving compression by adapting to local data characteristics.
Clustering (cluster.cpp, 1025 lines; cluster.h, 1017 lines): Implements histogram clustering algorithms that merge similar block type histograms to reduce the number of distinct Huffman trees. Uses a pair-cost matrix and iterative merging with entropy-based cost functions.
Metablock Construction (metablock.cpp, 1225 lines):
Distributes literals and commands between block types and contexts. Computes distance parameters (BrotliInitDistanceParams), builds context maps, and organizes data into metablocks -- the top-level encoding units in the Brotli format.
Bit Stream Writing (brotli_bit_stream.cpp, 1431 lines): Serializes the compressed data into the Brotli bitstream format. Encodes metablock headers, Huffman trees, block switch commands, literals, insert-and-copy lengths, and distances according to RFC 7932.
Entropy Encoding (entropy_encode.cpp, 500 lines; entropy_encode_static.h, 538 lines):
Builds Huffman trees from symbol frequency histograms. BrotliSetDepth computes code depths, and SortHuffmanTree orders nodes by frequency. The static header contains pre-computed Huffman codes for common distributions.
Bit Cost Estimation (bit_cost.cpp, 410 lines): Calculates the estimated bit cost of encoding symbols using a given histogram, used by the block splitter and clustering algorithms to make splitting and merging decisions.
Fragment Compression (compress_fragment.cpp, 796 lines; compress_fragment_two_pass.cpp, 653 lines): Fast compression paths for quality levels 0-1. The single-pass fragment compressor is optimized for speed, while the two-pass variant improves compression ratio slightly by making a preliminary pass to gather statistics.
Dictionary Hash (dictionary_hash.cpp, 1844 lines): Pre-computed hash values for all words in the static dictionary, enabling O(1) dictionary match lookups during encoding.
Encoder Dictionary (encoder_dict.cpp, 636 lines): Manages the encoder's dictionary state, including prepared dictionaries and compound dictionary support.
Static Dictionary Matching (static_dict.cpp, 538 lines; static_dict_lut.h, 5862 lines): Implements the algorithm to find the best matching word in the Brotli static dictionary for a given input substring. The lookup table header contains pre-computed data structures for efficient dictionary searching.
Public API (encode.h, 489 lines): Defines the public encoder interface including:
BrotliEncoderMode-- GENERIC, TEXT, FONT.BrotliEncoderOperation-- PROCESS, FLUSH, FINISH, EMIT_METADATA.BrotliEncoderParameter-- MODE, QUALITY, LGWIN, LGBLOCK, DISABLE_LITERAL_CONTEXT_MODELING, SIZE_HINT, LARGE_WINDOW, NPOSTFIX, NDIRECT, STREAM_OFFSET.- Quality range: 0 (fastest) to 11 (best compression).
- Window size range: 10-24 bits (standard), up to 30 bits (large window).
Usage
DuckDB uses the Brotli encoder to compress column chunks when writing Parquet files with Brotli compression. The typical usage path involves creating an encoder instance with BrotliEncoderCreateInstance, configuring quality and window size via BrotliEncoderSetParameter, and then feeding data through BrotliEncoderCompressStream with PROCESS operations followed by a FINISH operation. Alternatively, the one-shot BrotliEncoderCompress can be used for small, self-contained data blocks. The encoder operates within the duckdb_brotli namespace to avoid conflicts with system-installed Brotli libraries. Quality level selection allows DuckDB to trade compression speed against file size based on user preferences.
Code Reference
Source Location
- Repository: Duckdb_Duckdb
- File:
third_party/brotli/enc/encode.cpp(1990 lines) -- main encoder - File:
third_party/brotli/enc/backward_references.cpp(3775 lines) -- LZ77 match finding - File:
third_party/brotli/enc/backward_references_hq.cpp(935 lines) -- optimal parsing - File:
third_party/brotli/enc/bit_cost.cpp(410 lines) -- bit cost estimation - File:
third_party/brotli/enc/block_splitter.cpp(1653 lines) -- block type splitting - File:
third_party/brotli/enc/brotli_bit_stream.cpp(1431 lines) -- bitstream serialization - File:
third_party/brotli/enc/brotli_hash.h(4352 lines) -- hash table implementations - File:
third_party/brotli/enc/cluster.cpp(1025 lines) -- histogram clustering - File:
third_party/brotli/enc/cluster.h(1017 lines) -- clustering declarations - File:
third_party/brotli/enc/compress_fragment.cpp(796 lines) -- fast single-pass compression - File:
third_party/brotli/enc/compress_fragment_two_pass.cpp(653 lines) -- two-pass compression - File:
third_party/brotli/enc/dictionary_hash.cpp(1844 lines) -- dictionary hash data - File:
third_party/brotli/enc/encoder_dict.cpp(636 lines) -- encoder dictionary management - File:
third_party/brotli/enc/entropy_encode.cpp(500 lines) -- Huffman tree building - File:
third_party/brotli/enc/entropy_encode_static.h(538 lines) -- static Huffman codes - File:
third_party/brotli/enc/metablock.cpp(1225 lines) -- metablock construction - File:
third_party/brotli/enc/static_dict.cpp(538 lines) -- static dictionary matching - File:
third_party/brotli/enc/static_dict_lut.h(5862 lines) -- dictionary lookup tables - File:
third_party/brotli/include/brotli/encode.h(489 lines) -- public encode API
Signature
// Encoder mode (from encode.h)
typedef enum BrotliEncoderMode {
BROTLI_MODE_GENERIC = 0,
BROTLI_MODE_TEXT = 1,
BROTLI_MODE_FONT = 2
} BrotliEncoderMode;
namespace duckdb_brotli {
// Encoder operations
typedef enum BrotliEncoderOperation {
BROTLI_OPERATION_PROCESS = 0,
BROTLI_OPERATION_FLUSH = 1,
BROTLI_OPERATION_FINISH = 2,
BROTLI_OPERATION_EMIT_METADATA = 3
} BrotliEncoderOperation;
// Encoder parameters
typedef enum BrotliEncoderParameter {
BROTLI_PARAM_MODE = 0,
BROTLI_PARAM_QUALITY = 1,
BROTLI_PARAM_LGWIN = 2,
BROTLI_PARAM_LGBLOCK = 3,
BROTLI_PARAM_DISABLE_LITERAL_CONTEXT_MODELING = 4,
BROTLI_PARAM_SIZE_HINT = 5,
BROTLI_PARAM_LARGE_WINDOW = 6,
BROTLI_PARAM_NPOSTFIX = 7,
BROTLI_PARAM_NDIRECT = 8,
BROTLI_PARAM_STREAM_OFFSET = 9
} BrotliEncoderParameter;
// Opaque encoder state
typedef struct BrotliEncoderStateStruct BrotliEncoderState;
// Lifecycle
BROTLI_ENC_API BrotliEncoderState* BrotliEncoderCreateInstance(
brotli_alloc_func alloc_func,
brotli_free_func free_func,
void* opaque);
BROTLI_ENC_API void BrotliEncoderDestroyInstance(
BrotliEncoderState* state);
// Configuration
BROTLI_ENC_API BROTLI_BOOL BrotliEncoderSetParameter(
BrotliEncoderState* state,
BrotliEncoderParameter param,
uint32_t value);
// One-shot compression
BROTLI_ENC_API BROTLI_BOOL BrotliEncoderCompress(
int quality, int lgwin, BrotliEncoderMode mode,
size_t input_size,
const uint8_t input_buffer[],
size_t* encoded_size,
uint8_t encoded_buffer[]);
// Streaming compression
BROTLI_ENC_API BROTLI_BOOL BrotliEncoderCompressStream(
BrotliEncoderState* state,
BrotliEncoderOperation op,
size_t* available_in, const uint8_t** next_in,
size_t* available_out, uint8_t** next_out,
size_t* total_out);
// Output size estimation
BROTLI_ENC_API size_t BrotliEncoderMaxCompressedSize(
size_t input_size);
// State queries
BROTLI_ENC_API BROTLI_BOOL BrotliEncoderIsFinished(
BrotliEncoderState* state);
BROTLI_ENC_API BROTLI_BOOL BrotliEncoderHasMoreOutput(
BrotliEncoderState* state);
BROTLI_ENC_API const uint8_t* BrotliEncoderTakeOutput(
BrotliEncoderState* state, size_t* size);
// Prepared dictionary support
typedef struct BrotliEncoderPreparedDictionaryStruct
BrotliEncoderPreparedDictionary;
BROTLI_ENC_API BrotliEncoderPreparedDictionary*
BrotliEncoderPrepareDictionary(
BrotliSharedDictionaryType type,
size_t data_size,
const uint8_t data[],
int quality,
brotli_alloc_func alloc_func,
brotli_free_func free_func,
void* opaque);
BROTLI_ENC_API void BrotliEncoderDestroyPreparedDictionary(
BrotliEncoderPreparedDictionary* dictionary);
BROTLI_ENC_API BROTLI_BOOL BrotliEncoderAttachPreparedDictionary(
BrotliEncoderState* state,
const BrotliEncoderPreparedDictionary* dictionary);
// Internal structures (from brotli_hash.h)
typedef struct {
void* extra[4];
BROTLI_BOOL is_setup_;
size_t dict_num_lookups;
size_t dict_num_matches;
BrotliHasherParams params;
BROTLI_BOOL is_prepared_;
} HasherCommon;
// Distance code computation (from backward_references.cpp)
static BROTLI_INLINE size_t ComputeDistanceCode(
size_t distance,
size_t max_distance,
const int* dist_cache);
// Distance parameters (from metablock.cpp)
void BrotliInitDistanceParams(
BrotliDistanceParams* dist_params,
uint32_t npostfix, uint32_t ndirect,
BROTLI_BOOL large_window);
// Entropy encoding (from entropy_encode.cpp)
BROTLI_BOOL BrotliSetDepth(
int p0, HuffmanTree* pool,
uint8_t* depth, int max_depth);
} // namespace duckdb_brotli
Import
#include <brotli/encode.h>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| alloc_func | brotli_alloc_func |
No | Custom memory allocation function (NULL for default malloc) |
| free_func | brotli_free_func |
No | Custom memory free function (NULL for default free) |
| opaque | void* |
No | Opaque pointer passed to custom allocator functions |
| quality | int |
Yes | Compression quality (0 = fastest, 11 = best), default 11 |
| lgwin | int |
Yes | Log2 of sliding window size (10-24, up to 30 for large window), default 22 |
| mode | BrotliEncoderMode |
Yes | Input type hint: GENERIC, TEXT, or FONT |
| input_buffer | const uint8_t* |
Yes | Uncompressed input data buffer |
| input_size | size_t |
Yes | Size of uncompressed input data in bytes |
| available_in | size_t* |
Yes | Pointer to available input byte count (streaming API, in/out) |
| next_in | const uint8_t** |
Yes | Pointer to next input byte pointer (streaming API, in/out) |
| available_out | size_t* |
Yes | Pointer to available output space (streaming API, in/out) |
| next_out | uint8_t** |
Yes | Pointer to output buffer cursor (streaming API, in/out) |
| op | BrotliEncoderOperation |
Yes | Streaming operation: PROCESS, FLUSH, FINISH, or EMIT_METADATA |
Outputs
| Name | Type | Description |
|---|---|---|
| (return) | BROTLI_BOOL |
TRUE on success, FALSE on error (for compress and streaming APIs) |
| encoded_buffer | uint8_t* |
Buffer filled with compressed data (one-shot API) |
| encoded_size | size_t* |
Number of compressed bytes written (one-shot API, in/out) |
| total_out | size_t* |
Cumulative bytes compressed since initialization (streaming API) |
| next_out | uint8_t** |
Updated output cursor after compression (streaming API) |
| available_out | size_t* |
Remaining output buffer space (streaming API) |
Usage Examples
// One-shot compression in DuckDB context
#include <brotli/encode.h>
using namespace duckdb_brotli;
// Compress a Parquet column chunk with Brotli
int quality = BROTLI_DEFAULT_QUALITY; // 11
int lgwin = BROTLI_DEFAULT_WINDOW; // 22
BrotliEncoderMode mode = BROTLI_MODE_GENERIC;
size_t max_output = BrotliEncoderMaxCompressedSize(input_size);
uint8_t* compressed = (uint8_t*)malloc(max_output);
size_t compressed_size = max_output;
BROTLI_BOOL ok = BrotliEncoderCompress(
quality, lgwin, mode,
input_size, input_data,
&compressed_size, compressed);
if (!ok) {
// Handle compression error
}
// compressed_size now contains actual compressed size
// Streaming compression example
BrotliEncoderState* state = BrotliEncoderCreateInstance(
NULL, NULL, NULL);
BrotliEncoderSetParameter(state, BROTLI_PARAM_QUALITY, 6);
BrotliEncoderSetParameter(state, BROTLI_PARAM_LGWIN, 20);
size_t avail_in = input_size;
const uint8_t* next_in_ptr = input_data;
size_t avail_out = output_buffer_size;
uint8_t* next_out_ptr = output_buffer;
size_t total = 0;
// Process input
BrotliEncoderCompressStream(state,
BROTLI_OPERATION_PROCESS,
&avail_in, &next_in_ptr,
&avail_out, &next_out_ptr, &total);
// Finalize the stream
avail_in = 0;
BrotliEncoderCompressStream(state,
BROTLI_OPERATION_FINISH,
&avail_in, &next_in_ptr,
&avail_out, &next_out_ptr, &total);
while (BrotliEncoderHasMoreOutput(state)) {
size_t size = 0;
const uint8_t* output = BrotliEncoderTakeOutput(state, &size);
// Write output bytes to destination
}
BrotliEncoderDestroyInstance(state);