Implementation:Duckdb Duckdb Brotli Dictionary
| Knowledge Sources | |
|---|---|
| Domains | Compression, Third_Party |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
The Brotli Dictionary module provides the static dictionary data defined by RFC 7932, together with shared dictionary management facilities, enabling the Brotli encoder and decoder to reference commonly occurring strings for improved compression ratios.
Description
This module consists of two source files and one header that together manage Brotli's dictionary subsystem:
dictionary.cpp (5912 lines) contains the static dictionary data as a large compile-time byte array kBrotliDictionaryData[]. This 122,784-byte array holds 13,504 commonly occurring words and substrings drawn from natural language text, HTML, CSS, JavaScript, and other web content. The data is organized into word lists bucketed by word length (4 to 24 bytes). The bucket for length n holds 2^size_bits_by_length[n] words, as recorded in the size_bits_by_length[32] array of the BrotliDictionary struct. The file provides two key functions:
- BrotliGetDictionary() -- returns a pointer to the singleton BrotliDictionary struct with the default RFC 7932 dictionary data.
- BrotliSetDictionaryData() -- allows providing external dictionary data (used in multi-client environments to share dictionary memory).
shared_dictionary.cpp (517 lines) implements the shared/compound dictionary functionality introduced in later Brotli revisions. It allows attaching custom dictionaries (beyond the built-in static dictionary) to encoder or decoder instances. Key operations include:
- Parsing serialized dictionary formats with word lists, transforms, and context maps.
- Reading encoded booleans and integers with bounds checking (ReadBool, ReadUint8, ReadUint16, ReadBignum32).
- Computing word list offsets from size bits via BrotliSizeBitsToOffsets.
- Parsing word lists and transforms from encoded shared dictionary data (ParseWordList).
- Managing BrotliSharedDictionary instances through a create/attach/destroy lifecycle.
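The read helpers listed above share one shape: check that enough bytes remain, decode the value, and advance the cursor. The following is a minimal sketch of that pattern, not the code in shared_dictionary.cpp. The names SketchReadUint8 and SketchReadUint16 are hypothetical, and little-endian byte order for the 16-bit value is an assumption.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for the static read helpers. Each returns false
// and leaves *pos untouched if the read would run past the buffer end.
static bool SketchReadUint8(const uint8_t* encoded, size_t size,
                            size_t* pos, uint8_t* result) {
    if (*pos >= size) return false;
    *result = encoded[*pos];
    *pos += 1;
    return true;
}

// Assumption: the 16-bit value is stored little-endian.
static bool SketchReadUint16(const uint8_t* encoded, size_t size,
                             size_t* pos, uint16_t* result) {
    if (size < 2 || *pos > size - 2) return false;
    *result = static_cast<uint16_t>(encoded[*pos] |
                                    (encoded[*pos + 1] << 8));
    *pos += 2;
    return true;
}
```

Passing the cursor as an in/out pointer lets a parser chain reads and stop at the first failure without tracking the position separately.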
The BrotliDictionary struct is defined in dictionary.h with the following layout:
- size_bits_by_length[32] -- number of index bits per word length bucket (0 means no words of that length).
- offsets_by_length[32] -- byte offsets into the data array for each length bucket.
- data_size -- total size of the dictionary data.
- data -- pointer to the raw dictionary byte array.
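To make the relationship between size_bits_by_length and offsets_by_length concrete, the sketch below derives the offsets from the size bits. It is an illustration under stated assumptions rather than the DuckDB source: SketchSizeBitsToOffsets and kSketchSizeBits are hypothetical names, and the bit counts are the NDBITS values listed in RFC 7932.

```cpp
#include <cstddef>
#include <cstdint>

// Index bits per word length from RFC 7932 (NDBITS). Lengths 0..3 and
// 25..31 have no words.
static const uint8_t kSketchSizeBits[32] = {
    0, 0, 0, 0, 10, 10, 11, 11, 10, 10, 10, 10, 10, 9, 9, 8,
    7, 7, 8, 7, 7, 6, 6, 5, 5, 0, 0, 0, 0, 0, 0, 0};

// Hypothetical helper: stores in offsets[len] the byte offset of the first
// word of that length and returns the total number of bytes required.
static size_t SketchSizeBitsToOffsets(const uint8_t* size_bits,
                                      uint32_t* offsets) {
    uint32_t pos = 0;
    for (uint32_t len = 0; len < 32; len++) {
        offsets[len] = pos;
        if (size_bits[len] != 0) {
            // The bucket holds 2^bits words of len bytes each.
            pos += len << size_bits[len];
        }
    }
    return pos;
}
```

With the RFC 7932 bit counts the returned total is 122,784 bytes, which matches the size of the static dictionary.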
Usage
DuckDB uses Brotli compression for Parquet file I/O. The static dictionary is central to Brotli's compression efficiency: during encoding, the compressor searches the dictionary for matches and emits references (word index + transform) instead of literal bytes. During decoding, these references are resolved back to the original text using the same dictionary data. The shared dictionary module supports custom dictionaries for specialized use cases, though DuckDB primarily relies on the default RFC 7932 static dictionary. All dictionary operations are wrapped in the duckdb_brotli namespace to avoid symbol conflicts with system-installed Brotli libraries.
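RFC 7932 signals a dictionary reference with a copy distance larger than the maximum allowed backward distance. The excess forms a word_id whose low bits select the word within its length bucket and whose remaining high bits select the transform. The sketch below shows only that split. SketchDictRef and SketchSplitWordId are hypothetical names, not part of the Brotli API.

```cpp
#include <cstdint>

// Hypothetical result type: which word, and which transform to apply.
struct SketchDictRef {
    uint32_t word_index;
    uint32_t transform_id;
};

// Splits a word_id for a bucket with size_bits index bits, following
// RFC 7932: index = word_id % 2^size_bits, transform = word_id >> size_bits.
static SketchDictRef SketchSplitWordId(uint32_t word_id, uint8_t size_bits) {
    SketchDictRef ref;
    ref.word_index = word_id & ((1u << size_bits) - 1u);
    ref.transform_id = word_id >> size_bits;
    return ref;
}
```

For a 6-byte word, whose bucket uses 11 index bits, a word_id of (3 << 11) + 42 resolves to word 42 with transform 3.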
Code Reference
Source Location
- Repository: Duckdb_Duckdb
- File: third_party/brotli/common/dictionary.cpp (5912 lines) -- static dictionary data
- File: third_party/brotli/common/shared_dictionary.cpp (517 lines) -- shared dictionary management
- File: third_party/brotli/common/dictionary.h -- dictionary struct definition
Signature
// dictionary.h - BrotliDictionary struct
namespace duckdb_brotli {

typedef struct BrotliDictionary {
  uint8_t size_bits_by_length[32];
  uint32_t offsets_by_length[32];
  size_t data_size;
  const uint8_t* data;
} BrotliDictionary;

// Get the singleton default (RFC 7932) dictionary
BROTLI_COMMON_API const BrotliDictionary* BrotliGetDictionary(void);

// Set external dictionary data (for shared memory scenarios)
BROTLI_COMMON_API void BrotliSetDictionaryData(const uint8_t* data);

#define BROTLI_MIN_DICTIONARY_WORD_LENGTH 4
#define BROTLI_MAX_DICTIONARY_WORD_LENGTH 24

// shared_dictionary.cpp - internal helpers
static BROTLI_BOOL ReadBool(const uint8_t* encoded, size_t size,
                            size_t* pos, BROTLI_BOOL* result);
static BROTLI_BOOL ReadUint8(const uint8_t* encoded, size_t size,
                             size_t* pos, uint8_t* result);
static BROTLI_BOOL ReadUint16(const uint8_t* encoded, size_t size,
                              size_t* pos, uint16_t* result);
static BROTLI_BOOL ReadBignum32(const uint8_t* encoded, size_t size,
                                size_t* pos, uint32_t* result);
static size_t BrotliSizeBitsToOffsets(
    const uint8_t* size_bits_by_length,
    uint32_t* offsets_by_length);
static BROTLI_BOOL ParseWordList(size_t size, const uint8_t* encoded,
                                 size_t* pos, BrotliDictionary* out);

}  // namespace duckdb_brotli
Import
#include "../common/dictionary.h"
#include <brotli/shared_dictionary.h>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data | const uint8_t* | No | External dictionary data pointer for BrotliSetDictionaryData |
| encoded | const uint8_t* | Yes | Serialized shared dictionary byte stream for parsing functions |
| size | size_t | Yes | Length of the encoded shared dictionary data |
| pos | size_t* | Yes | Current read position in the encoded data (in/out parameter) |
| size_bits_by_length | const uint8_t* | Yes | Array of 32 entries specifying bits per word length for offset computation |
Outputs
| Name | Type | Description |
|---|---|---|
| (return) | const BrotliDictionary* | Pointer to the singleton static dictionary struct from BrotliGetDictionary |
| out | BrotliDictionary* | Populated dictionary struct from ParseWordList |
| result | BROTLI_BOOL* / uint8_t* / uint16_t* / uint32_t* | Parsed value from shared dictionary read functions |
| offsets_by_length | uint32_t* | Computed byte offset array from BrotliSizeBitsToOffsets (return value is total word list length) |
Usage Examples
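// Accessing the static dictionary during decompression
#include "../common/dictionary.h"
using namespace duckdb_brotli;

// Illustrative helper (not part of the Brotli API): returns the raw,
// untransformed word, or nullptr if the bucket is empty or the index
// is out of range.
static const uint8_t* LookupWord(size_t word_length, size_t word_index) {
  // Get the default RFC 7932 dictionary
  const BrotliDictionary* dict = BrotliGetDictionary();
  if (word_length > BROTLI_MAX_DICTIONARY_WORD_LENGTH) return nullptr;
  // A bucket holds 2^size_bits_by_length[len] words; 0 bits means none.
  const uint8_t bits = dict->size_bits_by_length[word_length];
  if (bits == 0 || word_index >= (size_t(1) << bits)) return nullptr;
  // Words of one length are stored back to back in the data array.
  const size_t offset =
      dict->offsets_by_length[word_length] + word_index * word_length;
  return &dict->data[offset];
}

// Look up a word of length 6 at index 42
const uint8_t* word = LookupWord(6, 42);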
// During encoding, dictionary matches are found by hashing input
// and comparing against dictionary words of each supported length.
// The encoder emits (word_index, transform_index) pairs when a
// dictionary match yields better compression than LZ77 references.