Implementation:Duckdb Duckdb Brotli Dictionary

From Leeroopedia


Knowledge Sources
Domains Compression, Third_Party
Last Updated 2026-02-07 12:00 GMT

Overview

The Brotli Dictionary module provides the static dictionary data defined by RFC 7932, together with shared dictionary management facilities, so that the Brotli encoder and decoder can reference commonly occurring strings for improved compression ratios.

Description

This module consists of two source files and one header that together manage Brotli's dictionary subsystem:

dictionary.cpp (5912 lines) contains the static dictionary data as a large compile-time byte array kBrotliDictionaryData[]. This array holds the 122,784 bytes of the RFC 7932 static dictionary: 13,504 commonly occurring words and substrings drawn from natural language text, HTML, CSS, JavaScript, and other web content. The data is organized into word lists bucketed by word length (4 to 24 bytes). The bucket for length n holds 1 << size_bits_by_length[n] words, where size_bits_by_length[32] is an array in the BrotliDictionary struct and a value of 0 marks an empty bucket. The file provides two key functions:

  • BrotliGetDictionary() -- returns a pointer to the singleton BrotliDictionary struct with the default RFC 7932 dictionary data.
  • BrotliSetDictionaryData() -- allows providing external dictionary data (used in multi-client environments to share dictionary memory).
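
The relationship between index bits and bucket sizes can be checked with a small standalone sketch. The table below is copied from the NDBITS list in RFC 7932, Section 8; the names kSketchSizeBits, SketchWordsInBucket, and SketchTotalWords are hypothetical and do not appear in the DuckDB or Brotli sources.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Index bits per word length (0..24), copied from the NDBITS table in
// RFC 7932, Section 8. A value of 0 means "no words of this length".
static const uint8_t kSketchSizeBits[25] = {
    0, 0, 0, 0, 10, 10, 11, 11, 10, 10, 10, 10, 10,
    9, 9, 8, 7, 7, 8, 7, 7, 6, 6, 5, 5};

// Hypothetical helper: number of words in the bucket for `length`.
static size_t SketchWordsInBucket(size_t length) {
    assert(length < 25);
    const uint8_t bits = kSketchSizeBits[length];
    return bits == 0 ? 0 : (size_t(1) << bits);
}

// Hypothetical helper: total number of words across all buckets.
static size_t SketchTotalWords() {
    size_t total = 0;
    for (size_t length = 0; length < 25; ++length) {
        total += SketchWordsInBucket(length);
    }
    return total;
}
```

Under these assumptions, the bucket for length 6 holds 2,048 words and the buckets together hold 13,504 words.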

shared_dictionary.cpp (517 lines) implements the shared/compound dictionary functionality introduced in later Brotli revisions. It allows attaching custom dictionaries (beyond the built-in static dictionary) to encoder or decoder instances. Key operations include:

  • Parsing serialized dictionary formats with word lists, transforms, and context maps.
  • Reading encoded booleans and integers from the serialized byte stream with bounds checking (ReadBool, ReadUint8, ReadUint16, ReadBignum32).
  • Computing word list offsets from size bits via BrotliSizeBitsToOffsets.
  • Parsing word lists and transforms from encoded shared dictionary data (ParseWordList).
  • Managing BrotliSharedDictionary instances through create/attach/destroy lifecycle.
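
The read helpers listed above all share one shape: an input buffer, its size, an in/out cursor, and an out-parameter for the parsed value. The following is a minimal sketch of that pattern, not the actual implementation. The names SketchReadUint8 and SketchReadUint16 are hypothetical, bool stands in for BROTLI_BOOL, and the little-endian byte order is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical reader: one byte. Advances *pos only on success.
static bool SketchReadUint8(const uint8_t* encoded, size_t size,
                            size_t* pos, uint8_t* result) {
    assert(encoded != nullptr && pos != nullptr && result != nullptr);
    if (*pos >= size) return false;  // nothing left to read
    *result = encoded[(*pos)++];
    return true;
}

// Hypothetical reader: two bytes, assumed little-endian.
static bool SketchReadUint16(const uint8_t* encoded, size_t size,
                             size_t* pos, uint16_t* result) {
    assert(encoded != nullptr && pos != nullptr && result != nullptr);
    if (*pos > size || size - *pos < 2) return false;  // fewer than 2 bytes left
    *result = static_cast<uint16_t>(encoded[*pos] | (encoded[*pos + 1] << 8));
    *pos += 2;
    return true;
}
```

Because the cursor is left untouched on failure, a caller can chain reads and stop at the first false return without tracking how far a failed read progressed.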

The BrotliDictionary struct is defined in dictionary.h with the following layout:

  • size_bits_by_length[32] -- number of index bits per word length bucket (0 means no words of that length).
  • offsets_by_length[32] -- byte offsets into the data array for each length bucket.
  • data_size -- total size of the dictionary data.
  • data -- pointer to the raw dictionary byte array.
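
The two arrays in this layout are related by a running sum: each bucket starts where the previous one ends, and a bucket for length n with b index bits occupies n << b bytes. The sketch below illustrates that computation under those assumptions. SketchSizeBitsToOffsets and kSketchLengths are hypothetical names, and the code is not a copy of BrotliSizeBitsToOffsets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the length of the struct's two arrays.
constexpr size_t kSketchLengths = 32;

// Fills offsets[0..31] with the starting byte offset of each length bucket
// and returns the total number of bytes covered by all buckets.
static size_t SketchSizeBitsToOffsets(const uint8_t* size_bits,
                                      uint32_t* offsets) {
    assert(size_bits != nullptr && offsets != nullptr);
    uint32_t pos = 0;
    for (size_t length = 0; length < kSketchLengths; ++length) {
        offsets[length] = pos;
        if (size_bits[length] != 0) {
            // 2^bits words of `length` bytes each, stored back to back.
            pos += static_cast<uint32_t>(length) << size_bits[length];
        }
    }
    return pos;
}
```

For a toy table with 1 index bit for length 4 and 2 index bits for length 5, the length-4 bucket starts at offset 0, the length-5 bucket at offset 8, and the total is 28 bytes. Applied to the NDBITS table from RFC 7932, the same sum gives 122,784 bytes.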

Usage

DuckDB uses Brotli compression for Parquet file I/O. The static dictionary is central to Brotli's compression efficiency: during encoding, the compressor searches the dictionary for matches and emits references (word index + transform) instead of literal bytes. During decoding, these references are resolved back to the original text using the same dictionary data. The shared dictionary module supports custom dictionaries for specialized use cases, though DuckDB primarily relies on the default RFC 7932 static dictionary. All dictionary operations are wrapped in the duckdb_brotli namespace to avoid symbol conflicts with system-installed Brotli libraries.
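
RFC 7932 (Section 8) packs the word index and the transform into a single word id: the low NDBITS[length] bits select the word within its length bucket, and the remaining high bits select the transform. A minimal sketch of that split follows; SketchDictRef and SketchSplitWordId are hypothetical names, not functions from the DuckDB or Brotli sources.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type for a resolved dictionary reference.
struct SketchDictRef {
    uint32_t word_index;    // position of the word inside its length bucket
    uint32_t transform_id;  // which transform to apply to the base word
};

// Splits a word id as described in RFC 7932, Section 8:
//   index        = word_id % (1 << size_bits)
//   transform_id = word_id >> size_bits
static SketchDictRef SketchSplitWordId(uint32_t word_id, uint8_t size_bits) {
    assert(size_bits > 0 && size_bits < 32);
    const uint32_t mask = (uint32_t(1) << size_bits) - 1;
    return SketchDictRef{word_id & mask, word_id >> size_bits};
}
```

For example, with 10 index bits, a word id of 3 * 1024 + 42 resolves to word index 42 and transform 3.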

Code Reference

Source Location

  • Repository: Duckdb_Duckdb
  • File: third_party/brotli/common/dictionary.cpp (5912 lines) -- static dictionary data
  • File: third_party/brotli/common/shared_dictionary.cpp (517 lines) -- shared dictionary management
  • File: third_party/brotli/common/dictionary.h -- dictionary struct definition

Signature

// dictionary.h - BrotliDictionary struct
namespace duckdb_brotli {

typedef struct BrotliDictionary {
    uint8_t size_bits_by_length[32];
    uint32_t offsets_by_length[32];
    size_t data_size;
    const uint8_t* data;
} BrotliDictionary;

// Get the singleton default (RFC 7932) dictionary
BROTLI_COMMON_API const BrotliDictionary* BrotliGetDictionary(void);

// Set external dictionary data (for shared memory scenarios)
BROTLI_COMMON_API void BrotliSetDictionaryData(const uint8_t* data);

#define BROTLI_MIN_DICTIONARY_WORD_LENGTH 4
#define BROTLI_MAX_DICTIONARY_WORD_LENGTH 24

// shared_dictionary.cpp - internal helpers
static BROTLI_BOOL ReadBool(const uint8_t* encoded, size_t size,
    size_t* pos, BROTLI_BOOL* result);
static BROTLI_BOOL ReadUint8(const uint8_t* encoded, size_t size,
    size_t* pos, uint8_t* result);
static BROTLI_BOOL ReadUint16(const uint8_t* encoded, size_t size,
    size_t* pos, uint16_t* result);
static BROTLI_BOOL ReadBignum32(const uint8_t* encoded, size_t size,
    size_t* pos, uint32_t* result);
static size_t BrotliSizeBitsToOffsets(
    const uint8_t* size_bits_by_length,
    uint32_t* offsets_by_length);
static BROTLI_BOOL ParseWordList(size_t size, const uint8_t* encoded,
    size_t* pos, BrotliDictionary* out);

} // namespace duckdb_brotli

Import

#include "../common/dictionary.h"
#include <brotli/shared_dictionary.h>

I/O Contract

Inputs

Name | Type | Required | Description
data | const uint8_t* | No | External dictionary data pointer for BrotliSetDictionaryData
encoded | const uint8_t* | Yes | Serialized shared dictionary byte stream for parsing functions
size | size_t | Yes | Length of the encoded shared dictionary data
pos | size_t* | Yes | Current read position in the encoded data (in/out parameter)
size_bits_by_length | const uint8_t* | Yes | Array of 32 entries specifying bits per word length for offset computation
Outputs

Name | Type | Description
(return) | const BrotliDictionary* | Pointer to the singleton static dictionary struct from BrotliGetDictionary
out | BrotliDictionary* | Populated dictionary struct from ParseWordList
result | BROTLI_BOOL* / uint8_t* / uint16_t* / uint32_t* | Parsed value from shared dictionary read functions
offsets_by_length | uint32_t* | Computed byte offset array from BrotliSizeBitsToOffsets (return value is total word list length)

Usage Examples

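// Accessing the static dictionary during decompression
#include <cstddef>
#include <cstdint>
#include "../common/dictionary.h"

using namespace duckdb_brotli;

// Returns a pointer to the word_length raw bytes of the requested word
// (not NUL-terminated), or nullptr if the length or index is out of range.
const uint8_t* LookupWord(size_t word_length, size_t word_index) {
    // Get the default RFC 7932 dictionary
    const BrotliDictionary* dict = BrotliGetDictionary();
    if (word_length < BROTLI_MIN_DICTIONARY_WORD_LENGTH ||
        word_length > BROTLI_MAX_DICTIONARY_WORD_LENGTH) {
        return nullptr;
    }
    // A bucket with b index bits holds 2^b words; b == 0 means it is empty.
    const uint8_t bits = dict->size_bits_by_length[word_length];
    if (bits == 0 || word_index >= (size_t(1) << bits)) {
        return nullptr;
    }
    // Words of one length are stored back to back, so the stride is word_length.
    const size_t offset =
        dict->offsets_by_length[word_length] + word_index * word_length;
    return &dict->data[offset];
}

// Look up a word of length 6 at index 42:
//     const uint8_t* word = LookupWord(6, 42);

// During encoding, dictionary matches are found by hashing input
// and comparing against dictionary words of each supported length.
// The encoder emits (word_index, transform_index) pairs when a
// dictionary match yields better compression than LZ77 references.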
