Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Duckdb Duckdb Brotli Transform

From Leeroopedia


Knowledge Sources
Domains Compression, Third_Party
Last Updated 2026-02-07 12:00 GMT

Overview

The Brotli Transform module implements the word transformation algorithm defined in RFC 7932, which applies prefix/suffix additions, character omissions, case changes, and Unicode scalar shifts to static dictionary words during compression and decompression. (MANUAL_REVIEW APPROVED)

Description

transform.cpp (287 lines) defines the 121 word transforms specified by the Brotli format standard. Each transform is a triplet of (prefix_id, transform_type, suffix_id) that modifies a static dictionary word to produce a derived string. This mechanism dramatically increases the effective dictionary size without storing additional word data.

The module contains:

Static data tables:

  • kPrefixSuffix[217] -- a packed string containing all prefix and suffix fragments (e.g., " ", ", ", " of the ", " and ", ".com/", " the ", etc.). The first byte of each fragment encodes its length.
  • kPrefixSuffixMap[50] -- an index array mapping prefix/suffix IDs to byte offsets within kPrefixSuffix.
  • kTransformsData[] -- a 363-byte array (121 transforms x 3 bytes each) encoding the transform triplets.
  • kBrotliTransforms -- the singleton BrotliTransforms struct that aggregates all the above data, including the cutOffTransforms optimization array.

Transform types (defined in transform.h as BrotliWordTransformType enum):

  • BROTLI_TRANSFORM_IDENTITY (0) -- copy word as-is
  • BROTLI_TRANSFORM_OMIT_LAST_1 through BROTLI_TRANSFORM_OMIT_LAST_9 (1-9) -- truncate last N bytes
  • BROTLI_TRANSFORM_UPPERCASE_FIRST (10) -- uppercase the first character
  • BROTLI_TRANSFORM_UPPERCASE_ALL (11) -- uppercase all characters
  • BROTLI_TRANSFORM_OMIT_FIRST_1 through BROTLI_TRANSFORM_OMIT_FIRST_9 (12-20) -- skip first N bytes
  • BROTLI_TRANSFORM_SHIFT_FIRST (21) -- shift first Unicode scalar by a parameter
  • BROTLI_TRANSFORM_SHIFT_ALL (22) -- shift all Unicode scalars by a parameter

Key functions:

  • BrotliGetTransforms() -- returns a pointer to the singleton kBrotliTransforms struct.
  • BrotliTransformDictionaryWord() -- the core transform function that applies a specified transform to a dictionary word, writing the result (prefix + transformed word + suffix) to a destination buffer. It handles UTF-8 multi-byte sequences for uppercase and shift operations.
  • ToUpperCase() -- internal helper for ASCII and simplified UTF-8 uppercasing (1-byte ASCII toggle, 2-byte XOR on second byte, 3-byte XOR on third byte).
  • Shift() -- internal helper for Unicode scalar shifting across 1-byte to 4-byte UTF-8 sequences with sign extension.

Usage

DuckDB uses Brotli transforms implicitly through the Brotli encoder and decoder. When the encoder finds a match in the static dictionary, it selects the best (word_index, transform_index) pair. The decoder calls BrotliTransformDictionaryWord to reconstruct the original text from the dictionary reference and transform. The transform system is especially effective for web content, where common word forms (capitalized, suffixed with spaces or punctuation, etc.) are frequent. For example, transform #4 converts a word to its uppercase-first form, while transform #5 appends " " (space) after the word.

Code Reference

Source Location

  • Repository: Duckdb_Duckdb
  • File: third_party/brotli/common/transform.cpp (287 lines) -- transform data and algorithm
  • File: third_party/brotli/common/transform.h -- transform type enum and struct

Signature

namespace duckdb_brotli {

// Transform type enumeration (from transform.h)
enum BrotliWordTransformType {
    BROTLI_TRANSFORM_IDENTITY = 0,
    BROTLI_TRANSFORM_OMIT_LAST_1 = 1,
    // ... through BROTLI_TRANSFORM_OMIT_LAST_9 = 9
    BROTLI_TRANSFORM_UPPERCASE_FIRST = 10,
    BROTLI_TRANSFORM_UPPERCASE_ALL = 11,
    BROTLI_TRANSFORM_OMIT_FIRST_1 = 12,
    // ... through BROTLI_TRANSFORM_OMIT_FIRST_9 = 20
    BROTLI_TRANSFORM_SHIFT_FIRST = 21,
    BROTLI_TRANSFORM_SHIFT_ALL = 22,
    BROTLI_NUM_TRANSFORM_TYPES
};

// Transform data container (from transform.h)
typedef struct BrotliTransforms {
    uint16_t prefix_suffix_size;
    const uint8_t* prefix_suffix;
    const uint16_t* prefix_suffix_map;
    uint32_t num_transforms;
    const uint8_t* transforms;
    const uint8_t* params;
    int16_t cutOffTransforms[BROTLI_TRANSFORMS_MAX_CUT_OFF + 1];
} BrotliTransforms;

// Access macros
#define BROTLI_TRANSFORM_PREFIX_ID(T, I) ((T)->transforms[((I) * 3) + 0])
#define BROTLI_TRANSFORM_TYPE(T, I)      ((T)->transforms[((I) * 3) + 1])
#define BROTLI_TRANSFORM_SUFFIX_ID(T, I) ((T)->transforms[((I) * 3) + 2])
#define BROTLI_TRANSFORM_PREFIX(T, I)    (&(T)->prefix_suffix[...])
#define BROTLI_TRANSFORM_SUFFIX(T, I)    (&(T)->prefix_suffix[...])

// Public API (from transform.h)
BROTLI_COMMON_API const BrotliTransforms* BrotliGetTransforms(void);

BROTLI_COMMON_API int BrotliTransformDictionaryWord(
    uint8_t* dst,
    const uint8_t* word,
    int len,
    const BrotliTransforms* transforms,
    int transform_idx);

// Internal helpers (from transform.cpp)
static int ToUpperCase(uint8_t* p);
static int Shift(uint8_t* word, int word_len, uint16_t parameter);

} // namespace duckdb_brotli

Import

#include "../common/transform.h"

I/O Contract

Inputs

Name Type Required Description
dst uint8_t* Yes Destination buffer for the transformed word (must have sufficient space for prefix + word + suffix, max ~545 bytes)
word const uint8_t* Yes Pointer to the source dictionary word bytes
len int Yes Length of the source dictionary word in bytes (4-24 for RFC 7932)
transforms const BrotliTransforms* Yes Pointer to the transforms struct (typically from BrotliGetTransforms())
transform_idx int Yes Index of the transform to apply (0-120 for RFC 7932)

Outputs

Name Type Description
(return) int Total number of bytes written to dst (prefix_len + transformed_word_len + suffix_len)
dst uint8_t* Buffer filled with the transformed word: prefix bytes, then the (possibly truncated/case-modified) word, then suffix bytes

Usage Examples

// Applying a transform during Brotli decompression (from decode.cpp)
#include "../common/dictionary.h"
#include "../common/transform.h"

using namespace duckdb_brotli;

const BrotliDictionary* dict = BrotliGetDictionary();
const BrotliTransforms* transforms = BrotliGetTransforms();

// Suppose the decoder reads word_index=100, length=5, transform=4
int word_len = 5;
int word_idx = 100;
int transform_idx = 4;  // BROTLI_TRANSFORM_UPPERCASE_FIRST

uint32_t offset = dict->offsets_by_length[word_len]
    + (word_idx * word_len);
const uint8_t* word = &dict->data[offset];

uint8_t dest[256];
int result_len = BrotliTransformDictionaryWord(
    dest, word, word_len, transforms, transform_idx);
// dest now contains the transformed word with first letter uppercased

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment