Implementation:Duckdb Duckdb Brotli Transform
| Knowledge Sources | |
|---|---|
| Domains | Compression, Third_Party |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
The Brotli Transform module implements the word transformation algorithm defined in RFC 7932, which applies prefix/suffix additions, character omissions, case changes, and Unicode scalar shifts to static dictionary words during compression and decompression. (MANUAL_REVIEW APPROVED)
Description
transform.cpp (287 lines) defines the 121 word transforms specified by the Brotli format standard. Each transform is a triplet of (prefix_id, transform_type, suffix_id) that modifies a static dictionary word to produce a derived string. This mechanism dramatically increases the effective dictionary size without storing additional word data.
The module contains:
Static data tables:
kPrefixSuffix[217]-- a packed string containing all prefix and suffix fragments (e.g., " ", ", ", " of the ", " and ", ".com/", " the ", etc.). The first byte of each fragment encodes its length.kPrefixSuffixMap[50]-- an index array mapping prefix/suffix IDs to byte offsets withinkPrefixSuffix.kTransformsData[]-- a 363-byte array (121 transforms x 3 bytes each) encoding the transform triplets.kBrotliTransforms-- the singletonBrotliTransformsstruct that aggregates all the above data, including thecutOffTransformsoptimization array.
Transform types (defined in transform.h as BrotliWordTransformType enum):
BROTLI_TRANSFORM_IDENTITY(0) -- copy word as-isBROTLI_TRANSFORM_OMIT_LAST_1throughBROTLI_TRANSFORM_OMIT_LAST_9(1-9) -- truncate last N bytesBROTLI_TRANSFORM_UPPERCASE_FIRST(10) -- uppercase the first characterBROTLI_TRANSFORM_UPPERCASE_ALL(11) -- uppercase all charactersBROTLI_TRANSFORM_OMIT_FIRST_1throughBROTLI_TRANSFORM_OMIT_FIRST_9(12-20) -- skip first N bytesBROTLI_TRANSFORM_SHIFT_FIRST(21) -- shift first Unicode scalar by a parameterBROTLI_TRANSFORM_SHIFT_ALL(22) -- shift all Unicode scalars by a parameter
Key functions:
BrotliGetTransforms()-- returns a pointer to the singletonkBrotliTransformsstruct.BrotliTransformDictionaryWord()-- the core transform function that applies a specified transform to a dictionary word, writing the result (prefix + transformed word + suffix) to a destination buffer. It handles UTF-8 multi-byte sequences for uppercase and shift operations.ToUpperCase()-- internal helper for ASCII and simplified UTF-8 uppercasing (1-byte ASCII toggle, 2-byte XOR on second byte, 3-byte XOR on third byte).Shift()-- internal helper for Unicode scalar shifting across 1-byte to 4-byte UTF-8 sequences with sign extension.
Usage
DuckDB uses Brotli transforms implicitly through the Brotli encoder and decoder. When the encoder finds a match in the static dictionary, it selects the best (word_index, transform_index) pair. The decoder calls BrotliTransformDictionaryWord to reconstruct the original text from the dictionary reference and transform. The transform system is especially effective for web content, where common word forms (capitalized, suffixed with spaces or punctuation, etc.) are frequent. For example, transform #4 converts a word to its uppercase-first form, while transform #5 appends " " (space) after the word.
Code Reference
Source Location
- Repository: Duckdb_Duckdb
- File:
third_party/brotli/common/transform.cpp(287 lines) -- transform data and algorithm - File:
third_party/brotli/common/transform.h-- transform type enum and struct
Signature
namespace duckdb_brotli {
// Transform type enumeration (from transform.h)
enum BrotliWordTransformType {
BROTLI_TRANSFORM_IDENTITY = 0,
BROTLI_TRANSFORM_OMIT_LAST_1 = 1,
// ... through BROTLI_TRANSFORM_OMIT_LAST_9 = 9
BROTLI_TRANSFORM_UPPERCASE_FIRST = 10,
BROTLI_TRANSFORM_UPPERCASE_ALL = 11,
BROTLI_TRANSFORM_OMIT_FIRST_1 = 12,
// ... through BROTLI_TRANSFORM_OMIT_FIRST_9 = 20
BROTLI_TRANSFORM_SHIFT_FIRST = 21,
BROTLI_TRANSFORM_SHIFT_ALL = 22,
BROTLI_NUM_TRANSFORM_TYPES
};
// Transform data container (from transform.h)
typedef struct BrotliTransforms {
uint16_t prefix_suffix_size;
const uint8_t* prefix_suffix;
const uint16_t* prefix_suffix_map;
uint32_t num_transforms;
const uint8_t* transforms;
const uint8_t* params;
int16_t cutOffTransforms[BROTLI_TRANSFORMS_MAX_CUT_OFF + 1];
} BrotliTransforms;
// Access macros
#define BROTLI_TRANSFORM_PREFIX_ID(T, I) ((T)->transforms[((I) * 3) + 0])
#define BROTLI_TRANSFORM_TYPE(T, I) ((T)->transforms[((I) * 3) + 1])
#define BROTLI_TRANSFORM_SUFFIX_ID(T, I) ((T)->transforms[((I) * 3) + 2])
#define BROTLI_TRANSFORM_PREFIX(T, I) (&(T)->prefix_suffix[...])
#define BROTLI_TRANSFORM_SUFFIX(T, I) (&(T)->prefix_suffix[...])
// Public API (from transform.h)
BROTLI_COMMON_API const BrotliTransforms* BrotliGetTransforms(void);
BROTLI_COMMON_API int BrotliTransformDictionaryWord(
uint8_t* dst,
const uint8_t* word,
int len,
const BrotliTransforms* transforms,
int transform_idx);
// Internal helpers (from transform.cpp)
static int ToUpperCase(uint8_t* p);
static int Shift(uint8_t* word, int word_len, uint16_t parameter);
} // namespace duckdb_brotli
Import
#include "../common/transform.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dst | uint8_t* |
Yes | Destination buffer for the transformed word (must have sufficient space for prefix + word + suffix, max ~545 bytes) |
| word | const uint8_t* |
Yes | Pointer to the source dictionary word bytes |
| len | int |
Yes | Length of the source dictionary word in bytes (4-24 for RFC 7932) |
| transforms | const BrotliTransforms* |
Yes | Pointer to the transforms struct (typically from BrotliGetTransforms())
|
| transform_idx | int |
Yes | Index of the transform to apply (0-120 for RFC 7932) |
Outputs
| Name | Type | Description |
|---|---|---|
| (return) | int |
Total number of bytes written to dst (prefix_len + transformed_word_len + suffix_len)
|
| dst | uint8_t* |
Buffer filled with the transformed word: prefix bytes, then the (possibly truncated/case-modified) word, then suffix bytes |
Usage Examples
// Applying a transform during Brotli decompression (from decode.cpp)
#include "../common/dictionary.h"
#include "../common/transform.h"
using namespace duckdb_brotli;
const BrotliDictionary* dict = BrotliGetDictionary();
const BrotliTransforms* transforms = BrotliGetTransforms();
// Suppose the decoder reads word_index=100, length=5, transform=4
int word_len = 5;
int word_idx = 100;
int transform_idx = 4; // BROTLI_TRANSFORM_UPPERCASE_FIRST
uint32_t offset = dict->offsets_by_length[word_len]
+ (word_idx * word_len);
const uint8_t* word = &dict->data[offset];
uint8_t dest[256];
int result_len = BrotliTransformDictionaryWord(
dest, word, word_len, transforms, transform_idx);
// dest now contains the transformed word with first letter uppercased