Principle:Duckdb Duckdb Brotli Compression
| Knowledge Sources | |
|---|---|
| Domains | Compression, Data_Format |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
A general-purpose lossless compression algorithm that combines LZ77, Huffman coding, and second-order context modeling to achieve high compression ratios.
Description
Brotli is a compression algorithm developed by Google that achieves higher compression ratios than gzip while maintaining comparable decompression speeds. It uses a combination of a modern variant of the LZ77 algorithm, Huffman coding, and second-order context modeling. Brotli includes a pre-defined dictionary of common words and phrases from web content, which contributes to its superior compression of HTML, CSS, JavaScript, and other web formats.
The algorithm operates in three main stages:
- LZ77 backward reference search — finds repeated byte sequences and encodes them as distance/length pairs
- Context modeling — uses the context of previously seen bytes to improve entropy coding
- Entropy coding — uses Huffman codes (and optionally static prefix codes) to encode the compressed data
Usage
Use Brotli compression when handling HTTP content encoding in DuckDB's httpfs extension, or when reading data from sources that use Brotli compression. It provides better compression ratios than gzip/zlib for web-oriented content.
Theoretical Basis
The core of Brotli combines three compression techniques:
LZ77 Sliding Window: Finds matches in a sliding window of up to 16 MiB:
// Pseudo-code
for each position in input:
find longest match in window (up to 16 MiB back)
if match found: emit (distance, length) pair
else: emit literal byte
Huffman Coding: Encodes symbols using variable-length prefix codes based on frequency:
// More frequent symbols get shorter codes
frequency_table = count_symbols(data)
huffman_tree = build_tree(frequency_table)
encoded = encode_with_tree(data, huffman_tree)
Context Modeling: Uses previous bytes to select among multiple Huffman trees:
context = compute_context(prev_bytes)
tree = select_huffman_tree(context)
symbol = decode_with_tree(tree, bitstream)