Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Duckdb Duckdb Brotli Compression

From Leeroopedia
Revision as of 17:38, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Duckdb_Duckdb_Brotli_Compression.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)


Knowledge Sources
Domains Compression, Data_Format
Last Updated 2026-02-07 12:00 GMT

Overview

A general-purpose lossless compression algorithm that combines LZ77, Huffman coding, and second-order context modeling to achieve high compression ratios.

Description

Brotli is a compression algorithm developed by Google that achieves higher compression ratios than gzip while maintaining comparable decompression speeds. It uses a combination of a modern variant of the LZ77 algorithm, Huffman coding, and second-order context modeling. Brotli includes a pre-defined dictionary of common words and phrases from web content, which contributes to its superior compression of HTML, CSS, JavaScript, and other web formats.

The algorithm operates in three main stages:

  1. LZ77 backward reference search — finds repeated byte sequences and encodes them as distance/length pairs
  2. Context modeling — uses the context of previously seen bytes to improve entropy coding
  3. Entropy coding — uses Huffman codes (and optionally static prefix codes) to encode the compressed data

Usage

Use Brotli compression when handling HTTP content encoding in DuckDB's httpfs extension, or when reading data from sources that use Brotli compression. It provides better compression ratios than gzip/zlib for web-oriented content.

Theoretical Basis

The core of Brotli combines three compression techniques:

LZ77 Sliding Window: Finds matches in a sliding window of up to 16 MiB:

// Pseudo-code
for each position in input:
    find longest match in window (up to 16 MiB back)
    if match found: emit (distance, length) pair
    else: emit literal byte

Huffman Coding: Encodes symbols using variable-length prefix codes based on frequency:

// More frequent symbols get shorter codes
frequency_table = count_symbols(data)
huffman_tree = build_tree(frequency_table)
encoded = encode_with_tree(data, huffman_tree)

Context Modeling: Uses previous bytes to select among multiple Huffman trees:

context = compute_context(prev_bytes)
tree = select_huffman_tree(context)
symbol = decode_with_tree(tree, bitstream)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment