Principle:Duckdb Duckdb Zlib Deflate Compression
| Knowledge Sources | |
|---|---|
| Domains | Compression, Data_Format |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
A widely-adopted lossless compression method combining LZ77 duplicate string elimination with Huffman coding, forming the basis of gzip, zlib, and PNG compression formats.
Description
Deflate is a lossless compressed data format specified in RFC 1951 that combines LZ77 (sliding window) compression with Huffman coding. It was originally designed by Phil Katz for the PKZIP archiver and has become one of the most ubiquitous compression formats in computing. The zlib wrapper (RFC 1950) adds a header and checksum around Deflate data, while the gzip wrapper (RFC 1952) adds file metadata and a CRC-32 checksum.
The Deflate algorithm processes input data in blocks. Each block can use one of three modes: no compression (stored), compression with fixed Huffman codes, or compression with dynamic Huffman codes. In the compressed modes, the algorithm first finds LZ77 matches using a hash chain or binary tree, then encodes the resulting literal/length/distance symbols using Huffman codes. The dynamic Huffman mode transmits the code tables as part of the block, while the fixed mode uses pre-defined tables specified in the RFC.
Miniz is a single-source-file implementation of zlib-compatible compression that provides both Deflate compression/decompression and the zlib/gzip wrappers. It is designed to be easily embeddable in other projects and provides API compatibility with the original zlib library.
Usage
Zlib/Deflate compression is used in DuckDB for reading and writing gzip-compressed files (`.gz`, `.csv.gz`). When DuckDB encounters a gzip-compressed input file, it transparently decompresses the data during reading. Similarly, DuckDB can write compressed output in gzip format. The Miniz implementation provides this capability without requiring an external zlib dependency.
Theoretical Basis
LZ77 with Hash Chains: Deflate finds matches using chained hash tables:
// Match finding with hash chains
window_size = 32768 // 32 KB sliding window
hash_table = array[65536] // head of chain for each 3-byte hash
prev = array[32768] // chain links
for each position p:
h = hash(input[p], input[p+1], input[p+2])
candidate = hash_table[h]
hash_table[h] = p
prev[p & 32767] = candidate
// Follow chain to find best match
best_match = search_chain(candidate, max_chain_length)
Huffman Code Encoding: Deflate uses canonical Huffman codes:
// Literal/Length alphabet: 0-255 = literals, 256 = end, 257-285 = lengths
// Distance alphabet: 0-29 = distance codes with extra bits
// Fixed Huffman codes (pre-defined):
// 0-143: 8-bit codes (00110000 - 10111111)
// 144-255: 9-bit codes (110010000 - 111111111)
// 256-279: 7-bit codes (0000000 - 0010111)
// 280-287: 8-bit codes (11000000 - 11000111)
// Dynamic Huffman codes: transmitted in block header
// Code lengths encoded with a secondary Huffman code
Block Structure:
deflate_stream:
repeated {
BFINAL: 1 bit (1 if last block)
BTYPE: 2 bits (00=stored, 01=fixed, 10=dynamic)
block_data:
if stored: LEN + NLEN + raw bytes
if fixed: symbols encoded with fixed Huffman
if dynamic: code_table_description + symbols
}