Principle:Duckdb Duckdb Snappy Compression
| Knowledge Sources | |
|---|---|
| Domains | Compression, Columnar_Storage |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
A fast lossless compression algorithm optimized for speed rather than maximum compression ratio, using an LZ77 variant with simplified match-finding.
Description
Snappy is a compression library developed by Google that aims to provide very high speeds with reasonable compression ratios. Rather than targeting maximum compression like zlib or Brotli, Snappy targets extreme speed: approximately 250 MB/s compression and 500 MB/s decompression on a single core. It achieves compression ratios typically between 1.5x and 1.7x for plain text, which is lower than algorithms optimized for ratio but adequate for many applications.
The algorithm is an LZ77 variant that uses a hash table for match-finding, similar to LZ4. The key design decisions that contribute to its speed include: a fixed hash table size (to avoid dynamic memory allocation), simple byte-oriented encoding (no bit-level operations), and aggressive use of 64-bit operations for copying data. Snappy processes data in blocks of up to 64 KB, with each block being independently compressed.
Snappy was originally developed as "Zippy" inside Google and is widely used in Google's internal infrastructure, including BigTable, MapReduce, and the Parquet columnar file format. Its adoption in Parquet makes it particularly relevant for analytical data processing systems.
Usage
Snappy is used in DuckDB as one of the compression codecs supported when reading and writing Apache Parquet files. When a Parquet file specifies Snappy as its compression codec, DuckDB uses Snappy to decompress column chunks during reads and to compress them during writes. It is the default compression in many Parquet-producing systems due to its speed characteristics.
Theoretical Basis
Hash-Based Match Finding: Snappy uses a simple hash table to find LZ77 matches:
// Fixed 16KB hash table (4096 entries of 4 bytes)
hash_table = array[4096]
for each position p in input:
h = hash(input[p..p+3]) & 0xFFF // 4-byte hash, masked to 12 bits
candidate = hash_table[h]
hash_table[h] = p
if input[candidate..candidate+3] == input[p..p+3]:
extend match forward and backward
emit copy(offset, length)
else:
emit literal(input[p])
Tag-Based Encoding: Snappy uses byte-aligned tags with four element types:
Tag byte: [element_type (2 bits) | payload (6 bits)]
Element types:
00 = Literal (payload = length - 1, for lengths 1-60)
01 = Copy with 1-byte offset (length 4-11, offset 0-2047)
10 = Copy with 2-byte offset (length 1-64, offset 0-65535)
11 = Copy with 4-byte offset (length 1-64, offset 0-2^32)
Block Structure: Each compressed block is self-contained:
compressed_block:
varint: uncompressed_length
repeated:
tag_byte + optional_extra_bytes + data
// No end marker; length is known from uncompressed_length