Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Duckdb Duckdb Snappy Compression

From Leeroopedia


Knowledge Sources
Domains Compression, Columnar_Storage
Last Updated 2026-02-07 12:00 GMT

Overview

A fast lossless compression algorithm optimized for speed rather than maximum compression ratio, using an LZ77 variant with simplified match-finding.

Description

Snappy is a compression library developed by Google that aims to provide very high speeds with reasonable compression ratios. Rather than targeting maximum compression like zlib or Brotli, Snappy targets extreme speed: approximately 250 MB/s compression and 500 MB/s decompression on a single core. It achieves compression ratios typically between 1.5x and 1.7x for plain text, which is lower than algorithms optimized for ratio but adequate for many applications.

The algorithm is an LZ77 variant that uses a hash table for match-finding, similar to LZ4. The key design decisions that contribute to its speed include: a fixed hash table size (to avoid dynamic memory allocation), simple byte-oriented encoding (no bit-level operations), and aggressive use of 64-bit operations for copying data. Snappy processes data in blocks of up to 64 KB, with each block being independently compressed.

Snappy was originally developed as "Zippy" inside Google and is widely used in Google's internal infrastructure, including BigTable, MapReduce, and the Parquet columnar file format. Its adoption in Parquet makes it particularly relevant for analytical data processing systems.

Usage

Snappy is used in DuckDB as one of the compression codecs supported when reading and writing Apache Parquet files. When a Parquet file specifies Snappy as its compression codec, DuckDB uses Snappy to decompress column chunks during reads and to compress them during writes. It is the default compression in many Parquet-producing systems due to its speed characteristics.

Theoretical Basis

Hash-Based Match Finding: Snappy uses a simple hash table to find LZ77 matches:

// Fixed 16KB hash table (4096 entries of 4 bytes)
hash_table = array[4096]
for each position p in input:
    h = hash(input[p..p+3]) & 0xFFF   // 4-byte hash, masked to 12 bits
    candidate = hash_table[h]
    hash_table[h] = p
    if input[candidate..candidate+3] == input[p..p+3]:
        extend match forward and backward
        emit copy(offset, length)
    else:
        emit literal(input[p])

Tag-Based Encoding: Snappy uses byte-aligned tags with four element types:

Tag byte: [element_type (2 bits) | payload (6 bits)]

Element types:
  00 = Literal  (payload = length - 1, for lengths 1-60)
  01 = Copy with 1-byte offset (length 4-11, offset 0-2047)
  10 = Copy with 2-byte offset (length 1-64, offset 0-65535)
  11 = Copy with 4-byte offset (length 1-64, offset 0-2^32)

Block Structure: Each compressed block is self-contained:

compressed_block:
    varint: uncompressed_length
    repeated:
        tag_byte + optional_extra_bytes + data
    // No end marker; length is known from uncompressed_length

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment