Principle:Duckdb Duckdb FSST String Compression

Knowledge Sources	FSST: Fast Random Access String Compression FSST GitHub Repository
Domains	Compression, Columnar_Storage, String_Processing
Last Updated	2026-02-07 12:00 GMT

Overview

A lightweight, column-oriented string compression scheme that builds a symbol table of frequent substrings and encodes strings as sequences of single-byte symbol codes, enabling random access without full decompression.

Description

Fast Static Symbol Table (FSST) is a string compression technique designed specifically for analytical database systems that store data in columnar format. Published at VLDB 2020 by Peter Boncz et al. from CWI Amsterdam, FSST addresses the need for a string compression method that is fast to decompress, supports random access to individual values, and achieves meaningful compression ratios on real-world string columns.

FSST works by building a symbol table of up to 255 frequently occurring substrings (of length 1-8 bytes) from the input data. Each string is then encoded by replacing occurrences of these symbols with single-byte codes (0-254), with byte value 255 reserved as an escape character for bytes not in the symbol table. This means every input byte maps to either one byte (a symbol code) or two bytes (escape + original byte), giving a worst-case expansion of 2x.

The symbol table construction uses a greedy algorithm that iteratively selects symbols maximizing compression gain. Starting with an empty table, it counts the frequency of all possible 2-8 byte substrings in the data, picks the best candidates, and repeats this process for a small number of iterations (typically 5). The simplicity of the encoding (just byte-to-byte mapping) makes FSST extremely fast to decompress, achieving decompression speeds of several GB/s.

Usage

FSST is used in DuckDB as a compression method for string columns in the native storage format. When DuckDB detects that a string column would benefit from FSST compression (based on a sampling heuristic), it builds a symbol table from the column values and encodes all strings using that table. The random access property means individual string values can be decompressed without decompressing the entire column segment.

Theoretical Basis

Symbol Table Construction: The greedy algorithm to build the symbol table:

// Build symbol table iteratively
symbol_table = empty (256 entries, index 0-254 usable)
for iteration in 1..5:
    // Count all 2-8 byte substring frequencies
    counters = count_substrings(data, symbol_table)
    // Score candidates by compression gain
    for each candidate substring s:
        gain[s] = frequency[s] * (encoded_len(s) - 1)
    // Select top-255 symbols by gain
    symbol_table = top_k(candidates, 255, key=gain)

Encoding: Strings are encoded by greedy longest-match replacement:

// Encode a single string
function encode(input, symbol_table):
    output = []
    pos = 0
    while pos < len(input):
        best_symbol = find_longest_match(input[pos:], symbol_table)
        if best_symbol found:
            output.append(best_symbol.code)  // 1 byte
            pos += len(best_symbol)
        else:
            output.append(0xFF)              // escape byte
            output.append(input[pos])        // literal byte
            pos += 1
    return output

Decoding: Decompression is a simple table lookup:

// Decode a compressed string
function decode(compressed, symbol_table):
    output = []
    pos = 0
    while pos < len(compressed):
        if compressed[pos] == 0xFF:
            output.append(compressed[pos + 1])  // literal
            pos += 2
        else:
            symbol = symbol_table[compressed[pos]]
            output.append(symbol.bytes)         // expand symbol
            pos += 1
    return output

Compression Properties:

- Symbol table size: up to 255 entries, each 1-8 bytes
- Escape character: 0xFF
- Worst-case expansion: 2x (every byte escaped)
- Typical compression: 2-4x on real-world string data
- Decompression: branchless, cache-friendly table lookup
- Random access: O(1) with per-value offset array

Related Pages

Implementation:Duckdb_Duckdb_FSST_Compression

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment