Principle:Duckdb Duckdb FSST String Compression
| Knowledge Sources | |
|---|---|
| Domains | Compression, Columnar_Storage, String_Processing |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
A lightweight, column-oriented string compression scheme that builds a symbol table of frequent substrings and encodes strings as sequences of single-byte symbol codes, enabling random access without full decompression.
Description
Fast Static Symbol Table (FSST) is a string compression technique designed specifically for analytical database systems that store data in columnar format. Published at VLDB 2020 by Peter Boncz et al. from CWI Amsterdam, FSST addresses the need for a string compression method that is fast to decompress, supports random access to individual values, and achieves meaningful compression ratios on real-world string columns.
FSST works by building a symbol table of up to 255 frequently occurring substrings (of length 1-8 bytes) from the input data. Each string is then encoded by replacing occurrences of these symbols with single-byte codes (0-254), with byte value 255 reserved as an escape character for bytes not in the symbol table. This means every input byte maps to either one byte (a symbol code) or two bytes (escape + original byte), giving a worst-case expansion of 2x.
The symbol table construction uses a greedy algorithm that iteratively selects symbols maximizing compression gain. Starting with an empty table, it counts the frequency of all possible 2-8 byte substrings in the data, picks the best candidates, and repeats this process for a small number of iterations (typically 5). The simplicity of the encoding (just byte-to-byte mapping) makes FSST extremely fast to decompress, achieving decompression speeds of several GB/s.
Usage
FSST is used in DuckDB as a compression method for string columns in the native storage format. When DuckDB detects that a string column would benefit from FSST compression (based on a sampling heuristic), it builds a symbol table from the column values and encodes all strings using that table. The random access property means individual string values can be decompressed without decompressing the entire column segment.
Theoretical Basis
Symbol Table Construction: The greedy algorithm to build the symbol table:
// Build symbol table iteratively
symbol_table = empty (256 entries, index 0-254 usable)
for iteration in 1..5:
// Count all 2-8 byte substring frequencies
counters = count_substrings(data, symbol_table)
// Score candidates by compression gain
for each candidate substring s:
gain[s] = frequency[s] * (encoded_len(s) - 1)
// Select top-255 symbols by gain
symbol_table = top_k(candidates, 255, key=gain)
Encoding: Strings are encoded by greedy longest-match replacement:
// Encode a single string
function encode(input, symbol_table):
output = []
pos = 0
while pos < len(input):
best_symbol = find_longest_match(input[pos:], symbol_table)
if best_symbol found:
output.append(best_symbol.code) // 1 byte
pos += len(best_symbol)
else:
output.append(0xFF) // escape byte
output.append(input[pos]) // literal byte
pos += 1
return output
Decoding: Decompression is a simple table lookup:
// Decode a compressed string
function decode(compressed, symbol_table):
output = []
pos = 0
while pos < len(compressed):
if compressed[pos] == 0xFF:
output.append(compressed[pos + 1]) // literal
pos += 2
else:
symbol = symbol_table[compressed[pos]]
output.append(symbol.bytes) // expand symbol
pos += 1
return output
Compression Properties:
- Symbol table size: up to 255 entries, each 1-8 bytes
- Escape character: 0xFF
- Worst-case expansion: 2x (every byte escaped)
- Typical compression: 2-4x on real-world string data
- Decompression: branchless, cache-friendly table lookup
- Random access: O(1) with per-value offset array