Implementation:Duckdb Duckdb FSST Compression
| Knowledge Sources | |
|---|---|
| Domains | Compression, Third_Party |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
FSST (Fast Static Symbol Table) is a string compression scheme designed for columnar databases, integrated into DuckDB for high-speed, random-access compression of string/text data.
Description
FSST is a compression algorithm specifically designed for string data in analytical databases. Unlike block-based algorithms such as LZ4 or Zstd, FSST compresses individual strings independently, enabling random access to compressed data without decompressing surrounding data in a block. It achieves this by learning a symbol table that maps 1-8 byte sequences ("symbols") onto single-byte "codes" (0-254), with byte 255 (FSST_ESC) reserved as an escape prefix for uncompressed bytes.
The implementation consists of three source files:
- fsst.h (246 lines) -- public C API defining the encoder/decoder types and all external functions. The decoder is a lightweight struct (
duckdb_fsst_decoder_t) containing a 255-entry symbol table and length array, suitable for read-only sharing across threads. The encoder is an opaque type (~900 KB) wrapping the C++SymbolTable.
- libfsst.hpp (466 lines) -- C++ internals including the
Symbolstruct (8-byte value + packed code/length),SymbolTableclass (256 entries plus hash tables for encoding),Countersfor frequency analysis, and thebuildSymbolTable()training function. Symbols store their byte sequence as au64in little-endian format for fast comparison. A symbol's code and length are packed into a singleu64field (icl) for single-load comparison.
- libfsst.cpp (517 lines) -- the training algorithm implementation. The
buildSymbolTable()function iteratively builds the optimal symbol table by: (1) determining the least-frequent byte as the terminator, (2) iterating up to 5 rounds with sample sizes from 1/128 to full input, (3) counting symbol pair frequencies, (4) selecting the best new symbols by gain (frequency x length - 1), and (5) re-encoding the sample to measure total gain. A hash-basedQSymbolset deduplicates candidate symbols.
Key properties:
- Equal strings produce equal compressed forms -- preserves equality semantics
- Decompression is inlined for speed, using a 4-byte-at-a-time fast path on little-endian platforms
- Encoder is not thread-safe -- use
duckdb_fsst_duplicate()for multi-threaded encoding - Decoder is read-only and safely shareable across threads
- Serialization via
duckdb_fsst_export()/duckdb_fsst_import()for compact storage (maxFSST_MAXHEADER = 2066bytes)
Usage
DuckDB uses FSST as a built-in compression method for string columns in its storage layer. When string data is stored, DuckDB may choose FSST compression based on an analysis of the column data. The symbol table is trained on a sample of the column's string values, stored in the column metadata, and used to compress/decompress individual string values with random access.
Code Reference
Source Location
- Repository: Duckdb_Duckdb
- Files:
- third_party/fsst/fsst.h (246 lines) -- FSST public C API
- third_party/fsst/libfsst.hpp (466 lines) -- FSST C++ header with internal types
- third_party/fsst/libfsst.cpp (517 lines) -- FSST compression/training implementation
Signature
// --- Opaque Types ---
typedef void* duckdb_fsst_encoder_t; // wraps ~900KB C++ SymbolTable object
typedef struct {
unsigned long long version;
unsigned char zeroTerminated;
unsigned char len[255]; // byte-length of symbol x (1 <= len[x] <= 8)
unsigned long long symbol[255]; // little-endian byte sequence for code x
} duckdb_fsst_decoder_t;
// --- Symbol Table Creation ---
// Calibrate a symbol table from a batch of strings (best with >= 16KB of data).
duckdb_fsst_encoder_t* duckdb_fsst_create(
size_t n, // number of input strings
size_t lenIn[], // byte-lengths of each input string
unsigned char *strIn[], // start pointers of each input string
int zeroTerminated // whether strings are zero-terminated
);
// Duplicate an encoder for multi-threaded encoding.
duckdb_fsst_encoder_t* duckdb_fsst_duplicate(
duckdb_fsst_encoder_t *encoder
);
// Deallocate an encoder.
void duckdb_fsst_destroy(duckdb_fsst_encoder_t*);
// --- Serialization ---
// Export symbol table to a compact byte buffer.
unsigned int duckdb_fsst_export(
duckdb_fsst_encoder_t *encoder,
unsigned char *buf // output buffer (max FSST_MAXHEADER bytes)
);
// Import symbol table from a serialized byte buffer.
unsigned int duckdb_fsst_import(
duckdb_fsst_decoder_t *decoder, // overwritten with imported table
unsigned char *buf // serialized symbol table bytes
);
// Get decoder from encoder.
duckdb_fsst_decoder_t duckdb_fsst_decoder(
duckdb_fsst_encoder_t *encoder
);
// --- Compression ---
// Compress a batch of strings. Output buffer must be >= 7 + 2*inputlength.
size_t duckdb_fsst_compress(
duckdb_fsst_encoder_t *encoder,
size_t nstrings, // number of strings to compress
size_t lenIn[], // byte-lengths of inputs
unsigned char *strIn[], // input string pointers
size_t outsize, // output buffer capacity
unsigned char *output, // output buffer
size_t lenOut[], // compressed lengths (output)
unsigned char *strOut[] // compressed string pointers (output)
);
// --- Decompression (inlined for speed) ---
inline size_t duckdb_fsst_decompress(
duckdb_fsst_decoder_t *decoder,
size_t lenIn, // compressed string length
const unsigned char *strIn, // compressed string
size_t size, // output buffer capacity
unsigned char *output // output buffer
);
Import
#include "fsst.h" // C API
#include "libfsst.hpp" // C++ internals (for DuckDB integration code)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| n / nstrings | size_t |
Yes | Number of strings in the input batch |
| lenIn | size_t[] |
Yes | Array of byte-lengths for each input string |
| strIn | unsigned char*[] |
Yes | Array of pointers to input string start positions |
| zeroTerminated | int |
Yes (for create) | Whether input strings are zero-terminated; if so, compressed output is also zero-terminated |
| encoder | duckdb_fsst_encoder_t* |
Yes (for compress) | Trained symbol table encoder |
| decoder | duckdb_fsst_decoder_t* |
Yes (for decompress) | Symbol table decoder (from export/import or from encoder) |
| outsize / size | size_t |
Yes | Output buffer capacity; must be >= 7 + 2 * total_input_length for compression
|
| output | unsigned char* |
Yes | Output buffer for compressed or decompressed data |
Outputs
| Name | Type | Description |
|---|---|---|
| (return) from duckdb_fsst_compress | size_t |
Number of strings that fit in the output buffer (<= nstrings) |
| (return) from duckdb_fsst_decompress | size_t |
Byte size of the decompressed string; if > size, output is truncated
|
| lenOut | size_t[] |
Byte-lengths of each compressed string |
| strOut | unsigned char*[] |
Pointers into output for each compressed string
|
| (return) from duckdb_fsst_export | unsigned int |
Number of bytes written to the serialization buffer |
| (return) from duckdb_fsst_import | unsigned int |
Number of bytes consumed from the serialization buffer (0 on failure) |
Usage Examples
#include "fsst.h"
#include <cstring>
// --- Train a Symbol Table ---
const char* strings[] = {"hello world", "hello duckdb", "hello fsst"};
size_t lengths[] = {11, 12, 10};
unsigned char* ptrs[] = {
(unsigned char*)strings[0],
(unsigned char*)strings[1],
(unsigned char*)strings[2]
};
duckdb_fsst_encoder_t* encoder = duckdb_fsst_create(
3, lengths, ptrs, 0 /* not zero-terminated */);
// --- Compress a Batch ---
size_t outCapacity = 7 + 2 * (11 + 12 + 10);
unsigned char* output = new unsigned char[outCapacity];
size_t lenOut[3];
unsigned char* strOut[3];
size_t nCompressed = duckdb_fsst_compress(
encoder, 3, lengths, ptrs,
outCapacity, output, lenOut, strOut);
// --- Decompress a Single String ---
duckdb_fsst_decoder_t decoder = duckdb_fsst_decoder(encoder);
unsigned char decompBuf[64];
size_t decompLen = duckdb_fsst_decompress(
&decoder, lenOut[0], strOut[0], sizeof(decompBuf), decompBuf);
// decompBuf now contains "hello world", decompLen == 11
// --- Serialize / Deserialize Symbol Table ---
unsigned char exportBuf[FSST_MAXHEADER];
unsigned int exportSize = duckdb_fsst_export(encoder, exportBuf);
duckdb_fsst_decoder_t importedDecoder;
unsigned int consumed = duckdb_fsst_import(&importedDecoder, exportBuf);
// --- Cleanup ---
duckdb_fsst_destroy(encoder);
delete[] output;