Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Duckdb Duckdb FSST Compression

From Leeroopedia


Knowledge Sources
Domains Compression, Third_Party
Last Updated 2026-02-07 12:00 GMT

Overview

FSST (Fast Static Symbol Table) is a string compression scheme designed for columnar databases, integrated into DuckDB for high-speed, random-access compression of string/text data.

Description

FSST is a compression algorithm specifically designed for string data in analytical databases. Unlike block-based algorithms such as LZ4 or Zstd, FSST compresses individual strings independently, enabling random access to compressed data without decompressing surrounding data in a block. It achieves this by learning a symbol table that maps 1-8 byte sequences ("symbols") onto single-byte "codes" (0-254), with byte 255 (FSST_ESC) reserved as an escape prefix for uncompressed bytes.

The implementation consists of three source files:

  • fsst.h (246 lines) -- public C API defining the encoder/decoder types and all external functions. The decoder is a lightweight struct (duckdb_fsst_decoder_t) containing a 255-entry symbol table and length array, suitable for read-only sharing across threads. The encoder is an opaque type (~900 KB) wrapping the C++ SymbolTable.
  • libfsst.hpp (466 lines) -- C++ internals including the Symbol struct (8-byte value + packed code/length), SymbolTable class (256 entries plus hash tables for encoding), Counters for frequency analysis, and the buildSymbolTable() training function. Symbols store their byte sequence as a u64 in little-endian format for fast comparison. A symbol's code and length are packed into a single u64 field (icl) for single-load comparison.
  • libfsst.cpp (517 lines) -- the training algorithm implementation. The buildSymbolTable() function iteratively builds the optimal symbol table by: (1) determining the least-frequent byte as the terminator, (2) iterating up to 5 rounds with sample sizes from 1/128 to full input, (3) counting symbol pair frequencies, (4) selecting the best new symbols by gain (frequency x length - 1), and (5) re-encoding the sample to measure total gain. A hash-based QSymbol set deduplicates candidate symbols.

Key properties:

  • Equal strings produce equal compressed forms -- preserves equality semantics
  • Decompression is inlined for speed, using a 4-byte-at-a-time fast path on little-endian platforms
  • Encoder is not thread-safe -- use duckdb_fsst_duplicate() for multi-threaded encoding
  • Decoder is read-only and safely shareable across threads
  • Serialization via duckdb_fsst_export()/duckdb_fsst_import() for compact storage (max FSST_MAXHEADER = 2066 bytes)

Usage

DuckDB uses FSST as a built-in compression method for string columns in its storage layer. When string data is stored, DuckDB may choose FSST compression based on an analysis of the column data. The symbol table is trained on a sample of the column's string values, stored in the column metadata, and used to compress/decompress individual string values with random access.

Code Reference

Source Location

Signature

// --- Opaque Types ---
typedef void* duckdb_fsst_encoder_t;  // wraps ~900KB C++ SymbolTable object

typedef struct {
    unsigned long long version;
    unsigned char zeroTerminated;
    unsigned char len[255];           // byte-length of symbol x (1 <= len[x] <= 8)
    unsigned long long symbol[255];   // little-endian byte sequence for code x
} duckdb_fsst_decoder_t;

// --- Symbol Table Creation ---
// Calibrate a symbol table from a batch of strings (best with >= 16KB of data).
duckdb_fsst_encoder_t* duckdb_fsst_create(
    size_t n,                    // number of input strings
    size_t lenIn[],              // byte-lengths of each input string
    unsigned char *strIn[],      // start pointers of each input string
    int zeroTerminated           // whether strings are zero-terminated
);

// Duplicate an encoder for multi-threaded encoding.
duckdb_fsst_encoder_t* duckdb_fsst_duplicate(
    duckdb_fsst_encoder_t *encoder
);

// Deallocate an encoder.
void duckdb_fsst_destroy(duckdb_fsst_encoder_t*);

// --- Serialization ---
// Export symbol table to a compact byte buffer.
unsigned int duckdb_fsst_export(
    duckdb_fsst_encoder_t *encoder,
    unsigned char *buf              // output buffer (max FSST_MAXHEADER bytes)
);

// Import symbol table from a serialized byte buffer.
unsigned int duckdb_fsst_import(
    duckdb_fsst_decoder_t *decoder, // overwritten with imported table
    unsigned char *buf              // serialized symbol table bytes
);

// Get decoder from encoder.
duckdb_fsst_decoder_t duckdb_fsst_decoder(
    duckdb_fsst_encoder_t *encoder
);

// --- Compression ---
// Compress a batch of strings. Output buffer must be >= 7 + 2*inputlength.
size_t duckdb_fsst_compress(
    duckdb_fsst_encoder_t *encoder,
    size_t nstrings,             // number of strings to compress
    size_t lenIn[],              // byte-lengths of inputs
    unsigned char *strIn[],      // input string pointers
    size_t outsize,              // output buffer capacity
    unsigned char *output,       // output buffer
    size_t lenOut[],             // compressed lengths (output)
    unsigned char *strOut[]      // compressed string pointers (output)
);

// --- Decompression (inlined for speed) ---
inline size_t duckdb_fsst_decompress(
    duckdb_fsst_decoder_t *decoder,
    size_t lenIn,                // compressed string length
    const unsigned char *strIn,  // compressed string
    size_t size,                 // output buffer capacity
    unsigned char *output        // output buffer
);

Import

#include "fsst.h"          // C API
#include "libfsst.hpp"     // C++ internals (for DuckDB integration code)

I/O Contract

Inputs

Name Type Required Description
n / nstrings size_t Yes Number of strings in the input batch
lenIn size_t[] Yes Array of byte-lengths for each input string
strIn unsigned char*[] Yes Array of pointers to input string start positions
zeroTerminated int Yes (for create) Whether input strings are zero-terminated; if so, compressed output is also zero-terminated
encoder duckdb_fsst_encoder_t* Yes (for compress) Trained symbol table encoder
decoder duckdb_fsst_decoder_t* Yes (for decompress) Symbol table decoder (from export/import or from encoder)
outsize / size size_t Yes Output buffer capacity; must be >= 7 + 2 * total_input_length for compression
output unsigned char* Yes Output buffer for compressed or decompressed data

Outputs

Name Type Description
(return) from duckdb_fsst_compress size_t Number of strings that fit in the output buffer (<= nstrings)
(return) from duckdb_fsst_decompress size_t Byte size of the decompressed string; if > size, output is truncated
lenOut size_t[] Byte-lengths of each compressed string
strOut unsigned char*[] Pointers into output for each compressed string
(return) from duckdb_fsst_export unsigned int Number of bytes written to the serialization buffer
(return) from duckdb_fsst_import unsigned int Number of bytes consumed from the serialization buffer (0 on failure)

Usage Examples

#include "fsst.h"
#include <cstring>

// --- Train a Symbol Table ---
const char* strings[] = {"hello world", "hello duckdb", "hello fsst"};
size_t lengths[] = {11, 12, 10};
unsigned char* ptrs[] = {
    (unsigned char*)strings[0],
    (unsigned char*)strings[1],
    (unsigned char*)strings[2]
};

duckdb_fsst_encoder_t* encoder = duckdb_fsst_create(
    3, lengths, ptrs, 0 /* not zero-terminated */);

// --- Compress a Batch ---
size_t outCapacity = 7 + 2 * (11 + 12 + 10);
unsigned char* output = new unsigned char[outCapacity];
size_t lenOut[3];
unsigned char* strOut[3];

size_t nCompressed = duckdb_fsst_compress(
    encoder, 3, lengths, ptrs,
    outCapacity, output, lenOut, strOut);

// --- Decompress a Single String ---
duckdb_fsst_decoder_t decoder = duckdb_fsst_decoder(encoder);
unsigned char decompBuf[64];
size_t decompLen = duckdb_fsst_decompress(
    &decoder, lenOut[0], strOut[0], sizeof(decompBuf), decompBuf);
// decompBuf now contains "hello world", decompLen == 11

// --- Serialize / Deserialize Symbol Table ---
unsigned char exportBuf[FSST_MAXHEADER];
unsigned int exportSize = duckdb_fsst_export(encoder, exportBuf);

duckdb_fsst_decoder_t importedDecoder;
unsigned int consumed = duckdb_fsst_import(&importedDecoder, exportBuf);

// --- Cleanup ---
duckdb_fsst_destroy(encoder);
delete[] output;

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment