Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Duckdb Duckdb Parquet Serialization

From Leeroopedia


Knowledge Sources
Domains Columnar_Storage, Serialization, Data_Format
Last Updated 2026-02-07 12:00 GMT

Overview

A columnar file format type system and metadata serialization scheme based on Apache Thrift, defining how structured data schemas, statistics, and encoding metadata are stored within Parquet files.

Description

Apache Parquet is a columnar storage file format designed for efficient analytical processing of large datasets. The Parquet format uses Apache Thrift as its serialization framework for encoding file metadata, schema information, column statistics, and other structural elements. Understanding the Parquet type system and serialization is essential for reading and writing Parquet files correctly.

The Parquet type system defines a set of primitive physical types (BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY) and logical types that layer additional semantics on top (STRING, DATE, TIMESTAMP, DECIMAL, UUID, etc.). The schema is represented as a tree of schema elements, where leaf nodes represent physical columns and intermediate nodes represent nested structures (groups). This schema representation is directly derived from Google's Dremel paper and supports complex nested data through repetition and definition levels.

Thrift serialization provides a compact binary encoding for the metadata structures. Each Parquet file contains a file-level metadata footer that describes the schema, row groups, column chunks, encoding types, compression codecs, and column statistics (min/max values, null counts). This metadata is serialized using Thrift's compact binary protocol, which uses variable-length integer encoding and field delta encoding for space efficiency.

Usage

Parquet serialization is used in DuckDB whenever reading from or writing to Parquet files. When reading, DuckDB deserializes the Thrift-encoded metadata footer to understand the file schema, locate column chunks, and use column statistics for predicate pushdown. When writing, DuckDB serializes schema information and column statistics into the Thrift format for the file footer. The type mapping between DuckDB's internal types and Parquet's type system is a critical component of this process.

Theoretical Basis

Parquet File Structure:

Parquet File Layout:
  4 bytes: magic number "PAR1"
  Row Group 1:
    Column Chunk 1: [page header + data] [page header + data] ...
    Column Chunk 2: [page header + data] [page header + data] ...
    ...
  Row Group 2:
    ...
  File Metadata (Thrift-encoded):
    version, schema, row_groups[], key_value_metadata[]
  4 bytes: length of file metadata
  4 bytes: magic number "PAR1"

Thrift Compact Protocol Encoding:

// Field encoding: [field_delta + type (1 byte)] [value]
// Types: boolean, i8, i16, i32, i64, double, binary, struct, list, map

// Variable-length integer (zigzag + varint):
function encode_varint(n):
    zigzag = (n << 1) XOR (n >> 63)   // zigzag encoding
    while zigzag > 0x7F:
        emit(zigzag & 0x7F | 0x80)    // 7 bits + continuation
        zigzag >>= 7
    emit(zigzag & 0x7F)

// Field delta encoding:
// Instead of absolute field IDs, encode delta from previous
// field_header = (delta << 4) | type  (if delta fits in 4 bits)

Schema Representation (Dremel Model):

// Schema tree: groups contain fields, fields have repetition
// Repetition: REQUIRED, OPTIONAL, REPEATED
// Definition level: how many optional fields are defined
// Repetition level: which repeated field has a new value

// Example schema:
message Document {
    required int64 doc_id;
    optional group name {
        repeated group language {
            required string code;
            optional string country;
        }
    }
}

// Flattened columns with levels:
// doc_id:          [def_level=0, rep_level=0]
// name.language.code:    [def_level=2, rep_level=0..1]
// name.language.country: [def_level=2..3, rep_level=0..1]

Column Statistics:

// Per-column-chunk and per-page statistics
Statistics {
    min_value: bytes     // minimum value in column/page
    max_value: bytes     // maximum value in column/page
    null_count: int64    // number of nulls
    distinct_count: int64 // approximate distinct values
    is_min_value_exact: bool
    is_max_value_exact: bool
}
// Used for predicate pushdown: skip row groups/pages
// where filter condition cannot match the value range

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment