Implementation:Duckdb Duckdb Parquet Types

From Leeroopedia
Revision as of 14:51, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Duckdb_Duckdb_Parquet_Types.md)


Knowledge Sources
Domains File_Format, Serialization, Third_Party
Last Updated 2026-02-07 12:00 GMT

Overview

DuckDB includes auto-generated Thrift serialization code for the Apache Parquet file format, providing C++ type definitions and binary read/write methods for all Parquet metadata structures.

Description

The Parquet types module is auto-generated by the Thrift Compiler (v0.22.0) and resides within the duckdb_parquet namespace. It defines the Parquet metadata structures as C++ classes that inherit from ::apache::thrift::TBase. Each class implements read() and write() methods that deserialize from and serialize to a Thrift protocol object (TProtocol). The module covers the metadata structures of the Parquet format, including:

  • Physical types (Type): BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY
  • Converted types (ConvertedType): UTF8, MAP, LIST, ENUM, DECIMAL, DATE, TIME_MILLIS, TIMESTAMP_MILLIS, and others (deprecated in favor of LogicalType)
  • Logical types (LogicalType): StringType, DecimalType, TimestampType, TimeType, IntType, UUIDType, and others
  • Schema elements (SchemaElement): Column definitions with type, repetition, name, and nesting information
  • Page headers (PageHeader, DataPageHeader, DataPageHeaderV2, DictionaryPageHeader): Metadata for individual data pages
  • Column and row group metadata (ColumnMetaData, ColumnChunk, RowGroup): Statistics, encodings, compression codecs, and offsets
  • File metadata (FileMetaData): Top-level file descriptor containing schema, row groups, version, and key-value metadata
  • Encryption (AesGcmV1, AesGcmCtrV1, EncryptionAlgorithm, FileCryptoMetaData): Parquet encryption support

Each metadata class uses Thrift's __isset bitfield pattern to track which optional fields have been explicitly set.
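The presence-tracking idea can be illustrated with a minimal, self-contained sketch. DemoElement, DemoIsSet, and CountSetOptionalFields are hypothetical names invented for this example and are not part of parquet_types.h; the generated classes name the flag member __isset and the setters __set_<field>(), which this sketch avoids because identifiers containing double underscores are reserved in hand-written C++.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for the per-class flag struct that Thrift generates.
// One single-bit flag per optional field.
struct DemoIsSet {
    DemoIsSet() : type_length(false), field_id(false) {}
    bool type_length : 1;
    bool field_id : 1;
};

// Hypothetical stand-in for a generated metadata class (not DuckDB's SchemaElement).
class DemoElement {
public:
    std::string name;         // required field: no presence flag
    int32_t type_length = 0;  // optional field
    int32_t field_id = 0;     // optional field
    DemoIsSet isset;          // the generated code calls this member __isset

    // Setters store the value and raise the matching flag, so a field
    // explicitly set to 0 can be told apart from a field never set.
    void set_type_length(int32_t v) { type_length = v; isset.type_length = true; }
    void set_field_id(int32_t v) { field_id = v; isset.field_id = true; }
};

// Counts how many optional fields were explicitly set.
int CountSetOptionalFields(const DemoElement &e) {
    return (e.isset.type_length ? 1 : 0) + (e.isset.field_id ? 1 : 0);
}
```

The point of the pattern is that a default-valued member carries no information on its own; readers should consult the flag before trusting an optional field.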

Usage

DuckDB uses these types during Parquet file reading and writing. When reading a Parquet file, the FileMetaData structure is deserialized from the file footer to discover the schema, row groups, and column metadata. Individual PageHeader objects are read before each data page to determine encoding, compression, and value counts. During writes, DuckDB populates these structures and serializes them via the Thrift write() method to produce spec-compliant Parquet files.
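As a rough illustration of the footer step described above, the sketch below locates the serialized FileMetaData bytes in an in-memory file image. It relies on the Parquet file layout (a leading "PAR1" magic, and a trailer made of a 4-byte little-endian metadata length followed by "PAR1"). LocateFooter and FooterLocation are hypothetical names for this example, not DuckDB functions, and files with encrypted footers are not handled.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Where the Thrift-encoded FileMetaData sits inside the file buffer.
struct FooterLocation {
    bool valid = false;
    size_t metadata_offset = 0;
    uint32_t metadata_length = 0;
};

// Hypothetical helper: validates the magic bytes and decodes the footer length.
FooterLocation LocateFooter(const std::vector<uint8_t> &file) {
    FooterLocation loc;
    const size_t kMagic = 4;    // "PAR1"
    const size_t kTrailer = 8;  // 4-byte length + 4-byte magic
    if (file.size() < kMagic + kTrailer) {
        return loc;  // too small to be a Parquet file
    }
    const uint8_t *end = file.data() + file.size();
    if (std::memcmp(file.data(), "PAR1", kMagic) != 0 ||
        std::memcmp(end - kMagic, "PAR1", kMagic) != 0) {
        return loc;  // missing magic at the start or the end
    }
    // The metadata length is stored little-endian just before the trailing magic.
    const uint8_t *p = end - kTrailer;
    uint32_t len = uint32_t(p[0]) | (uint32_t(p[1]) << 8) |
                   (uint32_t(p[2]) << 16) | (uint32_t(p[3]) << 24);
    // The metadata must fit between the leading magic and the trailer.
    if (len == 0 || len > file.size() - kMagic - kTrailer) {
        return loc;
    }
    loc.valid = true;
    loc.metadata_offset = file.size() - kTrailer - len;
    loc.metadata_length = len;
    return loc;
}
```

The byte range found this way is what would then be handed to a Thrift protocol reader so that FileMetaData::read() can deserialize it.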

Code Reference

Source Location

Signature

namespace duckdb_parquet {

// Physical types supported by Parquet
struct Type {
    enum type {
        BOOLEAN = 0, INT32 = 1, INT64 = 2, INT96 = 3,
        FLOAT = 4, DOUBLE = 5, BYTE_ARRAY = 6, FIXED_LEN_BYTE_ARRAY = 7
    };
};

// Converted types (deprecated, superseded by LogicalType)
struct ConvertedType {
    enum type {
        UTF8 = 0, MAP = 1, MAP_KEY_VALUE = 2, LIST = 3, ENUM = 4,
        DECIMAL = 5, DATE = 6, TIME_MILLIS = 7, TIME_MICROS = 8,
        TIMESTAMP_MILLIS = 9, TIMESTAMP_MICROS = 10, /* ... */
    };
};

// Schema element -- defines a column or group node
class SchemaElement : public virtual ::apache::thrift::TBase {
public:
    Type::type type;
    int32_t type_length;
    FieldRepetitionType::type repetition_type;
    std::string name;
    int32_t num_children;
    ConvertedType::type converted_type;
    int32_t scale;
    int32_t precision;
    int32_t field_id;
    LogicalType logicalType;

    uint32_t read(::apache::thrift::protocol::TProtocol* iprot) override;
    uint32_t write(::apache::thrift::protocol::TProtocol* oprot) const override;
};

// File-level metadata
class FileMetaData : public virtual ::apache::thrift::TBase {
public:
    int32_t version;
    duckdb::vector<SchemaElement> schema;
    int64_t num_rows;
    duckdb::vector<RowGroup> row_groups;
    duckdb::vector<KeyValue> key_value_metadata;
    std::string created_by;
    duckdb::vector<ColumnOrder> column_orders;
    EncryptionAlgorithm encryption_algorithm;
    std::string footer_signing_key_metadata;

    uint32_t read(::apache::thrift::protocol::TProtocol* iprot) override;
    uint32_t write(::apache::thrift::protocol::TProtocol* oprot) const override;
};

// All Thrift-generated classes follow the same pattern:
//   uint32_t read(TProtocol* iprot);
//   uint32_t write(TProtocol* oprot) const;

// Enum-to-string helpers
std::ostream& operator<<(std::ostream& out, const Type::type& val);
std::string to_string(const Type::type& val);

// Safe enum cast with validation
template <class ENUM>
static typename ENUM::type SafeEnumCast(
    const std::map<int, const char*> &values_to_names,
    const int &ecast);

} // namespace duckdb_parquet
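To show what an enum-to-string helper of this shape might look like, here is a small stand-alone sketch built on a value-to-name map. DemoType and DemoTypeToString are hypothetical names; the enum values mirror the Type listing above, and the numeric fallback for unknown values is this sketch's own choice, so the generated source should be consulted for the actual behavior.

```cpp
#include <map>
#include <string>

// Mirrors the values of duckdb_parquet::Type::type shown in the signature above.
struct DemoType {
    enum type {
        BOOLEAN = 0, INT32 = 1, INT64 = 2, INT96 = 3,
        FLOAT = 4, DOUBLE = 5, BYTE_ARRAY = 6, FIXED_LEN_BYTE_ARRAY = 7
    };
};

// Hypothetical helper: returns the enumerator name, or the raw number
// as text when the value is not a known enumerator (assumed fallback).
std::string DemoTypeToString(int raw) {
    static const std::map<int, const char *> names = {
        {DemoType::BOOLEAN, "BOOLEAN"},
        {DemoType::INT32, "INT32"},
        {DemoType::INT64, "INT64"},
        {DemoType::INT96, "INT96"},
        {DemoType::FLOAT, "FLOAT"},
        {DemoType::DOUBLE, "DOUBLE"},
        {DemoType::BYTE_ARRAY, "BYTE_ARRAY"},
        {DemoType::FIXED_LEN_BYTE_ARRAY, "FIXED_LEN_BYTE_ARRAY"},
    };
    auto it = names.find(raw);
    if (it == names.end()) {
        return std::to_string(raw);
    }
    return it->second;
}
```

Taking an int rather than the enum type lets the helper accept values read from untrusted files, where the stored number may not correspond to any enumerator.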

Import

#include "parquet_types.h"

// Thrift dependencies (vendored within DuckDB):
#include <thrift/Thrift.h>
#include <thrift/TBase.h>
#include <thrift/protocol/TProtocol.h>
#include <thrift/transport/TTransport.h>

I/O Contract

Inputs

Name  | Type                                   | Required          | Description
iprot | ::apache::thrift::protocol::TProtocol* | Yes (for read())  | Thrift protocol reader providing the binary input stream for deserialization (used by read())
oprot | ::apache::thrift::protocol::TProtocol* | Yes (for write()) | Thrift protocol writer accepting the binary output stream for serialization (used by write())

Outputs

Name             | Type           | Description
bytes_read       | uint32_t       | Number of bytes consumed from the input protocol during read()
bytes_written    | uint32_t       | Number of bytes written to the output protocol during write()
populated fields | struct members | After read(), the struct's data members and __isset bitfields reflect the deserialized content
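The byte-count contract can be demonstrated without Thrift itself. The sketch below encodes a 32-bit integer as a zigzag varint, which is how Thrift's compact protocol (the protocol Parquet metadata is serialized with) represents i32 values, and returns the number of bytes written or consumed in the same spirit as write() and read(). WriteI32 and ReadI32 are hypothetical free functions for illustration, not Thrift or DuckDB APIs, and returning 0 on malformed input is this sketch's convention.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// ZigZag maps signed values to unsigned so small magnitudes stay small.
static uint32_t ZigZagEncode(int32_t v) {
    return (static_cast<uint32_t>(v) << 1) ^ static_cast<uint32_t>(v >> 31);
}
static int32_t ZigZagDecode(uint32_t u) {
    return static_cast<int32_t>(u >> 1) ^ -static_cast<int32_t>(u & 1);
}

// Appends v to out as a zigzag varint; returns the number of bytes written.
uint32_t WriteI32(std::vector<uint8_t> &out, int32_t v) {
    uint32_t u = ZigZagEncode(v);
    uint32_t written = 0;
    while (u >= 0x80) {
        out.push_back(static_cast<uint8_t>((u & 0x7F) | 0x80));  // continuation bit set
        u >>= 7;
        ++written;
    }
    out.push_back(static_cast<uint8_t>(u));  // final byte, continuation bit clear
    return written + 1;
}

// Decodes one value starting at pos; returns the number of bytes consumed,
// or 0 if the input is truncated or longer than five bytes.
uint32_t ReadI32(const std::vector<uint8_t> &in, size_t pos, int32_t &v) {
    uint32_t u = 0;
    uint32_t consumed = 0;
    for (int shift = 0; shift < 35; shift += 7) {
        if (pos + consumed >= in.size()) {
            return 0;
        }
        uint8_t b = in[pos + consumed++];
        u |= static_cast<uint32_t>(b & 0x7F) << shift;
        if ((b & 0x80) == 0) {
            v = ZigZagDecode(u);
            return consumed;
        }
    }
    return 0;
}
```

Because each call reports how many bytes it used, a caller can advance its position in the stream by summing the return values, which is the same bookkeeping the uint32_t results of read() and write() make possible.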

Usage Examples

#include <iostream>

#include "parquet_types.h"

// Sketch only: InspectAndWrite is a hypothetical wrapper. `iprot` and `oprot`
// are assumed to be already-constructed Thrift protocol objects (a reader and
// a writer); their setup is not shown here.
void InspectAndWrite(::apache::thrift::protocol::TProtocol *iprot,
                     ::apache::thrift::protocol::TProtocol *oprot) {
    // Reading FileMetaData from a Thrift protocol
    duckdb_parquet::FileMetaData file_metadata;
    file_metadata.read(iprot);  // deserialize from binary stream

    // Inspecting schema elements; optional fields are checked via __isset
    for (const auto &elem : file_metadata.schema) {
        std::cout << "Column: " << elem.name;
        if (elem.__isset.type) {
            std::cout << " Type: " << duckdb_parquet::to_string(elem.type);
        }
        if (elem.__isset.num_children) {
            std::cout << " (group with " << elem.num_children << " children)";
        }
        std::cout << '\n';
    }

    // Accessing row group information
    for (const auto &rg : file_metadata.row_groups) {
        std::cout << "Row group with " << rg.num_rows << " rows" << '\n';
    }

    // Writing metadata to a Thrift protocol; the __set_ helpers also mark
    // the corresponding optional fields as present
    duckdb_parquet::SchemaElement schema_elem;
    schema_elem.__set_name("my_column");
    schema_elem.__set_type(duckdb_parquet::Type::INT64);
    schema_elem.__set_repetition_type(duckdb_parquet::FieldRepetitionType::REQUIRED);
    schema_elem.write(oprot);  // serialize to binary stream
}
