Implementation:Duckdb Duckdb Parquet Types
| Knowledge Sources | |
|---|---|
| Domains | File_Format, Serialization, Third_Party |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
DuckDB includes auto-generated Thrift serialization code for the Apache Parquet file format, providing C++ type definitions and binary read/write methods for all Parquet metadata structures.
Description
The Parquet types module is generated by the Thrift Compiler (v0.22.0) and resides within the duckdb_parquet namespace. It defines the Parquet metadata structures as C++ classes that inherit from ::apache::thrift::TBase. Each class implements read() and write() methods that serialize it through a Thrift protocol object. The module covers the metadata structures defined by the Parquet specification, including:
- Physical types (Type): BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY
- Converted types (ConvertedType): UTF8, MAP, LIST, ENUM, DECIMAL, DATE, TIME_MILLIS, TIMESTAMP_MILLIS, and others (deprecated in favor of LogicalType)
- Logical types (LogicalType): StringType, DecimalType, TimestampType, TimeType, IntType, UUIDType, and others
- Schema elements (SchemaElement): Column definitions with type, repetition, name, and nesting information
- Page headers (PageHeader, DataPageHeader, DataPageHeaderV2, DictionaryPageHeader): Metadata for individual data pages
- Column and row group metadata (ColumnMetaData, ColumnChunk, RowGroup): Statistics, encodings, compression codecs, and offsets
- File metadata (FileMetaData): Top-level file descriptor containing schema, row groups, version, and key-value metadata
- Encryption (AesGcmV1, AesGcmCtrV1, EncryptionAlgorithm, FileCryptoMetaData): Parquet encryption support
Each metadata class uses Thrift's __isset bitfield pattern to track which optional fields have been explicitly set.
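To make the optional-field tracking concrete, here is a minimal sketch of the pattern using a simplified stand-in class. DemoElement, DemoElementIsset, and IsGroupNode are hypothetical names, not part of parquet_types.h. The generated code names the flag member __isset and the setters __set_<field>; the sketch avoids double-underscore identifiers because they are reserved in hand-written C++.

```cpp
#include <cstdint>
#include <string>

// Simplified stand-in for a Thrift-generated "isset" helper struct.
// Each optional field gets a one-bit flag that starts out false.
struct DemoElementIsset {
    DemoElementIsset() : type_length(false), num_children(false) {}
    bool type_length : 1;
    bool num_children : 1;
};

// Simplified stand-in for a generated metadata class (not the real SchemaElement).
struct DemoElement {
    std::string name;          // required field: no flag needed
    int32_t type_length = 0;   // optional field
    int32_t num_children = 0;  // optional field
    DemoElementIsset isset;    // the generated code calls this member __isset

    // Setters assign the value and mark the field as explicitly set,
    // which lets a reader tell "set to 0" apart from "never set".
    void set_type_length(int32_t val) { type_length = val; isset.type_length = true; }
    void set_num_children(int32_t val) { num_children = val; isset.num_children = true; }
};

// Reader side: consult the flag before trusting the value.
bool IsGroupNode(const DemoElement &elem) {
    return elem.isset.num_children && elem.num_children > 0;
}
```

A default-constructed DemoElement reports no optional fields as set; after set_num_children(2), IsGroupNode returns true while isset.type_length stays false.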
Usage
DuckDB uses these types during Parquet file reading and writing. When reading a Parquet file, the FileMetaData structure is deserialized from the file footer to discover the schema, row groups, and column metadata. Individual PageHeader objects are read before each data page to determine encoding, compression, and value counts. During writes, DuckDB populates these structures and serializes them via the Thrift write() method to produce spec-compliant Parquet files.
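The footer lookup mentioned above relies on the file layout given in the Parquet format specification: a plaintext-footer file ends with the serialized FileMetaData, followed by a 4-byte little-endian length of that metadata, followed by the 4-byte magic "PAR1". The sketch below locates the metadata bytes in an in-memory buffer. FooterRange and LocateFooter are hypothetical names chosen for illustration; this is not DuckDB's reader, and it does not handle encrypted footers.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical result type: where the serialized FileMetaData lives in the buffer.
struct FooterRange {
    std::size_t offset;
    std::size_t length;
};

// Locate the Thrift-serialized FileMetaData in a complete Parquet file image.
FooterRange LocateFooter(const std::vector<uint8_t> &file) {
    constexpr std::size_t kMagicSize = 4;    // "PAR1"
    constexpr std::size_t kTrailerSize = 8;  // 4-byte length + 4-byte magic
    const std::size_t n = file.size();
    // A valid file has a leading magic, the metadata, and the trailer.
    if (n < kMagicSize + kTrailerSize) {
        throw std::runtime_error("buffer too small to be a Parquet file");
    }
    if (std::memcmp(file.data() + n - kMagicSize, "PAR1", kMagicSize) != 0) {
        throw std::runtime_error("missing PAR1 magic at end of file");
    }
    // Decode the little-endian metadata length that precedes the magic.
    const uint32_t len = static_cast<uint32_t>(file[n - 8]) |
                         (static_cast<uint32_t>(file[n - 7]) << 8) |
                         (static_cast<uint32_t>(file[n - 6]) << 16) |
                         (static_cast<uint32_t>(file[n - 5]) << 24);
    if (len > n - kTrailerSize - kMagicSize) {
        throw std::runtime_error("footer length exceeds file size");
    }
    return FooterRange{n - kTrailerSize - len, len};
}
```

The returned byte range is what a Thrift protocol object would be positioned over before calling FileMetaData::read().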
Code Reference
Source Location
- Repository: Duckdb_Duckdb
- Files:
  - third_party/parquet/parquet_types.h -- Parquet type definitions and class declarations (3377 lines)
  - third_party/parquet/parquet_types.cpp -- Thrift serialization implementations (10004 lines)
Signature
namespace duckdb_parquet {
// Physical types supported by Parquet
struct Type {
enum type {
BOOLEAN = 0, INT32 = 1, INT64 = 2, INT96 = 3,
FLOAT = 4, DOUBLE = 5, BYTE_ARRAY = 6, FIXED_LEN_BYTE_ARRAY = 7
};
};
// Converted types (deprecated, superseded by LogicalType)
struct ConvertedType {
enum type {
UTF8 = 0, MAP = 1, MAP_KEY_VALUE = 2, LIST = 3, ENUM = 4,
DECIMAL = 5, DATE = 6, TIME_MILLIS = 7, TIME_MICROS = 8,
TIMESTAMP_MILLIS = 9, TIMESTAMP_MICROS = 10, /* ... */
};
};
// Schema element -- defines a column or group node
class SchemaElement : public virtual ::apache::thrift::TBase {
public:
Type::type type;
int32_t type_length;
FieldRepetitionType::type repetition_type;
std::string name;
int32_t num_children;
ConvertedType::type converted_type;
int32_t scale;
int32_t precision;
int32_t field_id;
LogicalType logicalType;
uint32_t read(::apache::thrift::protocol::TProtocol* iprot) override;
uint32_t write(::apache::thrift::protocol::TProtocol* oprot) const override;
};
// File-level metadata
class FileMetaData : public virtual ::apache::thrift::TBase {
public:
int32_t version;
duckdb::vector<SchemaElement> schema;
int64_t num_rows;
duckdb::vector<RowGroup> row_groups;
duckdb::vector<KeyValue> key_value_metadata;
std::string created_by;
duckdb::vector<ColumnOrder> column_orders;
EncryptionAlgorithm encryption_algorithm;
std::string footer_signing_key_metadata;
uint32_t read(::apache::thrift::protocol::TProtocol* iprot) override;
uint32_t write(::apache::thrift::protocol::TProtocol* oprot) const override;
};
// All Thrift-generated classes follow the same pattern:
// uint32_t read(TProtocol* iprot);
// uint32_t write(TProtocol* oprot) const;
// Enum-to-string helpers
std::ostream& operator<<(std::ostream& out, const Type::type& val);
std::string to_string(const Type::type& val);
// Safe enum cast with validation
template <class ENUM>
static typename ENUM::type SafeEnumCast(
const std::map<int, const char*> &values_to_names,
const int &ecast);
} // namespace duckdb_parquet
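The struct-wrapped enum shape and the validated cast in the signature can be illustrated with a small standalone sketch. DemoType, DemoTypeNames, CheckedEnumCast, and DemoToString are hypothetical stand-ins. The signature above does not show how SafeEnumCast reacts to an unknown value, so throwing std::out_of_range here is an assumption made for the example.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Stand-in mirroring the struct-wrapped enum shape used by the generated code.
// Wrapping the enum in a struct scopes its values as DemoType::INT32, etc.
struct DemoType {
    enum type { BOOLEAN = 0, INT32 = 1, INT64 = 2 };
};

// Value-to-name table, analogous to the values_to_names argument.
const std::map<int, const char*> &DemoTypeNames() {
    static const std::map<int, const char*> names = {
        {DemoType::BOOLEAN, "BOOLEAN"},
        {DemoType::INT32, "INT32"},
        {DemoType::INT64, "INT64"}};
    return names;
}

// Validated cast: only integers present in the table become enum values.
// Assumption: unknown values are rejected with std::out_of_range.
template <class ENUM>
typename ENUM::type CheckedEnumCast(const std::map<int, const char*> &values_to_names,
                                    int ecast) {
    if (values_to_names.find(ecast) == values_to_names.end()) {
        throw std::out_of_range("unknown enum value " + std::to_string(ecast));
    }
    return static_cast<typename ENUM::type>(ecast);
}

// Enum-to-string helper that falls back to the numeric value for unknown input.
std::string DemoToString(DemoType::type val) {
    const auto it = DemoTypeNames().find(val);
    if (it != DemoTypeNames().end()) {
        return std::string(it->second);
    }
    return std::to_string(static_cast<int>(val));
}
```

Validating before the cast matters when the integer comes from an untrusted file: a plain static_cast would accept any value and produce an enum outside the declared set.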
Import
#include "parquet_types.h"
// Thrift dependencies (vendored within DuckDB):
#include <thrift/Thrift.h>
#include <thrift/TBase.h>
#include <thrift/protocol/TProtocol.h>
#include <thrift/transport/TTransport.h>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| iprot | ::apache::thrift::protocol::TProtocol* | Yes | Thrift protocol reader providing the binary input stream for deserialization |
| oprot | ::apache::thrift::protocol::TProtocol* | Yes | Thrift protocol writer accepting the binary output stream for serialization (used by write()) |
Outputs
| Name | Type | Description |
|---|---|---|
| bytes_read | uint32_t | Number of bytes consumed from the input protocol during read() |
| bytes_written | uint32_t | Number of bytes written to the output protocol during write() |
| populated fields | struct members | After read(), the struct's data members and __isset bitfields reflect the deserialized content |
Usage Examples
#include <iostream>
#include "parquet_types.h"
// iprot and oprot are assumed to be ::apache::thrift::protocol::TProtocol*
// objects set up by the caller (input and output side, respectively).
// Reading FileMetaData from a Thrift protocol
duckdb_parquet::FileMetaData file_metadata;
file_metadata.read(iprot); // deserialize from binary stream
// Inspecting schema elements; optional fields are checked via __isset first
for (const auto &elem : file_metadata.schema) {
std::cout << "Column: " << elem.name;
if (elem.__isset.type) {
std::cout << " Type: " << duckdb_parquet::to_string(elem.type);
}
if (elem.__isset.num_children) {
std::cout << " (group with " << elem.num_children << " children)";
}
std::cout << '\n';
}
// Accessing row group information
for (const auto &rg : file_metadata.row_groups) {
std::cout << "Row group with " << rg.num_rows << " rows" << '\n';
}
// Writing metadata to a Thrift protocol; the __set_ methods also mark
// the corresponding __isset flags
duckdb_parquet::SchemaElement schema_elem;
schema_elem.__set_name("my_column");
schema_elem.__set_type(duckdb_parquet::Type::INT64);
schema_elem.__set_repetition_type(duckdb_parquet::FieldRepetitionType::REQUIRED);
schema_elem.write(oprot); // serialize to binary stream
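The schema inspected above is stored as a flat list. Per the Parquet format specification, that list is a depth-first traversal of the schema tree, with num_children giving the number of direct children of each group node. The sketch below rebuilds dotted leaf-column paths from such a list. DemoSchemaNode, CollectLeafPaths, and LeafPaths are hypothetical names, and num_children == 0 stands in for a leaf whose num_children is unset in the real SchemaElement.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Simplified stand-in for SchemaElement: only the fields needed for nesting.
struct DemoSchemaNode {
    std::string name;
    int32_t num_children;  // 0 models a leaf column (field unset in the real struct)
};

// Consume one node (and, recursively, its subtree) starting at idx,
// appending the dotted path of every leaf column to out.
void CollectLeafPaths(const std::vector<DemoSchemaNode> &schema, std::size_t &idx,
                      const std::string &prefix, std::vector<std::string> &out) {
    const DemoSchemaNode &node = schema.at(idx++);  // at() throws on a truncated list
    const std::string path = prefix.empty() ? node.name : prefix + "." + node.name;
    if (node.num_children == 0) {
        out.push_back(path);
        return;
    }
    for (int32_t i = 0; i < node.num_children; i++) {
        CollectLeafPaths(schema, idx, path, out);
    }
}

// Walk the flattened schema; element 0 is the root and is left out of the paths.
std::vector<std::string> LeafPaths(const std::vector<DemoSchemaNode> &schema) {
    std::vector<std::string> out;
    if (schema.empty()) {
        return out;
    }
    std::size_t idx = 1;
    for (int32_t i = 0; i < schema[0].num_children; i++) {
        CollectLeafPaths(schema, idx, "", out);
    }
    return out;
}
```

For a root with children id and address, where address is a group containing street and city, LeafPaths yields id, address.street, and address.city.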