Implementation:Apache Paimon PyarrowFieldParser From Paimon Schema

Knowledge Sources	Apache Paimon
Domains	Data_Lake, Data_Ingestion
Last Updated	2026-02-07 00:00 GMT

Overview

Concrete tool that generates a canonical PyArrow schema from a Paimon table's field definitions, for use as the casting target during type alignment.

Description

PyarrowFieldParser.from_paimon_schema() converts a list of Paimon DataField objects to a PyArrow Schema. This canonical schema is then used with ray_dataset.map_batches() to cast incoming data columns to the expected types. The parser handles Paimon-specific types (BLOB, VARIANT) and maps them to corresponding PyArrow types.

The conversion handles the following type mappings:

Paimon integer types (TINYINT, SMALLINT, INT, BIGINT) map to PyArrow int8, int16, int32, int64.
Paimon floating point types (FLOAT, DOUBLE) map to PyArrow float32, float64.
Paimon STRING maps to PyArrow large_string.
Paimon BLOB maps to PyArrow large_binary.
Paimon VARIANT maps to a specialized PyArrow representation.
Paimon DECIMAL(p, s) maps to PyArrow decimal128(p, s).
Paimon TIMESTAMP maps to PyArrow timestamp with the appropriate time unit.

Usage

Use PyarrowFieldParser.from_paimon_schema() after loading data into a Ray Dataset and before writing to a Paimon table. The generated PyArrow schema serves as the casting target to align the Dataset's inferred schema with the Paimon table's declared schema.

Code Reference

Source Location

Repository: Apache Paimon
File: paimon-python/pypaimon/schema/data_types.py:L491-496

Signature

class PyarrowFieldParser:
    @staticmethod
    def from_paimon_schema(data_fields: List[DataField]) -> pyarrow.Schema:
        ...

Import

from pypaimon.schema.data_types import PyarrowFieldParser

I/O Contract

Inputs

Name	Type	Required	Description
data_fields	List[DataField]	Yes	Paimon schema fields obtained from table.table_schema.fields. Each DataField contains a field name, type, and nullability information.

Outputs

Name	Type	Description
schema	pyarrow.Schema	A PyArrow Schema matching the Paimon table structure. Each Paimon DataField is converted to the corresponding PyArrow field with the correct type and nullability.

Usage Examples

Basic Usage

from pypaimon.schema.data_types import PyarrowFieldParser
import pyarrow as pa

# Get target schema from Paimon table
target_schema = PyarrowFieldParser.from_paimon_schema(table.table_schema.fields)

# Define a casting function for batch-level type alignment
def cast_batch(batch: pa.Table) -> pa.Table:
    return batch.cast(target_schema)

# Apply casting across all batches in the Ray Dataset
aligned_dataset = dataset.map_batches(cast_batch, batch_format="pyarrow")

Schema Inspection Before Casting

from pypaimon.schema.data_types import PyarrowFieldParser

# Get canonical schema from Paimon table
target_schema = PyarrowFieldParser.from_paimon_schema(table.table_schema.fields)

# Compare source and target schemas
source_schema = dataset.schema()
print("Source schema:", source_schema)
print("Target schema:", target_schema)

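# Identify mismatched columns by name rather than by position
# (assumes the Ray Dataset schema exposes .names and .types)
for name, src_type in zip(source_schema.names, source_schema.types):
    if name not in target_schema.names:
        print(f"  Column '{name}': missing from target schema")
    elif src_type != target_schema.field(name).type:
        print(f"  Column '{name}': {src_type} -> {target_schema.field(name).type}")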

Related Pages

Implements Principle

Principle:Apache_Paimon_Schema_Alignment

Requires Environment

Environment:Apache_Paimon_Python_Core_Runtime
