Implementation:Apache Paimon PyarrowFieldParser From Paimon Schema
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Data_Ingestion |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool that generates a canonical PyArrow schema from a Paimon table's field definitions, for use as the target of type alignment.
Description
PyarrowFieldParser.from_paimon_schema() converts a list of Paimon DataField objects to a PyArrow Schema. This canonical schema is then used with ray_dataset.map_batches() to cast incoming data columns to the expected types. The parser handles Paimon-specific types (BLOB, VARIANT) and maps them to corresponding PyArrow types.
The conversion handles the following type mappings:
- Paimon integer types (TINYINT, SMALLINT, INT, BIGINT) map to PyArrow int8, int16, int32, int64.
- Paimon floating point types (FLOAT, DOUBLE) map to PyArrow float32, float64.
- Paimon STRING maps to PyArrow large_string.
- Paimon BLOB maps to PyArrow large_binary.
- Paimon VARIANT maps to a specialized PyArrow representation.
- Paimon DECIMAL(p, s) maps to PyArrow decimal128(p, s).
- Paimon TIMESTAMP maps to PyArrow timestamp with the appropriate time unit.
Usage
Use PyarrowFieldParser.from_paimon_schema() after loading data into a Ray Dataset and before writing to a Paimon table. The generated PyArrow schema serves as the casting target to align the Dataset's inferred schema with the Paimon table's declared schema.
Code Reference
Source Location
- Repository: Apache Paimon
- File: paimon-python/pypaimon/schema/data_types.py:L491-496
Signature
class PyarrowFieldParser:
    @staticmethod
    def from_paimon_schema(data_fields: List[DataField]) -> pyarrow.Schema:
Import
from pypaimon.schema.data_types import PyarrowFieldParser
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_fields | List[DataField] | Yes | Paimon schema fields obtained from table.table_schema.fields. Each DataField contains a field name, type, and nullability information. |
Outputs
| Name | Type | Description |
|---|---|---|
| schema | pyarrow.Schema | A PyArrow Schema matching the Paimon table structure. Each Paimon DataField is converted to the corresponding PyArrow field with the correct type and nullability. |
Usage Examples
Basic Usage
from pypaimon.schema.data_types import PyarrowFieldParser
import pyarrow as pa
# Get target schema from Paimon table
target_schema = PyarrowFieldParser.from_paimon_schema(table.table_schema.fields)
# Define a casting function for batch-level type alignment
def cast_batch(batch: pa.Table) -> pa.Table:
    return batch.cast(target_schema)
# Apply casting across all batches in the Ray Dataset
aligned_dataset = dataset.map_batches(cast_batch, batch_format="pyarrow")
Schema Inspection Before Casting
from pypaimon.schema.data_types import PyarrowFieldParser
# Get canonical schema from Paimon table
target_schema = PyarrowFieldParser.from_paimon_schema(table.table_schema.fields)
# Compare source and target schemas
source_schema = dataset.schema()
print("Source schema:", source_schema)
print("Target schema:", target_schema)
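# Identify mismatched columns, matching by name rather than position
# (Dataset.schema() returns a ray.data.Schema, compared here via .names and .types)
for name, src_type in zip(source_schema.names, source_schema.types):
    if name in target_schema.names and src_type != target_schema.field(name).type:
        print(f"  Column '{name}': {src_type} -> {target_schema.field(name).type}")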