Implementation:Apache Paimon PyarrowFieldParser From Paimon Schema

From Leeroopedia


Knowledge Sources
Domains Data_Lake, Data_Ingestion
Last Updated 2026-02-07 00:00 GMT

Overview

Concrete tool that generates a canonical PyArrow schema from a Paimon table's field definitions, so that incoming data can be cast to the types the table declares.

Description

PyarrowFieldParser.from_paimon_schema() converts a list of Paimon DataField objects to a PyArrow Schema. This canonical schema is then used with ray_dataset.map_batches() to cast incoming data columns to the expected types. The parser handles Paimon-specific types (BLOB, VARIANT) and maps them to corresponding PyArrow types.

The conversion handles the following type mappings:

  • Paimon integer types (TINYINT, SMALLINT, INT, BIGINT) map to PyArrow int8, int16, int32, int64.
  • Paimon floating point types (FLOAT, DOUBLE) map to PyArrow float32, float64.
  • Paimon STRING maps to PyArrow large_string.
  • Paimon BLOB maps to PyArrow large_binary.
  • Paimon VARIANT maps to a specialized PyArrow representation.
  • Paimon DECIMAL(p, s) maps to PyArrow decimal128(p, s).
  • Paimon TIMESTAMP maps to PyArrow timestamp with the appropriate time unit.
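The fixed mappings listed above can be sketched as a small lookup table. This is an illustrative, standard-library-only sketch and not pypaimon's implementation: `SIMPLE_TYPES` and `describe_arrow_type` are hypothetical names, the function returns the name of the PyArrow type as a string rather than a PyArrow object, and VARIANT and TIMESTAMP are left out because this page does not state their exact targets.

```python
import re

# Paimon type name -> PyArrow type name, copied from the list above.
SIMPLE_TYPES = {
    "TINYINT": "int8",
    "SMALLINT": "int16",
    "INT": "int32",
    "BIGINT": "int64",
    "FLOAT": "float32",
    "DOUBLE": "float64",
    "STRING": "large_string",
    "BLOB": "large_binary",
}

# DECIMAL carries its precision and scale through to decimal128(p, s).
DECIMAL_PATTERN = re.compile(r"DECIMAL\((\d+),\s*(\d+)\)")


def describe_arrow_type(paimon_type: str) -> str:
    """Return the PyArrow type name for a Paimon type name (hypothetical helper)."""
    name = paimon_type.strip().upper()
    if name in SIMPLE_TYPES:
        return SIMPLE_TYPES[name]
    match = DECIMAL_PATTERN.fullmatch(name)
    if match:
        precision, scale = int(match.group(1)), int(match.group(2))
        return f"decimal128({precision}, {scale})"
    # VARIANT, TIMESTAMP, and nested types are outside this sketch.
    raise ValueError(f"type not covered by this sketch: {paimon_type}")
```

Parameterized types such as DECIMAL cannot live in a plain dictionary because precision and scale vary per column, which is why they are parsed separately.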

Usage

Use PyarrowFieldParser.from_paimon_schema() after loading data into a Ray Dataset and before writing to a Paimon table. The generated PyArrow schema serves as the casting target to align the Dataset's inferred schema with the Paimon table's declared schema.

Code Reference

Source Location

  • Repository: Apache Paimon
  • File: paimon-python/pypaimon/schema/data_types.py:L491-496

Signature

class PyarrowFieldParser:
    @staticmethod
    def from_paimon_schema(data_fields: List[DataField]) -> pyarrow.Schema:
        ...  # body omitted; see the source location above

Import

from pypaimon.schema.data_types import PyarrowFieldParser

I/O Contract

Inputs

Name | Type | Required | Description
data_fields | List[DataField] | Yes | Paimon schema fields obtained from table.table_schema.fields. Each DataField contains a field name, type, and nullability information.

Outputs

Name | Type | Description
schema | pyarrow.Schema | A PyArrow Schema matching the Paimon table structure. Each Paimon DataField is converted to the corresponding PyArrow field with the correct type and nullability.

Usage Examples

Basic Usage

from pypaimon.schema.data_types import PyarrowFieldParser
import pyarrow as pa

# Get target schema from Paimon table
target_schema = PyarrowFieldParser.from_paimon_schema(table.table_schema.fields)

# Define a casting function for batch-level type alignment
def cast_batch(batch: pa.Table) -> pa.Table:
    return batch.cast(target_schema)

# Apply casting across all batches in the Ray Dataset
aligned_dataset = dataset.map_batches(cast_batch, batch_format="pyarrow")

Schema Inspection Before Casting

from pypaimon.schema.data_types import PyarrowFieldParser

# Get canonical schema from Paimon table
target_schema = PyarrowFieldParser.from_paimon_schema(table.table_schema.fields)

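# Compare source and target schemas
# Ray's Dataset.schema() returns a ray.data.Schema, which exposes .names and .types
source_schema = dataset.schema()
print("Source schema:", source_schema)
print("Target schema:", target_schema)

# Identify mismatched columns by name rather than by position
source_types = dict(zip(source_schema.names, source_schema.types))
for tgt_field in target_schema:
    src_type = source_types.get(tgt_field.name)  # None if the column is missing
    if src_type != tgt_field.type:
        print(f"  Column '{tgt_field.name}': {src_type} -> {tgt_field.type}")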

Related Pages

Implements Principle

Requires Environment
