Principle: Apache Paimon Schema Alignment
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Data_Ingestion |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Mechanism for aligning external data schemas with target Paimon table schemas through type conversion and column casting.
Description
Schema alignment resolves type mismatches between incoming data (from Ray Datasets) and target Paimon table schemas. External data often has inferred types that differ from the declared Paimon schema (e.g., int32 vs int64, or string vs large_string). PyarrowFieldParser.from_paimon_schema() generates the canonical PyArrow schema from the Paimon table's fields, and ray_dataset.map_batches() applies batch-level type casting to conform the data.
The alignment process involves:
- Obtain the target schema: Call PyarrowFieldParser.from_paimon_schema(table.table_schema.fields) to generate the canonical PyArrow schema that matches the Paimon table's field definitions.
- Define a casting function: Create a function that takes a PyArrow Table batch and calls batch.cast(target_schema) to convert column types.
- Apply batch-level casting: Use dataset.map_batches(cast_fn, batch_format="pyarrow") to apply the casting function across all batches in the distributed Dataset.
This step is critical for preventing write failures due to schema incompatibility. Common mismatches include:
- Integer width differences: JSON readers may infer int32 while the Paimon table declares int64.
- String type variants: Some readers produce string while Paimon may use large_string.
- Floating point precision: Source data may use float32 while the table requires float64.
Usage
Use between loading external data and writing to Paimon when type mismatches exist between the source and target schemas. This step is especially important when the data source uses schema inference (e.g., read_json) rather than an explicit schema declaration.
Theoretical Basis
Type coercion follows the principle of widening conversions (safe casts) where possible, raising errors for incompatible narrowing casts. This is based on type theory's concept of subtyping: a value of type int32 can always be safely represented as int64 (widening), but the reverse is not always true (narrowing).
Key theoretical properties:
- Safe casts: Widening conversions (e.g., int32 to int64, float32 to float64) preserve data fidelity without loss of information.
- Batch-level efficiency: Casting operates on entire Arrow record batches using columnar operations, which is significantly more efficient than row-level type conversion due to CPU cache locality and SIMD optimization opportunities.
- Schema contract enforcement: The target schema derived from the Paimon table acts as a strict contract. Any column that cannot be cast to the target type will raise an error, preventing silent data corruption.