Implementation: Apache Paimon Schema From PyArrow
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Data_Ingestion |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A concrete utility for building Paimon Schema objects from PyArrow schemas during table creation.
Description
Schema.from_pyarrow_schema() converts a PyArrow schema to a Paimon Schema with additional metadata (partition keys, primary keys, table options). It validates blob type constraints and required options. This method is used in Ray ingestion pipelines to ensure the target table schema matches the expected data format.
The conversion process:
- Iterates over each field in the PyArrow schema.
- Converts each PyArrow field to its corresponding Paimon DataField representation.
- Attaches metadata such as partition keys, primary keys, and table options.
- Validates constraints (e.g., blob types require specific configurations).
- Returns a fully constructed Schema instance ready for catalog.create_table().
Usage
Use Schema.from_pyarrow_schema() when creating a new Paimon table whose schema is defined using PyArrow types. This is the standard approach in Ray-based ingestion pipelines where the data source schema is known in PyArrow format.
Code Reference
Source Location
- Repository: Apache Paimon
- File: paimon-python/pypaimon/schema/schema.py:L28-88
Signature
@dataclass
class Schema:
    def __init__(self, fields: Optional[List[DataField]] = None,
                 partition_keys: Optional[List[str]] = None,
                 primary_keys: Optional[List[str]] = None,
                 options: Optional[Dict] = None,
                 comment: Optional[str] = None): ...

    @staticmethod
    def from_pyarrow_schema(pa_schema: pa.Schema,
                            partition_keys: Optional[List[str]] = None,
                            primary_keys: Optional[List[str]] = None,
                            options: Optional[Dict] = None,
                            comment: Optional[str] = None) -> Schema: ...
Import
from pypaimon.schema.schema import Schema
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| pa_schema | pa.Schema | Yes | PyArrow schema defining the column names and types for the Paimon table. |
| partition_keys | Optional[List[str]] | No | List of column names to use as partition keys. Controls physical data layout for partition pruning. |
| primary_keys | Optional[List[str]] | No | List of column names to use as primary keys. Enables upsert semantics and deduplication. |
| options | Optional[Dict] | No | Table options as a dictionary of string key-value pairs (e.g., {'bucket': '4'}). Controls storage behavior. |
| comment | Optional[str] | No | Optional human-readable description of the table. |
Outputs
| Name | Type | Description |
|---|---|---|
| schema | Schema | A Paimon Schema instance containing the converted fields, partition keys, primary keys, and options. Ready for use with catalog.create_table(). |
Usage Examples
Basic Usage
import pyarrow as pa
from pypaimon.schema.schema import Schema
pa_schema = pa.schema([
('user_id', pa.int64()),
('event', pa.string()),
('timestamp', pa.int64()),
('value', pa.float64()),
])
schema = Schema.from_pyarrow_schema(
pa_schema,
primary_keys=['user_id'],
options={'bucket': '4'}
)
# `catalog` is assumed to be an existing Catalog instance created earlier in the pipeline
catalog.create_table('analytics.events', schema, ignore_if_exists=True)
table = catalog.get_table('analytics.events')
With Partition Keys
import pyarrow as pa
from pypaimon.schema.schema import Schema
pa_schema = pa.schema([
('order_id', pa.int64()),
('customer_id', pa.int64()),
('order_date', pa.string()),
('amount', pa.float64()),
])
schema = Schema.from_pyarrow_schema(
pa_schema,
partition_keys=['order_date'],
primary_keys=['order_id'],
options={'bucket': '2'},
comment='Daily order data partitioned by date'
)
# `catalog` is assumed to be an existing Catalog instance created earlier in the pipeline
catalog.create_table('sales.orders', schema, ignore_if_exists=True)