
Implementation:Apache Paimon Schema From Pyarrow

From Leeroopedia


Knowledge Sources
Domains Data_Lake, Data_Ingestion
Last Updated 2026-02-07 00:00 GMT

Overview

Concrete tool for creating Paimon Schema objects from PyArrow schemas for table creation.

Description

Schema.from_pyarrow_schema() converts a PyArrow schema to a Paimon Schema with additional metadata (partition keys, primary keys, table options). It validates blob type constraints and required options. This method is used in Ray ingestion pipelines to ensure the target table schema matches the expected data format.

The conversion process:

  1. Iterates over each field in the PyArrow schema.
  2. Converts each PyArrow type to its corresponding Paimon DataField representation.
  3. Attaches metadata such as partition keys, primary keys, and table options.
  4. Validates constraints (e.g., blob types require specific configurations).
  5. Returns a fully constructed Schema instance ready for catalog.create_table().
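The loop above can be sketched in plain Python. The type-mapping table and helper names below are illustrative assumptions for exposition, not pypaimon's actual internals (the real conversion lives in `paimon-python/pypaimon/schema/schema.py`):

```python
# Illustrative sketch of the field-conversion loop. The mapping below is a
# simplified assumption; pypaimon's real type table covers far more types.
PA_TO_PAIMON = {
    'int32': 'INT',
    'int64': 'BIGINT',
    'string': 'STRING',
    'double': 'DOUBLE',
    'bool': 'BOOLEAN',
}

def convert_fields(pa_fields):
    """Map (name, pyarrow-type-name) pairs to Paimon-style DataField dicts."""
    fields = []
    for field_id, (name, pa_type) in enumerate(pa_fields):
        if pa_type not in PA_TO_PAIMON:
            # mirrors step 4: unsupported/constrained types are rejected
            raise ValueError(f"unsupported PyArrow type: {pa_type}")
        fields.append({'id': field_id, 'name': name,
                       'type': PA_TO_PAIMON[pa_type]})
    return fields

print(convert_fields([('user_id', 'int64'), ('event', 'string')]))
```

Each field gets a stable integer id and a Paimon type name; the resulting list is what a Schema instance would carry into `catalog.create_table()`.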

Usage

Use Schema.from_pyarrow_schema() when creating a new Paimon table whose schema is defined using PyArrow types. This is the standard approach in Ray-based ingestion pipelines where the data source schema is known in PyArrow format.

Code Reference

Source Location

  • Repository: Apache Paimon
  • File: paimon-python/pypaimon/schema/schema.py:L28-88

Signature

@dataclass
class Schema:
    def __init__(self, fields: Optional[List[DataField]] = None,
                 partition_keys: Optional[List[str]] = None,
                 primary_keys: Optional[List[str]] = None,
                 options: Optional[Dict] = None,
                 comment: Optional[str] = None):

    @staticmethod
    def from_pyarrow_schema(pa_schema: pa.Schema,
                            partition_keys: Optional[List[str]] = None,
                            primary_keys: Optional[List[str]] = None,
                            options: Optional[Dict] = None,
                            comment: Optional[str] = None) -> "Schema":

Import

from pypaimon.schema.schema import Schema

I/O Contract

Inputs

Name Type Required Description
pa_schema pa.Schema Yes PyArrow schema defining the column names and types for the Paimon table.
partition_keys Optional[List[str]] No List of column names to use as partition keys. Controls physical data layout for partition pruning.
primary_keys Optional[List[str]] No List of column names to use as primary keys. Enables upsert semantics and deduplication.
options Optional[Dict] No Table options as a dictionary of string key-value pairs (e.g., {'bucket': '4'}). Controls storage behavior.
comment Optional[str] No Optional human-readable description of the table.

Outputs

Name Type Description
schema Schema A Paimon Schema instance containing the converted fields, partition keys, primary keys, and options. Ready for use with catalog.create_table().
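The contract above can be enforced with a small pre-flight check before calling `from_pyarrow_schema()`. This is a minimal sketch under assumed rules (key columns must exist in the schema, option values must be strings); it is not part of the pypaimon API:

```python
# Hypothetical pre-flight validator mirroring the I/O contract:
# partition/primary keys must name real columns, option values are strings.
def validate_schema_args(column_names, partition_keys=None,
                         primary_keys=None, options=None):
    cols = set(column_names)
    for label, keys in (('partition key', partition_keys or []),
                        ('primary key', primary_keys or [])):
        for key in keys:
            if key not in cols:
                raise ValueError(f"{label} '{key}' not in schema columns")
    for k, v in (options or {}).items():
        if not isinstance(v, str):
            raise TypeError(f"option '{k}' must be a string, "
                            f"got {type(v).__name__}")

# Passes silently for a valid combination.
validate_schema_args(['user_id', 'event'],
                     primary_keys=['user_id'],
                     options={'bucket': '4'})
```

Catching a misspelled key column or a non-string option value (e.g. `{'bucket': 4}`) here gives a clearer error than a failure inside table creation.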

Usage Examples

Basic Usage

import pyarrow as pa
from pypaimon.schema.schema import Schema

pa_schema = pa.schema([
    ('user_id', pa.int64()),
    ('event', pa.string()),
    ('timestamp', pa.int64()),
    ('value', pa.float64()),
])

schema = Schema.from_pyarrow_schema(
    pa_schema,
    primary_keys=['user_id'],
    options={'bucket': '4'}
)

# assumes a Catalog instance named `catalog` was obtained beforehand
catalog.create_table('analytics.events', schema, ignore_if_exists=True)
table = catalog.get_table('analytics.events')

With Partition Keys

import pyarrow as pa
from pypaimon.schema.schema import Schema

pa_schema = pa.schema([
    ('order_id', pa.int64()),
    ('customer_id', pa.int64()),
    ('order_date', pa.string()),
    ('amount', pa.float64()),
])

schema = Schema.from_pyarrow_schema(
    pa_schema,
    partition_keys=['order_date'],
    primary_keys=['order_id'],
    options={'bucket': '2'},
    comment='Daily order data partitioned by date'
)

# assumes a Catalog instance named `catalog` was obtained beforehand
catalog.create_table('sales.orders', schema, ignore_if_exists=True)

Related Pages

Implements Principle

Requires Environment
