Workflow: Apache Paimon Table Read and Write
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Data_Engineering, ETL |
| Last Updated | 2026-02-07 23:00 GMT |
Overview
End-to-end process for creating Paimon tables, writing data from Arrow or Pandas sources, and reading data back with optional predicate pushdown and column projection.
Description
This workflow covers the fundamental operations of the PyPaimon SDK: catalog initialization, database and table creation, batch data writes, and table reads. It supports two catalog backends (filesystem and REST) and two primary data formats (PyArrow Tables and Pandas DataFrames). The write path uses an atomic commit protocol with manifest tracking, while the read path follows a scan-then-read pipeline that plans splits from snapshot metadata and reads data files in parallel. Predicate pushdown and column projection can be applied at read time to minimize I/O.
Usage
Execute this workflow when you need to create a new Paimon table, ingest batch data from Python applications, or read existing table data for analysis. This is the foundational workflow for all PyPaimon operations and serves as the prerequisite for more advanced workflows like distributed processing with Ray or vector similarity search.
Execution Steps
Step 1: Catalog Initialization
Create a Paimon catalog instance by specifying the warehouse location and catalog type. The catalog factory accepts configuration options including the warehouse path (local filesystem, S3, OSS, or HDFS), the metastore type (filesystem or REST), and authentication credentials. For REST catalogs, provide the server URI, authentication token, and optional data access token configuration.
Key considerations:
- Choose filesystem catalog for direct storage access or REST catalog for centralized metadata management
- Configure storage credentials (access key, secret key, endpoint) for cloud backends
- The catalog instance is the entry point for all subsequent operations
Step 2: Database and Table Creation
Create a database within the catalog, then define a table schema specifying column names, data types, primary keys, partition keys, and table-level options. The schema supports Paimon's full type system including primitives, decimals, timestamps, arrays, maps, and rows. Table options control storage format, bucket count, merge engine behavior, and compaction settings.
Key considerations:
- Primary keys enable upsert semantics and merge-on-read behavior
- Tables without primary keys operate in append-only mode
- Bucket count affects write parallelism and data distribution
- File format can be set to Avro, Parquet, ORC, or Lance
Step 3: Batch Data Writing
Obtain a batch write builder from the table, then create a writer and commit handler. Feed data as PyArrow Tables, Pandas DataFrames, or PyArrow RecordBatches into the writer. The writer partitions data by partition keys and bucket, serializes rows to the configured file format, and produces data files on storage.
Key considerations:
- Each write session produces a set of data files tracked in commit messages
- Data is routed to partitions and buckets based on the table schema
- The writer handles schema validation and type coercion automatically
Step 4: Atomic Commit
After writing data, call prepare_commit to collect all commit messages, then pass them to the commit handler. The commit operation atomically creates a new snapshot referencing the written data files and updates the manifest metadata. This ensures readers always see a consistent view of the table.
Key considerations:
- Commits are atomic: either all data files are visible or none are
- The commit creates a new snapshot with an incremented snapshot ID
- Manifest files track which data files belong to each snapshot
- Failed commits can be retried safely
Step 5: Table Reading with Scan Planning
Create a read builder from the table, optionally applying predicate filters and column projections. The scan planner reads snapshot metadata, evaluates manifest entries against predicates, and produces a plan containing a list of splits. Each split represents one or more data files to read.
Key considerations:
- Predicate pushdown eliminates entire data files during planning
- Column projection reduces I/O by reading only selected columns
- The scan reads from the latest snapshot by default
- Splits can be distributed across workers for parallel reads
Step 6: Data Retrieval
Create a reader from the read builder and pass splits to it. The reader opens data files, applies deletion vectors for primary key tables, and produces results as PyArrow Tables, Pandas DataFrames, or row iterators. For primary key tables, a sort-merge reader combines multiple sorted runs to produce the latest row versions.
Key considerations:
- Append-only tables read data files directly
- Primary key tables apply deletion vectors and merge sorted runs
- Results can be output as Arrow Tables, Pandas DataFrames, DuckDB relations, or row iterators
- Memory usage scales with the number of concurrent splits being read