Workflow: Apache Paimon Table Read and Write
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Data_Engineering, ETL |
| Last Updated | 2026-02-07 23:00 GMT |
Overview
End-to-end process for creating Paimon tables, writing data from Arrow or Pandas sources, and reading data back with optional predicate pushdown and column projection.
Description
This workflow covers the fundamental operations of the PyPaimon SDK: catalog initialization, database and table creation, batch data writes, and table reads. It supports two catalog backends (filesystem and REST) and two primary data formats (PyArrow Tables and Pandas DataFrames). The write path uses an atomic commit protocol with manifest tracking, while the read path follows a scan-then-read pipeline that plans splits from snapshot metadata and reads data files in parallel. Predicate pushdown and column projection can be applied at read time to minimize I/O.
Usage
Execute this workflow when you need to create a new Paimon table, ingest batch data from Python applications, or read existing table data for analysis. This is the foundational workflow for all PyPaimon operations and serves as the prerequisite for more advanced workflows like distributed processing with Ray or vector similarity search.
Execution Steps
Step 1: Catalog Initialization
Create a Paimon catalog instance by specifying the warehouse location and catalog type. The catalog factory accepts configuration options including the warehouse path (local filesystem, S3, OSS, or HDFS), the metastore type (filesystem or REST), and authentication credentials. For REST catalogs, provide the server URI, authentication token, and optional data access token configuration.
Key considerations:
- Choose filesystem catalog for direct storage access or REST catalog for centralized metadata management
- Configure storage credentials (access key, secret key, endpoint) for cloud backends
- The catalog instance is the entry point for all subsequent operations
Step 2: Database and Table Creation
Create a database within the catalog, then define a table schema specifying column names, data types, primary keys, partition keys, and table-level options. The schema supports Paimon's full type system including primitives, decimals, timestamps, arrays, maps, and rows. Table options control storage format, bucket count, merge engine behavior, and compaction settings.
Key considerations:
- Primary keys enable upsert semantics and merge-on-read behavior
- Tables without primary keys operate in append-only mode
- Bucket count affects write parallelism and data distribution
- File format can be set to Avro, Parquet, ORC, or Lance
Step 3: Batch Data Writing
Obtain a batch write builder from the table, then create a writer and commit handler. Feed data as PyArrow Tables, Pandas DataFrames, or PyArrow RecordBatches into the writer. The writer partitions data by partition keys and bucket, serializes rows to the configured file format, and produces data files on storage.
Key considerations:
- Each write session produces a set of data files tracked in commit messages
- Data is routed to partitions and buckets based on the table schema
- The writer handles schema validation and type coercion automatically
Step 4: Atomic Commit
After writing data, call prepare_commit to collect all commit messages, then pass them to the commit handler. The commit operation atomically creates a new snapshot referencing the written data files and updates the manifest metadata. This ensures readers always see a consistent view of the table.
Key considerations:
- Commits are atomic: either all data files are visible or none are
- The commit creates a new snapshot with an incremented snapshot ID
- Manifest files track which data files belong to each snapshot
- Failed commits can be retried safely
Step 5: Table Reading with Scan Planning
Create a read builder from the table, optionally applying predicate filters and column projections. The scan planner reads snapshot metadata, evaluates manifest entries against predicates, and produces a plan containing a list of splits. Each split represents one or more data files to read.
Key considerations:
- Predicate pushdown eliminates entire data files during planning
- Column projection reduces I/O by reading only selected columns
- The scan reads from the latest snapshot by default
- Splits can be distributed across workers for parallel reads
Step 6: Data Retrieval
Create a reader from the read builder and pass splits to it. The reader opens data files, applies deletion vectors for primary key tables, and produces results as PyArrow Tables, Pandas DataFrames, or row iterators. For primary key tables, a sort-merge reader combines multiple sorted runs to produce the latest row versions.
Key considerations:
- Append-only tables read data files directly
- Primary key tables apply deletion vectors and merge sorted runs
- Results can be output as Arrow Tables, Pandas DataFrames, DuckDB relations, or row iterators
- Memory usage scales with the number of concurrent splits being read