Principle: Apache Paimon Batch Data Writing
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Table_Format |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Mechanism for writing batch data to data lake tables with automatic partitioning and bucketing.
Description
Batch data writing in Paimon accepts data in PyArrow Table, RecordBatch, or pandas DataFrame formats and routes each row to the correct partition and bucket based on the table schema. The writer extracts partition and bucket keys from each row, groups data accordingly, and writes to partition-specific file store writers. This ensures data locality and enables efficient scan pruning on reads.
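The routing step can be sketched in plain Python. This is a toy model of the fan-out logic only: the real writer operates on Arrow data and writes through partition-specific file store writers, while the `route` function, its parameters, and the sample rows below are illustrative stand-ins.

```python
# Toy sketch of partition/bucket routing: group each row by its
# partition key values and a hash of its bucket key values, mirroring
# how a batch writer fans rows out to per-partition, per-bucket writers.
from collections import defaultdict

def route(rows, partition_keys, bucket_keys, num_buckets=4):
    """Group rows by (partition values, bucket id)."""
    groups = defaultdict(list)
    for row in rows:
        partition = tuple(row[k] for k in partition_keys)
        bucket = hash(tuple(row[k] for k in bucket_keys)) % num_buckets
        groups[(partition, bucket)].append(row)
    return groups

rows = [
    {"dt": "2024-01-01", "user_id": 1, "amount": 10.0},
    {"dt": "2024-01-01", "user_id": 2, "amount": 5.0},
    {"dt": "2024-01-02", "user_id": 1, "amount": 7.5},
]
groups = route(rows, partition_keys=["dt"], bucket_keys=["user_id"])
# Rows sharing the same dt value land in the same partition group;
# within a partition, user_id determines the bucket.
```

Grouping before writing is what gives the on-disk layout its data locality: every file holds rows from exactly one partition and bucket, so a reader can prune whole directories by partition value.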
The write pipeline validates incoming data against the table schema, converts between formats as needed (e.g., pandas to RecordBatch), and manages internal write buffers. Data is organized into partition directories and bucket files on disk, following the Paimon storage layout conventions.
Usage
Use this principle when loading data into Paimon tables. Suitable for ETL jobs, data migration, and batch data ingestion from any source that can produce Arrow-compatible data. The typical workflow involves: (1) obtaining a BatchWriteBuilder from the table, (2) creating a BatchTableWrite writer, (3) writing data via write_arrow(), write_arrow_batch(), or write_pandas(), and (4) calling prepare_commit() to finalize the write batch.
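The four-step lifecycle can be modeled with a minimal stdlib mock. The class and method names below mirror the pypaimon API mentioned above, but the bodies are dependency-free stand-ins (lists of dicts instead of DataFrames, buffered rows instead of real commit messages), not the actual implementation.

```python
# Toy model of the BatchWriteBuilder -> BatchTableWrite lifecycle.
# Names follow the API described above; bodies are illustrative only.

class BatchTableWrite:
    def __init__(self):
        self._pending = []

    def write_pandas(self, records):
        # The real writer accepts a pandas DataFrame; a list of dicts
        # stands in here to keep the sketch dependency-free.
        self._pending.extend(records)

    def prepare_commit(self):
        # Finalizes the batch: returns messages describing what was
        # staged, and clears the internal write buffer.
        messages, self._pending = list(self._pending), []
        return messages

class BatchWriteBuilder:
    def new_write(self):
        return BatchTableWrite()

# The four-step workflow from the Usage section:
builder = BatchWriteBuilder()                 # (1) obtain a builder
writer = builder.new_write()                  # (2) create a writer
writer.write_pandas([{"id": 1}, {"id": 2}])   # (3) write data
messages = writer.prepare_commit()            # (4) finalize the batch
```

Separating the builder from the writer keeps configuration (step 1) distinct from execution (steps 2 to 4), so one configured builder can mint multiple independent writers.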
Theoretical Basis
Follows the partition-bucket write model where data is horizontally partitioned by partition keys and hash-bucketed within each partition. Key concepts include:
- Partition routing: Each row is assigned to a partition based on the values of its partition key columns. Rows with the same partition key values are written to the same partition directory.
- Bucket hashing: Within each partition, rows are further distributed across buckets using a hash function on the bucket key columns. This provides data distribution and limits file sizes.
- Builder pattern: The BatchWriteBuilder -> BatchTableWrite chain provides a clean API for write lifecycle management, separating configuration from execution.
- Format flexibility: By accepting PyArrow Tables, RecordBatches, and pandas DataFrames, the writer integrates with the most common Python data processing libraries without requiring manual format conversion.
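The bucket-hashing concept can be demonstrated with a deterministic hash. Paimon computes its own hash over the bucket key columns; `zlib.crc32` below is an illustrative stand-in used only to show the two properties that matter: the mapping is stable, and keys spread across all buckets.

```python
# Sketch of bucket hashing within one partition: a stable hash of the
# bucket key values modulo the bucket count picks the bucket.
import zlib

def bucket_of(bucket_key_values, num_buckets):
    # crc32 stands in for Paimon's own hash function (an assumption
    # for illustration); any stable hash shows the same behavior.
    encoded = "|".join(map(str, bucket_key_values)).encode()
    return zlib.crc32(encoded) % num_buckets

NUM_BUCKETS = 4
counts = [0] * NUM_BUCKETS
for user_id in range(1000):
    counts[bucket_of((user_id,), NUM_BUCKETS)] += 1
# The same key always maps to the same bucket, and 1000 distinct keys
# spread across all four buckets, bounding the size of any one bucket.
```

Because the bucket count is fixed per table, the hash gives a stable row-to-file mapping while capping how much data any single bucket file accumulates.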