Principle: Apache Paimon Batch Data Writing
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Table_Format |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Mechanism for writing batch data to data lake tables with automatic partitioning and bucketing.
Description
Batch data writing in Paimon accepts data in PyArrow Table, RecordBatch, or pandas DataFrame formats and routes each row to the correct partition and bucket based on the table schema. The writer extracts partition and bucket keys from each row, groups data accordingly, and writes to partition-specific file store writers. This ensures data locality and enables efficient scan pruning on reads.
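The routing step can be sketched in plain Python. This is a toy model of the fan-out logic only: the real writer operates on Arrow data and writes through partition-specific file store writers, while the `route` function, its parameters, and the sample rows below are illustrative stand-ins.

```python
# Toy sketch of partition/bucket routing: group each row by its
# partition key values and a hash of its bucket key values, mirroring
# how a batch writer fans rows out to per-partition, per-bucket writers.
from collections import defaultdict

def route(rows, partition_keys, bucket_keys, num_buckets=4):
    """Group rows by (partition values, bucket id)."""
    groups = defaultdict(list)
    for row in rows:
        partition = tuple(row[k] for k in partition_keys)
        bucket = hash(tuple(row[k] for k in bucket_keys)) % num_buckets
        groups[(partition, bucket)].append(row)
    return groups

rows = [
    {"dt": "2024-01-01", "user_id": 1, "amount": 10.0},
    {"dt": "2024-01-01", "user_id": 2, "amount": 5.0},
    {"dt": "2024-01-02", "user_id": 1, "amount": 7.5},
]
groups = route(rows, partition_keys=["dt"], bucket_keys=["user_id"])
# Rows sharing the same dt value land in the same partition group;
# within a partition, user_id determines the bucket.
```

Grouping before writing is what gives the on-disk layout its data locality: every file holds rows from exactly one partition and bucket, so a reader can prune whole directories by partition value.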
The write pipeline validates incoming data against the table schema, converts between formats as needed (e.g., pandas to RecordBatch), and manages internal write buffers. Data is organized into partition directories and bucket files on disk, following the Paimon storage layout conventions.
Usage
Use this principle when loading data into Paimon tables. Suitable for ETL jobs, data migration, and batch data ingestion from any source that can produce Arrow-compatible data. The typical workflow involves: (1) obtaining a BatchWriteBuilder from the table, (2) creating a BatchTableWrite writer, (3) writing data via write_arrow(), write_arrow_batch(), or write_pandas(), and (4) calling prepare_commit() to finalize the write batch.
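The four-step lifecycle can be modeled with a minimal stdlib mock. The class and method names below mirror the pypaimon API mentioned above, but the bodies are dependency-free stand-ins (lists of dicts instead of DataFrames, buffered rows instead of real commit messages), not the actual implementation.

```python
# Toy model of the BatchWriteBuilder -> BatchTableWrite lifecycle.
# Names follow the API described above; bodies are illustrative only.

class BatchTableWrite:
    def __init__(self):
        self._pending = []

    def write_pandas(self, records):
        # The real writer accepts a pandas DataFrame; a list of dicts
        # stands in here to keep the sketch dependency-free.
        self._pending.extend(records)

    def prepare_commit(self):
        # Finalizes the batch: returns messages describing what was
        # staged, and clears the internal write buffer.
        messages, self._pending = list(self._pending), []
        return messages

class BatchWriteBuilder:
    def new_write(self):
        return BatchTableWrite()

# The four-step workflow from the Usage section:
builder = BatchWriteBuilder()                 # (1) obtain a builder
writer = builder.new_write()                  # (2) create a writer
writer.write_pandas([{"id": 1}, {"id": 2}])   # (3) write data
messages = writer.prepare_commit()            # (4) finalize the batch
```

Separating the builder from the writer keeps configuration (step 1) distinct from execution (steps 2 to 4), so one configured builder can mint multiple independent writers.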
Theoretical Basis
Follows the partition-bucket write model where data is horizontally partitioned by partition keys and hash-bucketed within each partition. Key concepts include:
- Partition routing: Each row is assigned to a partition based on the values of its partition key columns. Rows with the same partition key values are written to the same partition directory.
- Bucket hashing: Within each partition, rows are further distributed across buckets using a hash function on the bucket key columns. This provides data distribution and limits file sizes.
- Builder pattern: The BatchWriteBuilder -> BatchTableWrite chain provides a clean API for write lifecycle management, separating configuration from execution.
- Format flexibility: By accepting PyArrow Tables, RecordBatches, and pandas DataFrames, the writer integrates with the most common Python data processing libraries without requiring manual format conversion.
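The bucket-hashing concept can be demonstrated with a deterministic hash. Paimon computes its own hash over the bucket key columns; `zlib.crc32` below is an illustrative stand-in used only to show the two properties that matter: the mapping is stable, and keys spread across all buckets.

```python
# Sketch of bucket hashing within one partition: a stable hash of the
# bucket key values modulo the bucket count picks the bucket.
import zlib

def bucket_of(bucket_key_values, num_buckets):
    # crc32 stands in for Paimon's own hash function (an assumption
    # for illustration); any stable hash shows the same behavior.
    encoded = "|".join(map(str, bucket_key_values)).encode()
    return zlib.crc32(encoded) % num_buckets

NUM_BUCKETS = 4
counts = [0] * NUM_BUCKETS
for user_id in range(1000):
    counts[bucket_of((user_id,), NUM_BUCKETS)] += 1
# The same key always maps to the same bucket, and 1000 distinct keys
# spread across all four buckets, bounding the size of any one bucket.
```

Because the bucket count is fixed per table, the hash gives a stable row-to-file mapping while capping how much data any single bucket file accumulates.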