Implementation:Apache Paimon FormatTableWrite
| Knowledge Sources | |
|---|---|
| Domains | Format Tables, Data Writing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
FormatTableWrite writes Arrow or Pandas data to format tables in various file formats with support for partitioning and overwrite modes.
Description
The FormatTableWrite class writes data to Apache Paimon format tables. It accepts PyArrow tables, PyArrow record batches, or Pandas DataFrames and serializes them to Parquet, ORC, CSV, JSON, or TEXT files using the encoding that matches the table's format.
The writer handles partitioned writes by grouping rows by partition values and writing each partition to its corresponding directory. It supports both Hive-style partitioning (key=value) and partition-only-value mode. For overwrite operations, it clears existing partition directories before writing new data.
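The grouping and overwrite behaviour described above can be sketched with the standard library alone. This is an illustrative stand-in, not Paimon's implementation: the helper names `partition_dir` and `write_partitioned`, the JSON-lines output, and the choice to leave partition columns out of the data file are all assumptions made for the example.

```python
import json
import shutil
import tempfile
import uuid
from collections import defaultdict
from pathlib import Path


def partition_dir(root, keys, values, only_value=False):
    """Build the directory for one partition under root (hypothetical helper)."""
    # Hive-style: key=value segments; value-only mode: just the values.
    parts = [str(v) if only_value else f"{k}={v}" for k, v in zip(keys, values)]
    return Path(root).joinpath(*parts)


def write_partitioned(rows, root, keys, overwrite=False, only_value=False):
    """Group row dicts by partition values and write one file per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)

    written = []
    for values, group in groups.items():
        target = partition_dir(root, keys, values, only_value)
        if overwrite and target.exists():
            # Clear only the partition directory that is about to be rewritten.
            shutil.rmtree(target)
        target.mkdir(parents=True, exist_ok=True)
        path = target / f"data-{uuid.uuid4().hex}.json"
        with path.open("w", encoding="utf-8") as f:
            for row in group:
                # Assumption: partition columns are carried by the path,
                # so they are omitted from the file contents.
                data = {k: v for k, v in row.items() if k not in keys}
                f.write(json.dumps(data) + "\n")
        written.append(path)
    return written


root = tempfile.mkdtemp()
rows = [
    {"id": 1, "date": "2024-01-01"},
    {"id": 2, "date": "2024-01-01"},
    {"id": 3, "date": "2024-01-02"},
]
paths = write_partitioned(rows, root, ["date"])
print(sorted(p.parent.name for p in paths))
```

Calling `write_partitioned` a second time with `overwrite=True` leaves exactly one file in each touched partition directory, while partitions absent from the new data are left alone.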
The implementation validates partition columns, groups rows by partition, and applies format-specific encoding. For TEXT format, it checks that exactly one string column remains once partition columns are excluded. The writer tracks every file path it writes and exposes those paths through commit messages, so it can plug into table commit workflows.
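As a rough illustration of the TEXT-format rule, the check below takes a schema expressed as a plain mapping of column name to type name. The function name `validate_text_schema`, the `"string"` type label, and the error messages are hypothetical; the actual writer presumably inspects the Arrow schema instead.

```python
def validate_text_schema(columns, partition_keys):
    """Return the name of the single string data column, or raise ValueError.

    columns: mapping of column name -> type name, e.g. {"content": "string"}.
    partition_keys: column names that are carried by the directory path.
    """
    # Partition columns do not count towards the single-column limit.
    data_columns = {n: t for n, t in columns.items() if n not in partition_keys}
    if len(data_columns) != 1:
        raise ValueError(
            f"TEXT format expects exactly one data column, got {sorted(data_columns)}"
        )
    (name, type_name), = data_columns.items()
    if type_name != "string":
        raise ValueError(f"TEXT format column {name!r} must be string, got {type_name}")
    return name


# One string column plus a partition column passes the check.
print(validate_text_schema({"content": "string", "date": "string"}, ["date"]))
```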
Usage
Use FormatTableWrite when writing data to external format tables, implementing batch ETL pipelines, or exporting data from Paimon to standard file formats with optional partitioning and overwrite support.
Code Reference
Source Location
- Repository: Apache_Paimon
- File: paimon-python/pypaimon/table/format/format_table_write.py
Signature
class FormatTableWrite:
    """Batch write for format table: Arrow/Pandas to partition dirs."""

    def __init__(
        self,
        table: FormatTable,
        overwrite: bool = False,
        static_partitions: Optional[Dict[str, str]] = None,
    ):
        """Initialize with table, overwrite mode, and optional static partitions."""

    def write_arrow(self, data: pyarrow.Table) -> None:
        """Write Arrow table."""

    def write_arrow_batch(self, data: pyarrow.RecordBatch) -> None:
        """Write Arrow RecordBatch."""

    def write_pandas(self, df) -> None:
        """Write Pandas DataFrame."""

    def prepare_commit(self) -> List[FormatTableCommitMessage]:
        """Prepare commit messages with written file paths."""

    def close(self) -> None:
        """Close writer and release resources."""
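The signature implies a write, prepare_commit, close lifecycle. One way to make sure `close()` runs even when a write fails is `contextlib.closing`, which works with any object that has a `close()` method. The sketch below uses a stand-in `StubWriter` class (hypothetical, with fake paths) so that it runs without a Paimon table; whether FormatTableWrite itself supports the `with` statement directly is not stated in the source.

```python
from contextlib import closing


class StubWriter:
    """Stand-in with the same lifecycle methods as the signature above."""

    def __init__(self):
        self.paths = []
        self.closed = False

    def write_arrow(self, data):
        # Record a fake path instead of writing a real file.
        self.paths.append(f"data-{len(self.paths)}.parquet")

    def prepare_commit(self):
        return list(self.paths)

    def close(self):
        self.closed = True


def write_and_commit(writer, batches):
    """Write every batch, return the commit messages, and always close."""
    with closing(writer):
        for batch in batches:
            writer.write_arrow(batch)
        return writer.prepare_commit()


writer = StubWriter()
messages = write_and_commit(writer, ["batch-a", "batch-b"])
print(messages, writer.closed)
```

The same helper would accept a real writer in place of `StubWriter`, assuming it exposes `write_arrow`, `prepare_commit`, and `close` as listed above.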
Import
from pypaimon.table.format.format_table_write import FormatTableWrite
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| table | FormatTable | Yes | Format table to write to |
| overwrite | bool | No | Whether to overwrite existing data (default False) |
| static_partitions | Dict[str, str] | No | Static partition specification for targeted overwrites |
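| data | pyarrow.Table, pyarrow.RecordBatch, or pandas.DataFrame | Yes | Data to write, passed to the matching write_arrow, write_arrow_batch, or write_pandas method |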
Outputs
| Name | Type | Description |
|---|---|---|
| commit_messages | List[FormatTableCommitMessage] | Commit messages carrying the paths of the written files, returned by prepare_commit() |
Usage Examples
from pypaimon.table.format.format_table_write import FormatTableWrite
import pyarrow as pa
import pandas as pd
# Write Pandas DataFrame
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "date": ["2024-01-01", "2024-01-01", "2024-01-02"],
})
writer = FormatTableWrite(table=format_table)
writer.write_pandas(df)
commit_messages = writer.prepare_commit()
writer.close()
# Write with overwrite mode
writer = FormatTableWrite(table=format_table, overwrite=True)
writer.write_pandas(df)
writer.prepare_commit()
writer.close()
# Write to partitioned table
partitioned_table = format_table # has partition_keys = ["date"]
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["A", "B", "C", "D"],
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
})
writer = FormatTableWrite(table=partitioned_table)
writer.write_pandas(df)
# Creates: /location/date=2024-01-01/data-*.parquet
#          /location/date=2024-01-02/data-*.parquet
# Write Arrow table
arrow_table = pa.table({
    "col1": [1, 2, 3],
    "col2": ["x", "y", "z"],
})
writer = FormatTableWrite(table=format_table)
writer.write_arrow(arrow_table)
# Write Arrow batch
for batch in arrow_table.to_batches():
    writer.write_arrow_batch(batch)
# Overwrite specific partition
writer = FormatTableWrite(
    table=partitioned_table,
    overwrite=True,
    static_partitions={"date": "2024-01-01"},
)
writer.write_pandas(df)
# Only overwrites date=2024-01-01 partition
# TEXT format example
text_table = format_table # format=Format.TEXT
df = pd.DataFrame({"content": ["line1", "line2", "line3"]})
writer = FormatTableWrite(table=text_table)
writer.write_pandas(df)
# Writes text file with one line per row
# Collect commit messages from this writer before printing its paths
commit_messages = writer.prepare_commit()
print(f"Written files: {[msg.written_paths for msg in commit_messages]}")
writer.close()