Implementation:Apache Paimon FormatTableWrite

Knowledge Sources	Apache_Paimon
Domains	Format Tables, Data Writing
Last Updated	2026-02-08 00:00 GMT

Overview

FormatTableWrite writes Arrow or Pandas data to format tables in various file formats with support for partitioning and overwrite modes.

Description

The FormatTableWrite class writes data to Apache Paimon format tables. It accepts PyArrow tables, PyArrow record batches, or Pandas DataFrames and serializes them to Parquet, ORC, CSV, JSON, or TEXT files using format-specific encoding.

The writer handles partitioned writes by grouping rows by partition values and writing each partition to its corresponding directory. It supports both Hive-style partitioning (key=value) and partition-only-value mode. For overwrite operations, it clears existing partition directories before writing new data.
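The grouping and path layout described above can be sketched in plain Python. This is a minimal illustration rather than the pypaimon implementation: `partition_path` and `group_rows_by_partition` are hypothetical helpers, rows are modeled as dicts instead of Arrow data, and the `hive_style` flag is an assumed name for the choice between key=value and value-only directories.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def partition_path(partition_keys: Sequence[str], values: Sequence[object],
                   hive_style: bool = True) -> str:
    """Build the relative directory of one partition (hypothetical helper)."""
    if hive_style:
        # Hive-style layout: one key=value segment per partition column
        parts = [f"{key}={value}" for key, value in zip(partition_keys, values)]
    else:
        # Value-only layout: one segment per partition value, no key names
        parts = [str(value) for value in values]
    return "/".join(parts)


def group_rows_by_partition(rows: Sequence[Row],
                            partition_keys: Sequence[str]) -> Dict[Tuple, List[Row]]:
    """Group rows by the tuple of their partition column values."""
    groups: Dict[Tuple, List[Row]] = defaultdict(list)
    for row in rows:
        key = tuple(row[name] for name in partition_keys)
        groups[key].append(row)
    return dict(groups)


rows = [
    {"id": 1, "date": "2024-01-01"},
    {"id": 2, "date": "2024-01-01"},
    {"id": 3, "date": "2024-01-02"},
]
groups = group_rows_by_partition(rows, ["date"])
for values, part_rows in groups.items():
    print(partition_path(["date"], values), len(part_rows))
```

Each group would then be written as one or more files under its own directory, which is where the format-specific encoding takes over.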

The implementation validates partition columns, groups rows by partition, and applies format-specific encoding. For TEXT format, it checks that exactly one string column is written, not counting partition columns. The writer tracks every written file path and returns commit messages for integration with table commit workflows.
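The TEXT format rule can be illustrated with a small check. The sketch below is an assumption about how such a validation might look, not the library's code: `validate_text_schema` is a hypothetical function, and the schema is modeled as a plain mapping from column name to a type name such as "string".

```python
from typing import Dict, Sequence


def validate_text_schema(columns: Dict[str, str],
                         partition_keys: Sequence[str]) -> str:
    """Return the name of the single data column, or raise ValueError.

    `columns` maps column name to a type name (hypothetical representation).
    """
    # Partition columns are not counted toward the single-column rule
    data_columns = [name for name in columns if name not in partition_keys]
    if len(data_columns) != 1:
        raise ValueError(
            f"TEXT format expects exactly one data column, got {data_columns}"
        )
    name = data_columns[0]
    if columns[name] != "string":
        raise ValueError(
            f"TEXT format expects a string column, got {columns[name]!r}"
        )
    return name


# One string column plus a partition column passes the check
text_column = validate_text_schema({"content": "string", "date": "string"}, ["date"])
print(text_column)
```

Running the check before any file is opened keeps a rejected write from leaving partial output behind.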

Usage

Use FormatTableWrite when writing data to external format tables, implementing batch ETL pipelines, or exporting data from Paimon to standard file formats with optional partitioning and overwrite support.

Code Reference

Source Location

Repository: Apache_Paimon
File: paimon-python/pypaimon/table/format/format_table_write.py

Signature

class FormatTableWrite:
    """Batch write for format table: Arrow/Pandas to partition dirs."""

    def __init__(
        self,
        table: FormatTable,
        overwrite: bool = False,
        static_partitions: Optional[Dict[str, str]] = None,
    ):
        """Initialize with table, overwrite mode, and optional static partitions."""

    def write_arrow(self, data: pyarrow.Table) -> None:
        """Write Arrow table."""

    def write_arrow_batch(self, data: pyarrow.RecordBatch) -> None:
        """Write Arrow RecordBatch."""

    def write_pandas(self, df) -> None:
        """Write Pandas DataFrame."""

    def prepare_commit(self) -> List[FormatTableCommitMessage]:
        """Prepare commit messages with written file paths."""

    def close(self) -> None:
        """Close writer and release resources."""

Import

from pypaimon.table.format.format_table_write import FormatTableWrite

I/O Contract

Inputs

Name	Type	Required	Description
table	FormatTable	Yes	Format table to write to
overwrite	bool	No	Whether to overwrite existing data (default False)
static_partitions	Dict[str, str]	No	Static partition specification for targeted overwrites
data	pyarrow.Table, pyarrow.RecordBatch, or pandas.DataFrame	Yes	Data to write, passed to write_arrow, write_arrow_batch, or write_pandas respectively
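To make the effect of overwrite with static_partitions concrete, the following sketch clears a single Hive-style partition directory on the local file system and leaves its sibling untouched. It is an illustration under stated assumptions: `clear_partition_dir` is a hypothetical helper, the real writer goes through the table's own file I/O rather than `shutil`, and which partitions it clears is decided by the writer, not by this code.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict


def clear_partition_dir(table_root: Path, partition: Dict[str, str]) -> Path:
    """Remove existing files of one partition and return its empty directory."""
    target = table_root
    for key, value in partition.items():
        # Hive-style segment per partition column, in the given order
        target = target / f"{key}={value}"
    if target.exists():
        shutil.rmtree(target)
    target.mkdir(parents=True, exist_ok=True)
    return target


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Two partitions, each holding one stale file
    for day in ("2024-01-01", "2024-01-02"):
        part = root / f"date={day}"
        part.mkdir()
        (part / "data-old.parquet").write_text("old")

    target = clear_partition_dir(root, {"date": "2024-01-01"})
    remaining_in_target = sorted(p.name for p in target.iterdir())
    untouched = sorted(p.name for p in (root / "date=2024-01-02").iterdir())
    print(remaining_in_target, untouched)
```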

Outputs

Name	Type	Description
commit_messages	List[FormatTableCommitMessage]	Commit messages returned by prepare_commit, carrying the paths of the written files

Usage Examples

from pypaimon.table.format.format_table_write import FormatTableWrite
import pyarrow as pa
import pandas as pd

# format_table is assumed to be an existing FormatTable instance obtained elsewhere.

# Write Pandas DataFrame
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "date": ["2024-01-01", "2024-01-01", "2024-01-02"]
})

writer = FormatTableWrite(table=format_table)
writer.write_pandas(df)
commit_messages = writer.prepare_commit()
writer.close()

# Write with overwrite mode
writer = FormatTableWrite(table=format_table, overwrite=True)
writer.write_pandas(df)
writer.prepare_commit()
writer.close()

# Write to partitioned table
partitioned_table = format_table  # has partition_keys = ["date"]
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["A", "B", "C", "D"],
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"]
})

writer = FormatTableWrite(table=partitioned_table)
writer.write_pandas(df)
# Creates: /location/date=2024-01-01/data-*.parquet
#          /location/date=2024-01-02/data-*.parquet

# Write Arrow table
arrow_table = pa.table({
    "col1": [1, 2, 3],
    "col2": ["x", "y", "z"]
})

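writer = FormatTableWrite(table=format_table)
writer.write_arrow(arrow_table)
writer.close()

# Write Arrow record batches one at a time. This is an alternative to
# write_arrow above, so it uses its own writer instead of reusing the previous one.
writer = FormatTableWrite(table=format_table)
for batch in arrow_table.to_batches():
    writer.write_arrow_batch(batch)
writer.close()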

# Overwrite specific partition
writer = FormatTableWrite(
    table=partitioned_table,
    overwrite=True,
    static_partitions={"date": "2024-01-01"}
)
writer.write_pandas(df)
# Only overwrites date=2024-01-01 partition

# TEXT format example
text_table = format_table  # format=Format.TEXT
df = pd.DataFrame({"content": ["line1", "line2", "line3"]})
writer = FormatTableWrite(table=text_table)
writer.write_pandas(df)
# Writes text file with one line per row

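# Inspect the file paths recorded by the first write (commit_messages was
# returned by prepare_commit in the Pandas DataFrame example above)
print(f"Written files: {[msg.written_paths for msg in commit_messages]}")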
