Implementation:Apache Paimon FormatTableWrite

From Leeroopedia


Knowledge Sources
Domains: Format Tables, Data Writing
Last Updated: 2026-02-08 00:00 GMT

Overview

FormatTableWrite writes Arrow or Pandas data to format tables as Parquet, ORC, CSV, JSON, or TEXT files, with support for partitioning and overwrite modes.

Description

The FormatTableWrite class is the batch writer for Apache Paimon format tables. It accepts PyArrow tables, PyArrow record batches, or Pandas DataFrames and serializes them to Parquet, ORC, CSV, JSON, or TEXT files using format-specific serialization.

The writer handles partitioned writes by grouping rows by their partition values and writing each group to its corresponding partition directory. It supports both Hive-style partitioning (directories named key=value) and partition-only-value mode. For overwrite operations, it clears existing partition directories before writing new data.
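The grouping step described above can be sketched in plain Python. This is an illustration only, not pypaimon's implementation: `partition_dir` and `group_rows_by_partition` are hypothetical helpers, rows are modeled as dictionaries instead of Arrow data, and the value-only directory layout is an assumption inferred from the mode's name.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence, Tuple

Row = Dict[str, Any]


def partition_dir(keys: Sequence[str], values: Sequence[Any], only_value: bool = False) -> str:
    """Build a relative partition directory (hypothetical helper)."""
    if only_value:
        # Assumed partition-only-value layout: 2024-01-01
        return "/".join(str(v) for v in values)
    # Hive-style layout: date=2024-01-01
    return "/".join(f"{k}={v}" for k, v in zip(keys, values))


def group_rows_by_partition(
    rows: Sequence[Row], partition_keys: Sequence[str]
) -> Dict[Tuple[Any, ...], List[Row]]:
    """Group rows by the tuple of their partition column values."""
    groups: Dict[Tuple[Any, ...], List[Row]] = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in partition_keys)].append(row)
    return dict(groups)


rows = [
    {"id": 1, "date": "2024-01-01"},
    {"id": 2, "date": "2024-01-01"},
    {"id": 3, "date": "2024-01-02"},
]
groups = group_rows_by_partition(rows, ["date"])
for values, members in groups.items():
    # Each group would be written under its own directory.
    print(partition_dir(["date"], values), len(members))
```

Each key of `groups` maps to one target directory, so a writer built this way produces one or more files per distinct partition value.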

The implementation validates partition columns, groups rows by partition, and applies format-specific encoding. For TEXT format, it checks that exactly one string column is written once partition columns are excluded. The writer tracks the path of every file it writes and exposes those paths as commit messages for integration with table commit workflows.
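The TEXT-format rule can be pictured with a small check. This is a sketch under stated assumptions, not the library's code: `text_data_column` is a hypothetical helper, and the schema is modeled as a mapping from column name to a type name such as "string".

```python
from typing import Dict, Sequence


def text_data_column(schema: Dict[str, str], partition_keys: Sequence[str]) -> str:
    """Return the single string data column, or raise if the rule is violated."""
    # Partition columns do not count toward the single-column rule.
    data_columns = [name for name in schema if name not in partition_keys]
    if len(data_columns) != 1:
        raise ValueError(
            f"TEXT format expects exactly one data column, got {data_columns}"
        )
    name = data_columns[0]
    if schema[name] != "string":
        raise ValueError(f"TEXT format expects a string column, got {schema[name]}")
    return name


# One string column plus a partition column passes the check.
print(text_data_column({"content": "string", "date": "string"}, ["date"]))
```

A schema with two non-partition columns, or with a single non-string column, raises `ValueError` in this sketch.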

Usage

Use FormatTableWrite when writing data to external format tables, implementing batch ETL pipelines, or exporting data from Paimon to standard file formats with optional partitioning and overwrite support.

Code Reference

Source Location

Signature

class FormatTableWrite:
    """Batch write for format table: Arrow/Pandas to partition dirs."""

    def __init__(
        self,
        table: FormatTable,
        overwrite: bool = False,
        static_partitions: Optional[Dict[str, str]] = None,
    ):
        """Initialize with table, overwrite mode, and optional static partitions."""

    def write_arrow(self, data: pyarrow.Table) -> None:
        """Write Arrow table."""

    def write_arrow_batch(self, data: pyarrow.RecordBatch) -> None:
        """Write Arrow RecordBatch."""

    def write_pandas(self, df) -> None:
        """Write Pandas DataFrame."""

    def prepare_commit(self) -> List[FormatTableCommitMessage]:
        """Prepare commit messages with written file paths."""

    def close(self) -> None:
        """Close writer and release resources."""

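The methods in the signature above suggest a write, prepare_commit, close lifecycle. The sketch below shows one way to guarantee that close() runs even when a write fails, using `contextlib.closing`. `StubWriter` is a hypothetical stand-in that only mirrors the method names; it is not pypaimon code, and the placeholder paths it returns are invented for illustration.

```python
from contextlib import closing
from typing import Any, List


class StubWriter:
    """Stand-in exposing method names from the signature; not pypaimon code."""

    def __init__(self) -> None:
        self.closed = False
        self._batches: List[Any] = []

    def write_arrow(self, data: Any) -> None:
        # A real writer would serialize data to files; the stub only records it.
        self._batches.append(data)

    def prepare_commit(self) -> List[str]:
        # Hypothetical return value: one placeholder path per recorded write.
        return [f"data-{i}.parquet" for i in range(len(self._batches))]

    def close(self) -> None:
        self.closed = True


# closing() calls writer.close() on exit, even if a write raises.
with closing(StubWriter()) as writer:
    writer.write_arrow({"id": [1, 2, 3]})
    messages = writer.prepare_commit()

print(messages, writer.closed)
```

The same pattern should apply to any object that exposes a close() method, so it needs no context-manager support from the writer itself.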
Import

from pypaimon.table.format.format_table_write import FormatTableWrite

I/O Contract

Inputs

Name | Type | Required | Description
table | FormatTable | Yes | Format table to write to
overwrite | bool | No | Whether to overwrite existing data (default False)
static_partitions | Dict[str, str] | No | Static partition specification for targeted overwrites
data | pyarrow.Table, pyarrow.RecordBatch, or pandas.DataFrame | Yes | Data to write, passed to write_arrow, write_arrow_batch, or write_pandas
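How overwrite and static_partitions might interact can be sketched with the standard library. This is an assumption-laden illustration, not the library's behavior: `clear_for_overwrite` is a hypothetical helper, it assumes Hive-style directory names, and it assumes that an overwrite without static partitions clears the whole table location.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict, Optional


def clear_for_overwrite(
    table_root: Path, static_partitions: Optional[Dict[str, str]] = None
) -> Path:
    """Remove the targeted directory and recreate it empty (hypothetical helper)."""
    target = table_root
    for key, value in (static_partitions or {}).items():
        # Assumed Hive-style layout: key=value
        target = target / f"{key}={value}"
    if target.exists():
        shutil.rmtree(target)
    target.mkdir(parents=True, exist_ok=True)
    return target


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Simulate two existing partitions, each holding one old file.
    for day in ("2024-01-01", "2024-01-02"):
        part = root / f"date={day}"
        part.mkdir()
        (part / "data-old.parquet").write_text("old")

    clear_for_overwrite(root, {"date": "2024-01-01"})
    remaining = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.parquet"))

print(remaining)
```

In this sketch only the file under date=2024-01-02 survives, which matches the idea of a targeted overwrite that leaves other partitions untouched.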

Outputs

Name | Type | Description
commit_messages | List[FormatTableCommitMessage] | Commit messages carrying the written file paths, returned by prepare_commit()

Usage Examples

from pypaimon.table.format.format_table_write import FormatTableWrite
import pyarrow as pa
import pandas as pd

# Write Pandas DataFrame
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "date": ["2024-01-01", "2024-01-01", "2024-01-02"]
})

writer = FormatTableWrite(table=format_table)
writer.write_pandas(df)
commit_messages = writer.prepare_commit()
writer.close()

# Write with overwrite mode
writer = FormatTableWrite(table=format_table, overwrite=True)
writer.write_pandas(df)
writer.prepare_commit()
writer.close()

# Write to partitioned table
partitioned_table = format_table  # has partition_keys = ["date"]
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["A", "B", "C", "D"],
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"]
})

writer = FormatTableWrite(table=partitioned_table)
writer.write_pandas(df)
# Creates: /location/date=2024-01-01/data-*.parquet
#          /location/date=2024-01-02/data-*.parquet

# Write Arrow table
arrow_table = pa.table({
    "col1": [1, 2, 3],
    "col2": ["x", "y", "z"]
})

writer = FormatTableWrite(table=format_table)
writer.write_arrow(arrow_table)

# Write Arrow record batches (appends the same rows again through the same writer)
for batch in arrow_table.to_batches():
    writer.write_arrow_batch(batch)

# Overwrite a specific partition
writer = FormatTableWrite(
    table=partitioned_table,
    overwrite=True,
    static_partitions={"date": "2024-01-01"}
)
# Pass only the rows that belong to the static partition
writer.write_pandas(df[df["date"] == "2024-01-01"])
# Only the date=2024-01-01 partition is overwritten

# TEXT format example
text_table = format_table  # format=Format.TEXT
df = pd.DataFrame({"content": ["line1", "line2", "line3"]})
writer = FormatTableWrite(table=text_table)
writer.write_pandas(df)
# Writes text file with one line per row

# Inspect the files recorded by the first prepare_commit() call above
print(f"Written files: {[msg.written_paths for msg in commit_messages]}")
