Implementation:Apache Paimon FormatTableRead

Knowledge Sources	Apache_Paimon
Domains	Format Tables, Data Reading
Last Updated	2026-02-08 00:00 GMT

Overview

FormatTableRead reads data from format tables stored in one of several file formats (Parquet, ORC, CSV, JSON, Text) and returns it as an Arrow table, a Pandas DataFrame, or a row iterator.

Description

The FormatTableRead class reads data from Apache Paimon format tables. It supports multiple file formats with automatic format detection, and it handles both full table scans and projected reads, each with an optional row limit.

The implementation reads data files into PyArrow tables, applying partition column injection, column projection, and format-specific parsing. For Parquet and ORC, it uses PyArrow's native readers. For CSV, it uses the PyArrow CSV reader, with Pandas as a fallback. JSON is parsed as line-delimited JSON, and the TEXT format is read as a single string column with a configurable line delimiter.
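The two text-based cases above can be illustrated with a minimal standard-library sketch. It is not the pypaimon implementation: the function names `parse_json_lines` and `parse_text`, the default column name `value`, and the dict-of-lists column layout are assumptions made for illustration, and the real reader produces PyArrow tables instead.

```python
import json
from typing import Dict, List


def parse_json_lines(payload: str) -> List[Dict[str, object]]:
    """Parse line-delimited JSON: one JSON object per non-empty line."""
    return [json.loads(line) for line in payload.splitlines() if line.strip()]


def parse_text(
    payload: str, column: str = "value", delimiter: str = "\n"
) -> Dict[str, List[str]]:
    """Split raw text into a single string column (hypothetical column name)."""
    records = payload.split(delimiter)
    # A trailing delimiter yields one empty final record; drop it.
    if records and records[-1] == "":
        records.pop()
    return {column: records}


json_rows = parse_json_lines('{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n')
text_column = parse_text("first|second|third|", delimiter="|")
```

Each JSON line becomes one record, while the TEXT payload becomes one column whose rows are the delimiter-separated pieces.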

The class handles partition keys by appending partition values as columns when they are not present in the data files. It can return results as an Arrow table, a Pandas DataFrame, or a row-by-row iterator. When no rows are read, it returns an empty table that preserves the schema, and the row limit is enforced across all splits rather than per split.
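The interaction between partition injection, projection, and a limit that spans splits can be sketched in plain Python. This is an illustrative model only: `read_splits` and the dict-based split layout (`"partition"` and `"rows"` keys) are hypothetical stand-ins for FormatDataSplit and the PyArrow-based reader, not the library's API.

```python
from typing import Dict, Iterable, Iterator, List, Optional

Row = Dict[str, object]


def read_splits(
    splits: Iterable[dict],
    partition_keys: List[str],
    projection: Optional[List[str]] = None,
    limit: Optional[int] = None,
) -> Iterator[Row]:
    """Yield rows from all splits, stopping once `limit` rows were emitted."""
    emitted = 0
    for split in splits:
        for row in split["rows"]:
            # The counter is shared by all splits, so the limit is global.
            if limit is not None and emitted >= limit:
                return
            merged = dict(row)
            for key in partition_keys:
                # Inject only when the data file did not store the column.
                if key not in merged:
                    merged[key] = split["partition"][key]
            if projection is not None:
                merged = {name: merged[name] for name in projection}
            yield merged
            emitted += 1


splits = [
    {"partition": {"region": "eu"}, "rows": [{"id": 1}, {"id": 2}]},
    {"partition": {"region": "us"}, "rows": [{"id": 3}, {"id": 4}]},
]
rows = list(read_splits(splits, ["region"], projection=["id", "region"], limit=3))
```

With a limit of 3, the sketch takes both rows from the first split and only one from the second, and every row carries the `region` value taken from its split's partition metadata.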

Usage

Use FormatTableRead when you need to read data from external file formats, convert between data formats, or implement custom readers for format tables with specific projection, filtering, or limit requirements.

Code Reference

Source Location

Repository: Apache_Paimon
File: paimon-python/pypaimon/table/format/format_table_read.py

Signature

class FormatTableRead:
    """Reader for format tables with Arrow/Pandas output."""

    def __init__(
        self,
        table: FormatTable,
        projection: Optional[List[str]] = None,
        limit: Optional[int] = None,
    ):
        """Initialize with table, optional projection and limit."""

    def to_arrow(self, splits: List[FormatDataSplit]) -> pyarrow.Table:
        """Read splits and return as Arrow table."""

    def to_pandas(self, splits: List[FormatDataSplit]) -> pandas.DataFrame:
        """Read splits and return as Pandas DataFrame."""

    def to_iterator(self, splits: List[FormatDataSplit]) -> Iterator[Any]:
        """Read splits and return as row-by-row iterator."""

Import

from pypaimon.table.format.format_table_read import FormatTableRead

I/O Contract

Inputs

Name	Type	Required	Description
table	FormatTable	Yes	Format table to read from
projection	List[str]	No	Column names to project
limit	int	No	Maximum rows to return
splits	List[FormatDataSplit]	Yes	Data splits to read

Outputs

Name	Type	Description
arrow_table	pyarrow.Table	Data as Arrow table
dataframe	pandas.DataFrame	Data as Pandas DataFrame
iterator	Iterator[Any]	Row-by-row iterator

Usage Examples

from pypaimon.table.format.format_table_read import FormatTableRead

# Create read with projection and limit
read = FormatTableRead(
    table=format_table,
    projection=["id", "name", "age"],
    limit=1000
)

# Scan for splits
scan = format_table.new_read_builder().new_scan()
splits = scan.plan().splits()

# Read to Arrow table
arrow_table = read.to_arrow(splits)
print(f"Read {arrow_table.num_rows} rows")
print(f"Columns: {arrow_table.column_names}")

# Read to Pandas DataFrame
df = read.to_pandas(splits)
print(df.head())

# Read as iterator for memory-efficient processing
for row in read.to_iterator(splits):
    # Process one row at a time
    print(row)

# Read all columns without limit
read_all = FormatTableRead(table=format_table)
full_data = read_all.to_arrow(splits)

# With partition columns
partitioned_table = format_table  # has partition_keys = ["date", "region"]
read = FormatTableRead(
    table=partitioned_table,
    projection=["id", "name", "date", "region"]
)
data = read.to_arrow(splits)
# Partition columns are automatically added from split metadata

# TEXT format example
text_table = format_table  # format=Format.TEXT
read = FormatTableRead(table=text_table)
# Reads single column with line-delimited text
text_data = read.to_arrow(splits)
