Implementation:Apache Paimon FormatTableRead

From Leeroopedia


Knowledge Sources
Domains: Format Tables, Data Reading
Last Updated: 2026-02-08 00:00 GMT

Overview

FormatTableRead reads data from format tables stored as Parquet, ORC, CSV, JSON, or text files and returns it as a PyArrow table, a Pandas DataFrame, or a row iterator.

Description

The FormatTableRead class reads data from Apache Paimon format tables. It supports multiple file formats with automatic format detection, and it handles both full table scans and projected reads with an optional row limit.

The implementation reads data files into PyArrow tables, with support for partition column injection, column projection, and format-specific parsing. Parquet and ORC files are read with PyArrow's native readers. CSV files are read with the PyArrow CSV reader, with Pandas as a fallback. JSON files are parsed as line-delimited JSON, and TEXT files are read as a single string column with a configurable line delimiter.

The class handles partition keys by appending the partition values as columns when they are not present in the data files. Results can be returned as an Arrow table, a Pandas DataFrame, or a row-by-row iterator. Empty results keep their schema, and the row limit is enforced across all splits rather than per split.

Usage

Use FormatTableRead when you need to read data from external file formats, convert between data formats, or implement custom readers for format tables with specific projection, filtering, or limit requirements.

Code Reference

Source Location

Signature

class FormatTableRead:
    """Reader for format tables with Arrow/Pandas output."""

    def __init__(
        self,
        table: FormatTable,
        projection: Optional[List[str]] = None,
        limit: Optional[int] = None,
    ):
        """Initialize with table, optional projection and limit."""

    def to_arrow(self, splits: List[FormatDataSplit]) -> pyarrow.Table:
        """Read splits and return as Arrow table."""

    def to_pandas(self, splits: List[FormatDataSplit]) -> pandas.DataFrame:
        """Read splits and return as Pandas DataFrame."""

    def to_iterator(self, splits: List[FormatDataSplit]) -> Iterator[Any]:
        """Read splits and return as row-by-row iterator."""

Import

from pypaimon.table.format.format_table_read import FormatTableRead

I/O Contract

Inputs

Name | Type | Required | Description
table | FormatTable | Yes | Format table to read from
projection | List[str] | No | Column names to project
limit | int | No | Maximum rows to return
splits | List[FormatDataSplit] | Yes | Data splits to read

Outputs

Name | Type | Description
arrow_table | pyarrow.Table | Data as Arrow table
dataframe | pandas.DataFrame | Data as Pandas DataFrame
iterator | Iterator[Any] | Row-by-row iterator

Usage Examples

from pypaimon.table.format.format_table_read import FormatTableRead

# Create read with projection and limit
read = FormatTableRead(
    table=format_table,
    projection=["id", "name", "age"],
    limit=1000
)

# Scan for splits
scan = format_table.new_read_builder().new_scan()
splits = scan.plan().splits()

# Read to Arrow table
arrow_table = read.to_arrow(splits)
print(f"Read {arrow_table.num_rows} rows")
print(f"Columns: {arrow_table.column_names}")

# Read to Pandas DataFrame
df = read.to_pandas(splits)
print(df.head())

# Read row by row for memory-efficient processing
for row in read.to_iterator(splits):
    # Each iteration yields a single row
    print(row)

# Read all columns without limit
read_all = FormatTableRead(table=format_table)
full_data = read_all.to_arrow(splits)

# With partition columns
partitioned_table = format_table  # has partition_keys = ["date", "region"]
read = FormatTableRead(
    table=partitioned_table,
    projection=["id", "name", "date", "region"]
)
data = read.to_arrow(splits)
# Partition columns are automatically added from split metadata

# TEXT format example
text_table = format_table  # format=Format.TEXT
read = FormatTableRead(table=text_table)
# Reads single column with line-delimited text
text_data = read.to_arrow(splits)
