Implementation:Apache Paimon FormatPyarrowReader

Knowledge Sources	Apache_Paimon
Domains	Data Reading, File Formats
Last Updated	2026-02-08 00:00 GMT

Overview

FormatPyArrowReader reads Parquet or ORC files using PyArrow with predicate pushdown and field projection support.

Description

FormatPyArrowReader is a RecordBatchReader implementation that leverages PyArrow's dataset API to read Parquet and ORC format files. It creates a PyArrow dataset scanner with specified column projection and predicate filters, providing efficient file reading with pushdown capabilities directly to the storage layer.

The reader handles schema evolution gracefully by distinguishing between fields that exist in the file and fields that are missing. For missing fields, it automatically creates null-filled columns with appropriate types, ensuring that the output RecordBatch always matches the expected schema regardless of the file's actual schema.

This implementation uses PyArrow's optimized file readers and takes advantage of features like column pruning, predicate pushdown, and batch size control. It supports both Parquet and ORC formats through PyArrow's unified dataset interface.

Usage

Use FormatPyArrowReader when reading Parquet or ORC format files in Paimon tables. It is automatically selected by the table scan based on the file format, providing high-performance reading with automatic schema handling and filter pushdown capabilities.

Code Reference

Source Location

Repository: Apache_Paimon
File: paimon-python/pypaimon/read/reader/format_pyarrow_reader.py

Signature

class FormatPyArrowReader(RecordBatchReader):
    """
    A Format Reader that reads record batch from a Parquet or ORC file using PyArrow,
    and filters it based on the provided predicate and projection.
    """

    def __init__(self, file_io: FileIO, file_format: str, file_path: str,
                 read_fields: List[str],
                 push_down_predicate: Any, batch_size: int = 1024):
        ...

    def read_arrow_batch(self) -> Optional[RecordBatch]:
        """Read next batch with missing fields filled as nulls."""
        ...

    def close(self):
        """Close the reader and release resources."""
        ...

Import

from pypaimon.read.reader.format_pyarrow_reader import FormatPyArrowReader

I/O Contract

Inputs

Name	Type	Required	Description
file_io	FileIO	Yes	File I/O abstraction for accessing the file
file_format	str	Yes	File format ("parquet" or "orc")
file_path	str	Yes	Path to the data file
read_fields	List[str]	Yes	List of field names to read (projection)
push_down_predicate	Any	No	PyArrow predicate for row filtering
batch_size	int	No	Number of rows per batch (default 1024)

Outputs

Name	Type	Description
RecordBatch	Optional[RecordBatch]	PyArrow RecordBatch with all requested fields (nulls for missing), or None at EOF

Usage Examples

from pypaimon.read.reader.format_pyarrow_reader import FormatPyArrowReader
from pypaimon.common.file_io import FileIO
import pyarrow.compute as pc

# Create file I/O
file_io = FileIO.get("file:///data/table")

# Define projection - includes fields that may not exist in file
read_fields = ["id", "name", "age", "new_column"]

# Create predicate: age > 18
predicate = pc.field("age") > 18

# Create Parquet reader
reader = FormatPyArrowReader(
    file_io=file_io,
    file_format="parquet",
    file_path="/data/table/file.parquet",
    read_fields=read_fields,
    push_down_predicate=predicate,
    batch_size=2048
)

# Read batches - missing columns will be filled with nulls
while True:
    batch = reader.read_arrow_batch()
    if batch is None:
        break
    # batch will have all 4 columns, with new_column as nulls
    print(f"Schema: {batch.schema}")
    print(f"Rows: {batch.num_rows}")

reader.close()

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment