Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Apache Paimon FormatPyarrowReader

From Leeroopedia


Knowledge Sources
Domains Data Reading, File Formats
Last Updated 2026-02-08 00:00 GMT

Overview

FormatPyArrowReader reads Parquet or ORC files using PyArrow with predicate pushdown and field projection support.

Description

FormatPyArrowReader is a RecordBatchReader implementation that leverages PyArrow's dataset API to read Parquet and ORC format files. It creates a PyArrow dataset scanner with specified column projection and predicate filters, providing efficient file reading with pushdown capabilities directly to the storage layer.

The reader handles schema evolution gracefully by distinguishing between fields that exist in the file and fields that are missing. For missing fields, it automatically creates null-filled columns with appropriate types, ensuring that the output RecordBatch always matches the expected schema regardless of the file's actual schema.

This implementation uses PyArrow's optimized file readers and takes advantage of features like column pruning, predicate pushdown, and batch size control. It supports both Parquet and ORC formats through PyArrow's unified dataset interface.

Usage

Use FormatPyArrowReader when reading Parquet or ORC format files in Paimon tables. It is automatically selected by the table scan based on the file format, providing high-performance reading with automatic schema handling and filter pushdown capabilities.

Code Reference

Source Location

Signature

class FormatPyArrowReader(RecordBatchReader):
    """
    A Format Reader that reads record batch from a Parquet or ORC file using PyArrow,
    and filters it based on the provided predicate and projection.
    """

    def __init__(self, file_io: FileIO, file_format: str, file_path: str,
                 read_fields: List[str],
                 push_down_predicate: Any, batch_size: int = 1024):
        ...

    def read_arrow_batch(self) -> Optional[RecordBatch]:
        """Read next batch with missing fields filled as nulls."""
        ...

    def close(self):
        """Close the reader and release resources."""
        ...

Import

from pypaimon.read.reader.format_pyarrow_reader import FormatPyArrowReader

I/O Contract

Inputs

Name Type Required Description
file_io FileIO Yes File I/O abstraction for accessing the file
file_format str Yes File format ("parquet" or "orc")
file_path str Yes Path to the data file
read_fields List[str] Yes List of field names to read (projection)
push_down_predicate Any No PyArrow predicate for row filtering
batch_size int No Number of rows per batch (default 1024)

Outputs

Name Type Description
RecordBatch Optional[RecordBatch] PyArrow RecordBatch with all requested fields (nulls for missing), or None at EOF

Usage Examples

from pypaimon.read.reader.format_pyarrow_reader import FormatPyArrowReader
from pypaimon.common.file_io import FileIO
import pyarrow.compute as pc

# Create file I/O
file_io = FileIO.get("file:///data/table")

# Define projection - includes fields that may not exist in file
read_fields = ["id", "name", "age", "new_column"]

# Create predicate: age > 18
predicate = pc.field("age") > 18

# Create Parquet reader
reader = FormatPyArrowReader(
    file_io=file_io,
    file_format="parquet",
    file_path="/data/table/file.parquet",
    read_fields=read_fields,
    push_down_predicate=predicate,
    batch_size=2048
)

# Read batches - missing columns will be filled with nulls
while True:
    batch = reader.read_arrow_batch()
    if batch is None:
        break
    # batch will have all 4 columns, with new_column as nulls
    print(f"Schema: {batch.schema}")
    print(f"Rows: {batch.num_rows}")

reader.close()

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment