Implementation:Apache Paimon FormatPyarrowReader
| Knowledge Sources | |
|---|---|
| Domains | Data Reading, File Formats |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
FormatPyArrowReader reads Parquet or ORC files using PyArrow with predicate pushdown and field projection support.
Description
FormatPyArrowReader is a RecordBatchReader implementation that uses PyArrow's dataset API to read Parquet and ORC files. It builds a PyArrow dataset scanner with the requested column projection and predicate filter, so column pruning and row filtering happen inside the PyArrow scan instead of after the whole file has been loaded.
The reader handles schema evolution gracefully by distinguishing between fields that exist in the file and fields that are missing. For missing fields, it automatically creates null-filled columns with appropriate types, ensuring that the output RecordBatch always matches the expected schema regardless of the file's actual schema.
This implementation uses PyArrow's optimized file readers and takes advantage of features like column pruning, predicate pushdown, and batch size control. It supports both Parquet and ORC formats through PyArrow's unified dataset interface.
Usage
Use FormatPyArrowReader when reading Parquet or ORC format files in Paimon tables. It is automatically selected by the table scan based on the file format, providing high-performance reading with automatic schema handling and filter pushdown capabilities.
Code Reference
Source Location
- Repository: Apache_Paimon
- File: paimon-python/pypaimon/read/reader/format_pyarrow_reader.py
Signature
class FormatPyArrowReader(RecordBatchReader):
    """
    A Format Reader that reads record batch from a Parquet or ORC file using PyArrow,
    and filters it based on the provided predicate and projection.
    """

    def __init__(self, file_io: FileIO, file_format: str, file_path: str,
                 read_fields: List[str],
                 push_down_predicate: Any, batch_size: int = 1024):
        ...

    def read_arrow_batch(self) -> Optional[RecordBatch]:
        """Read next batch with missing fields filled as nulls."""
        ...

    def close(self):
        """Close the reader and release resources."""
        ...
Import
from pypaimon.read.reader.format_pyarrow_reader import FormatPyArrowReader
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| file_io | FileIO | Yes | File I/O abstraction for accessing the file |
| file_format | str | Yes | File format ("parquet" or "orc") |
| file_path | str | Yes | Path to the data file |
| read_fields | List[str] | Yes | List of field names to read (projection) |
| push_down_predicate | Any | Yes | PyArrow filter expression for row filtering (the signature declares no default value) |
| batch_size | int | No | Maximum number of rows per batch (default 1024) |
Outputs
| Name | Type | Description |
|---|---|---|
| read_arrow_batch() return value | Optional[RecordBatch] | PyArrow RecordBatch with all requested fields (nulls for missing), or None at EOF |
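Because the end of the file is signalled by `None` rather than by an exception, callers typically loop until `None` and then close the reader. The sketch below wraps that contract in a generator; `StubReader` and `iter_batches` are hypothetical names used only for illustration, and the stub stands in for a real reader so the example runs without any data files.

```python
from typing import Iterator, List, Optional


class StubReader:
    """Hypothetical stand-in exposing the same read_arrow_batch()/close() contract."""

    def __init__(self, batches: List[object]):
        self._batches = list(batches)
        self.closed = False

    def read_arrow_batch(self) -> Optional[object]:
        # Return the next batch, or None once everything has been consumed.
        return self._batches.pop(0) if self._batches else None

    def close(self) -> None:
        self.closed = True


def iter_batches(reader) -> Iterator[object]:
    """Yield batches until the reader returns None, closing it in all cases."""
    try:
        while True:
            batch = reader.read_arrow_batch()
            if batch is None:
                return
            yield batch
    finally:
        reader.close()


stub = StubReader(["batch-0", "batch-1"])
collected = list(iter_batches(stub))
```

Putting `close()` in a `finally` block means the reader is released even if the consumer raises while processing a batch.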
Usage Examples
import pyarrow.compute as pc

from pypaimon.common.file_io import FileIO
from pypaimon.read.reader.format_pyarrow_reader import FormatPyArrowReader

# Create file I/O (the exact factory call may differ between pypaimon versions)
file_io = FileIO.get("file:///data/table")

# Define projection - includes fields that may not exist in file
read_fields = ["id", "name", "age", "new_column"]

# Create predicate: age > 18
predicate = pc.field("age") > 18

# Create Parquet reader
reader = FormatPyArrowReader(
    file_io=file_io,
    file_format="parquet",
    file_path="/data/table/file.parquet",
    read_fields=read_fields,
    push_down_predicate=predicate,
    batch_size=2048,
)

# Read batches - missing columns will be filled with nulls
try:
    while True:
        batch = reader.read_arrow_batch()
        if batch is None:  # None signals end of file
            break
        # batch will have all 4 columns, with new_column as nulls
        print(f"Schema: {batch.schema}")
        print(f"Rows: {batch.num_rows}")
finally:
    # Release resources even if processing a batch fails
    reader.close()