Implementation:Apache Paimon FormatReadBuilder
| Knowledge Sources | |
|---|---|
| Domains | Format Tables, Query Building |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
FormatReadBuilder is a builder class for constructing read operations on format tables with support for filtering, projection, and limits.
Description
The FormatReadBuilder class provides a fluent API for configuring read operations on Apache Paimon format tables. It allows users to specify projections (column selection), filters (partition pruning), and limits (row count restrictions) before creating scan and read objects.
The builder supports partition filtering through predicates or direct partition specification dictionaries. When a predicate is provided, it attempts to extract partition specifications for efficient partition pruning. The builder maintains configuration state and creates configured FormatTableScan and FormatTableRead instances.
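The partition-spec extraction described above can be illustrated with a minimal, self-contained sketch. The `Equal` and `And` classes and the `extract_partition_spec` function are hypothetical stand-ins, not pypaimon's actual predicate classes or its implementation. The sketch assumes that only conjunctions of equality tests on partition columns can be turned into a spec, and that any other predicate shape yields no spec (no pruning).

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Equal:
    """Hypothetical leaf predicate: field == value."""
    field: str
    value: Any


@dataclass(frozen=True)
class And:
    """Hypothetical compound predicate: every child must hold."""
    children: Tuple[Any, ...]


def extract_partition_spec(
    predicate: Any, partition_keys: Sequence[str]
) -> Optional[Dict[str, Any]]:
    """Return {partition_key: value} if the predicate allows it, else None."""
    spec: Dict[str, Any] = {}
    pending = [predicate]
    while pending:
        node = pending.pop()
        if isinstance(node, And):
            pending.extend(node.children)
        elif isinstance(node, Equal):
            if node.field not in partition_keys:
                continue  # non-partition columns cannot prune partitions
            if node.field in spec and spec[node.field] != node.value:
                return None  # conflicting values: fall back to no pruning
            spec[node.field] = node.value
        else:
            return None  # unsupported predicate shape: fall back to no pruning
    return spec or None
```

Under these assumptions, a predicate such as `date = '2024-01-01' AND region = 'us'` on a table partitioned by `date` and `region` reduces to the same dictionary that could be passed to `with_partition_filter` directly.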
The class also provides access to predicate builders for constructing complex filter expressions and exposes the read type (list of projected fields) for query planning purposes.
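To make the notion of a read type concrete, the following sketch derives the list of projected fields from a table schema and a projection. `Field` and `projected_fields` are hypothetical names used only for illustration. Returning the fields in projection order, returning all fields when no projection is set, and raising on unknown column names are assumptions, not confirmed pypaimon behavior.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a table field (name and type only)."""
    name: str
    type: str


def projected_fields(
    table_fields: List[Field], projection: Optional[List[str]]
) -> List[Field]:
    """Return the fields a read would produce for the given projection."""
    if projection is None:
        # Assumed: no projection means every table field is read.
        return list(table_fields)
    by_name = {f.name: f for f in table_fields}
    missing = [name for name in projection if name not in by_name]
    if missing:
        raise ValueError(f"Unknown columns in projection: {missing}")
    # Assumed: output order follows the projection, not the table schema.
    return [by_name[name] for name in projection]
```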
Usage
Use FormatReadBuilder when reading from format tables (Parquet, ORC, CSV, JSON, Text) with specific column projections, partition filters, or row limits to optimize query performance and reduce data transfer.
Code Reference
Source Location
- Repository: Apache_Paimon
- File: paimon-python/pypaimon/table/format/format_read_builder.py
Signature
class FormatReadBuilder:
    """Builder for constructing format table read operations."""

    def __init__(self, table: FormatTable):
        """Initialize with a FormatTable instance."""

    def with_filter(self, predicate: Predicate) -> "FormatReadBuilder":
        """Set partition filter from predicate."""

    def with_projection(self, projection: List[str]) -> "FormatReadBuilder":
        """Set column projection."""

    def with_limit(self, limit: int) -> "FormatReadBuilder":
        """Set row limit."""

    def with_partition_filter(self, partition_spec: Optional[dict]) -> "FormatReadBuilder":
        """Set partition filter directly."""

    def new_scan(self) -> FormatTableScan:
        """Create a new scan with current configuration."""

    def new_read(self) -> FormatTableRead:
        """Create a new read with current configuration."""

    def new_predicate_builder(self) -> PredicateBuilder:
        """Create a new predicate builder."""

    def read_type(self) -> List[DataField]:
        """Get the list of fields that will be read."""
Import
from pypaimon.table.format.format_read_builder import FormatReadBuilder
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| table | FormatTable | Yes | Format table to read from |
| predicate | Predicate | No | Filter predicate for partition pruning |
| projection | List[str] | No | List of column names to project |
| limit | int | No | Maximum number of rows to return |
| partition_spec | Optional[dict] | No | Partition specification (partition column to value) used directly for filtering |
Outputs
| Name | Type | Description |
|---|---|---|
| builder | FormatReadBuilder | Builder with updated configuration |
| scan | FormatTableScan | Configured scan object |
| read | FormatTableRead | Configured read object |
| predicate_builder | PredicateBuilder | New predicate builder |
| fields | List[DataField] | Fields to be read after projection |
Usage Examples
from pypaimon.table.format.format_read_builder import FormatReadBuilder
# Create read builder (`table` is an existing FormatTable instance obtained elsewhere)
builder = table.new_read_builder()
# Configure with projection and limit
builder = (builder
           .with_projection(["id", "name", "age"])
           .with_limit(1000))
# Create scan and read
scan = builder.new_scan()
read = builder.new_read()
# Execute query
splits = scan.plan().splits()
df = read.to_pandas(splits)
# With partition filter using predicate
predicate_builder = builder.new_predicate_builder()
predicate = predicate_builder.equal("date", "2024-01-01")
builder = builder.with_filter(predicate)
# With direct partition filter
builder = builder.with_partition_filter({"date": "2024-01-01", "region": "us"})
# Get read type
fields = builder.read_type()
print(f"Will read fields: {[f.name for f in fields]}")
# Full example
builder = (table.new_read_builder()
           .with_projection(["user_id", "event_type", "timestamp"])
           .with_partition_filter({"date": "2024-01-01"})
           .with_limit(10000))
scan = builder.new_scan()
read = builder.new_read()
# To Arrow
splits = scan.plan().splits()
arrow_table = read.to_arrow(splits)
print(f"Read {arrow_table.num_rows} rows")
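The full example can be folded into a small reusable helper. The sketch below calls only the builder methods listed in the Signature section. `read_partition_to_arrow` is a hypothetical name, and `table` is assumed to be a format table exposing `new_read_builder()` as in the examples above.

```python
from typing import Dict, List, Optional


def read_partition_to_arrow(
    table,
    columns: List[str],
    partition_spec: Optional[Dict[str, str]] = None,
    limit: Optional[int] = None,
):
    """Read selected columns, optionally from one partition and with a row limit."""
    builder = table.new_read_builder().with_projection(columns)
    if partition_spec:
        # Only restrict partitions when a spec was actually given.
        builder = builder.with_partition_filter(partition_spec)
    if limit is not None:
        builder = builder.with_limit(limit)
    # Plan the scan first, then hand the resulting splits to the reader.
    splits = builder.new_scan().plan().splits()
    return builder.new_read().to_arrow(splits)
```

A call such as `read_partition_to_arrow(table, ["user_id", "event_type"], {"date": "2024-01-01"}, limit=10000)` would then mirror the full example in a single step.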