Implementation:Apache Paimon ReadBuilder With Projection
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Columnar_Storage |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool for restricting reads of Lance-format Paimon tables to a subset of columns (column projection).
Description
ReadBuilder.with_projection() accepts a list of column names to select. The read_type() method resolves the projection against the table schema to produce the list of DataField objects for the selected columns. For Lance tables, this list propagates to FormatLanceReader, which reads only the specified columns from Lance files.
Because the projection is applied at the lowest level of the read pipeline, the Lance file reader fetches only the projected columns from disk. Unselected column data is never read, which reduces both I/O bandwidth and memory consumption.
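The resolution step described above can be sketched in plain Python. This is a minimal illustration, not the pypaimon source: `DataField` here is a hypothetical stand-in dataclass, `resolve_projection` is an invented helper name, and the handling of empty projections, result ordering, and unknown column names are assumptions that should be checked against read_builder.py.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DataField:
    """Hypothetical stand-in for pypaimon's DataField (id, name, type only)."""
    id: int
    name: str
    type: str


def resolve_projection(table_fields: List[DataField],
                       projection: Optional[List[str]]) -> List[DataField]:
    """Map projected column names to their schema fields.

    Assumptions in this sketch: an empty or missing projection selects every
    column, the result follows the order of `projection`, and names that are
    not in the schema are skipped.
    """
    if not projection:
        return list(table_fields)
    by_name = {field.name: field for field in table_fields}
    return [by_name[name] for name in projection if name in by_name]


# Illustrative schema; the column names and types are made up for the example.
schema = [
    DataField(0, "id", "BIGINT"),
    DataField(1, "name", "STRING"),
    DataField(2, "value", "DOUBLE"),
    DataField(3, "payload", "BYTES"),
]
selected = resolve_projection(schema, ["value", "id"])
print([field.name for field in selected])
```

In this sketch the wide `payload` column never appears in the resolved list, which is the property that lets a columnar reader skip it entirely.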
Usage
Use this implementation when reading from Lance-format tables and only a subset of columns is needed. The projection is configured on the ReadBuilder before creating the scan and reader.
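The ordering requirement can be illustrated with a toy builder. The classes below are hypothetical and are not the pypaimon implementation; the sketch assumes that a reader captures the column list at the moment it is created, which is why the projection has to be set on the builder first.

```python
from typing import List, Optional


class SketchRead:
    """Hypothetical reader that remembers the columns it was created with."""

    def __init__(self, columns: List[str]):
        self.columns = list(columns)


class SketchReadBuilder:
    """Hypothetical fluent builder; not the pypaimon class."""

    def __init__(self, table_columns: List[str]):
        self._table_columns = list(table_columns)
        self._projection: Optional[List[str]] = None

    def with_projection(self, projection: List[str]) -> "SketchReadBuilder":
        self._projection = list(projection)
        return self  # returning self allows call chaining

    def new_read(self) -> SketchRead:
        # The column list is snapshotted here, so later builder changes
        # do not affect readers that already exist.
        columns = self._projection or self._table_columns
        return SketchRead(columns)


builder = SketchReadBuilder(["id", "name", "value", "payload"])
early_reader = builder.new_read()  # created before any projection is set
late_reader = builder.with_projection(["id", "name"]).new_read()

print(early_reader.columns)
print(late_reader.columns)
```

Under that assumption, a reader created before the projection is set still sees every column, while one created afterwards sees only the projected ones.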
Code Reference
Source Location
- Repository: Apache Paimon
- File: paimon-python/pypaimon/read/read_builder.py:L46-85
Signature
class ReadBuilder:
    def with_projection(self, projection: List[str]) -> 'ReadBuilder': ...
    def read_type(self) -> List[DataField]: ...
Import
from pypaimon.read.read_builder import ReadBuilder
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| projection | List[str] | Yes | List of column names to select from the table schema |
Outputs
| Name | Type | Description |
|---|---|---|
| (with_projection) | ReadBuilder | Configured ReadBuilder instance with the column projection applied |
| (read_type) | List[DataField] | List of DataField objects for the projected columns, resolved against the table schema |
Usage Examples
Basic Usage
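# `table` is a Paimon table object obtained beforehand (e.g. from a catalog)
read_builder = table.new_read_builder()
# Set the projection before creating the scan and reader
read_builder = read_builder.with_projection(['id', 'name', 'value'])
scan = read_builder.new_scan()
splits = scan.plan().splits()
reader = read_builder.new_read()
df = reader.to_pandas(splits)  # DataFrame contains only the 3 projected columns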