Implementation:Apache Paimon TableRead Multi Format
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Columnar_Storage |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool for reading Lance-format Apache Paimon tables into PyArrow, pandas, and Ray Dataset formats.
Description
TableRead provides to_arrow(), to_pandas(), to_ray(), and to_arrow_batch_reader() methods that all operate on the same List[Split] produced by scan planning. For Lance-format tables, the internal FormatLanceReader reads Lance files and produces Arrow RecordBatches, which are then converted to the requested output format.
The conversion chain works as follows:
- to_arrow() collects all RecordBatches from the format reader and concatenates them into a single pyarrow.Table
- to_pandas() delegates to to_arrow().to_pandas() for efficient Arrow-to-pandas conversion
- to_ray() creates a RayDatasource that distributes split reading across Ray workers
- to_arrow_batch_reader() returns a streaming RecordBatchReader for memory-efficient iteration
Usage
Use this implementation when you need to read Lance-format Paimon table data and convert it to a specific output format for downstream processing.
Code Reference
Source Location
- Repository: Apache Paimon
- File: paimon-python/pypaimon/read/table_read.py:L33-176
Signature
class TableRead:
    def to_arrow(self, splits: List[Split]) -> Optional[pyarrow.Table]:
    def to_pandas(self, splits: List[Split]) -> pandas.DataFrame:
    def to_ray(self, splits: List[Split], *, ray_remote_args=None,
               concurrency=None, override_num_blocks=None,
               **read_args) -> "ray.data.dataset.Dataset":
    def to_arrow_batch_reader(self, splits: List[Split]) -> pyarrow.ipc.RecordBatchReader:
Import
from pypaimon.read.table_read import TableRead
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| splits | List[Split] | Yes | List of splits obtained from scan planning via scan.plan().splits() |
| ray_remote_args | Optional[Dict] | No | Additional arguments passed to Ray remote functions (to_ray only) |
| concurrency | Optional[int] | No | Number of concurrent read tasks for Ray (to_ray only) |
| override_num_blocks | Optional[int] | No | Override the number of output blocks for Ray parallelism (to_ray only) |
Outputs
| Name | Type | Description |
|---|---|---|
| (to_arrow) | Optional[pyarrow.Table] | PyArrow Table containing all data from the splits, or None if no data |
| (to_pandas) | pandas.DataFrame | pandas DataFrame with all data from the splits |
| (to_ray) | ray.data.dataset.Dataset | Ray Dataset for distributed processing of the split data |
| (to_arrow_batch_reader) | pyarrow.ipc.RecordBatchReader | Streaming reader for memory-efficient iteration over RecordBatches |
Usage Examples
Basic Usage
read_builder = table.new_read_builder()
scan = read_builder.new_scan()
splits = scan.plan().splits()
reader = read_builder.new_read()
# Read as PyArrow
arrow_table = reader.to_arrow(splits)
# Read as pandas
df = reader.to_pandas(splits)
# Read as Ray Dataset
ray_ds = reader.to_ray(splits)
Related Pages
Implements Principle
Requires Environment