
Implementation:Apache Paimon TableRead Multi Format

From Leeroopedia


Knowledge Sources
Domains Data_Lake, Columnar_Storage
Last Updated 2026-02-07 00:00 GMT

Overview

Concrete tool for reading Lance-format Apache Paimon tables into PyArrow, pandas, and Ray Dataset outputs.

Description

TableRead provides to_arrow(), to_pandas(), to_ray(), and to_arrow_batch_reader() methods that all operate on the same List[Split] produced by scan planning. For Lance-format tables, the internal FormatLanceReader reads Lance files and produces Arrow RecordBatches, which are then converted to the requested output format.

The conversion chain works as follows:

  • to_arrow() collects all RecordBatches from the format reader and concatenates them into a single pyarrow.Table
  • to_pandas() delegates to to_arrow().to_pandas() for efficient Arrow-to-pandas conversion
  • to_ray() creates a RayDatasource that distributes split reading across Ray workers
  • to_arrow_batch_reader() returns a streaming RecordBatchReader for memory-efficient iteration

Usage

Use this implementation when you need to read Lance-format Paimon table data and convert it to a specific output format for downstream processing.

Code Reference

Source Location

  • Repository: Apache Paimon
  • File: paimon-python/pypaimon/read/table_read.py:L33-176

Signature

class TableRead:
    def to_arrow(self, splits: List[Split]) -> Optional[pyarrow.Table]:
    def to_pandas(self, splits: List[Split]) -> pandas.DataFrame:
    def to_ray(self, splits: List[Split], *, ray_remote_args=None,
               concurrency=None, override_num_blocks=None,
               **read_args) -> "ray.data.dataset.Dataset":
    def to_arrow_batch_reader(self, splits: List[Split]) -> pyarrow.ipc.RecordBatchReader:

Import

from pypaimon.read.table_read import TableRead

I/O Contract

Inputs

Name                | Type           | Required | Description
splits              | List[Split]    | Yes      | List of splits obtained from scan planning via scan.plan().splits()
ray_remote_args     | Optional[Dict] | No       | Additional arguments passed to Ray remote functions (to_ray only)
concurrency         | Optional[int]  | No       | Number of concurrent read tasks for Ray (to_ray only)
override_num_blocks | Optional[int]  | No       | Override the number of output blocks for Ray parallelism (to_ray only)

Outputs

Name                     | Type                         | Description
(to_arrow)               | Optional[pyarrow.Table]      | PyArrow Table containing all data from the splits, or None if no data
(to_pandas)              | pandas.DataFrame             | pandas DataFrame with all data from the splits
(to_ray)                 | ray.data.dataset.Dataset     | Ray Dataset for distributed processing of the split data
(to_arrow_batch_reader)  | pyarrow.ipc.RecordBatchReader | Streaming reader for memory-efficient iteration over RecordBatches

Usage Examples

Basic Usage

read_builder = table.new_read_builder()
scan = read_builder.new_scan()
splits = scan.plan().splits()
reader = read_builder.new_read()

# Read as PyArrow
arrow_table = reader.to_arrow(splits)

# Read as pandas
df = reader.to_pandas(splits)

# Read as Ray Dataset
ray_ds = reader.to_ray(splits)

Related Pages

Implements Principle

Requires Environment
