
Implementation:Apache Paimon TableRead Multi Format

From Leeroopedia


Knowledge Sources
Domains Data_Lake, Columnar_Storage
Last Updated 2026-02-07 00:00 GMT

Overview

Concrete tool for reading Lance-format Apache Paimon tables into PyArrow, pandas, and Ray Dataset outputs.

Description

TableRead provides to_arrow(), to_pandas(), to_ray(), and to_arrow_batch_reader() methods that all operate on the same List[Split] produced by scan planning. For Lance-format tables, the internal FormatLanceReader reads Lance files and produces Arrow RecordBatches, which are then converted to the requested output format.

The conversion chain works as follows:

  • to_arrow() collects all RecordBatches from the format reader and concatenates them into a single pyarrow.Table
  • to_pandas() delegates to to_arrow().to_pandas() for efficient Arrow-to-pandas conversion
  • to_ray() creates a RayDatasource that distributes split reading across Ray workers
  • to_arrow_batch_reader() returns a streaming RecordBatchReader for memory-efficient iteration

Usage

Use this implementation when you need to read Lance-format Paimon table data and convert it to a specific output format for downstream processing.

Code Reference

Source Location

  • Repository: Apache Paimon
  • File: paimon-python/pypaimon/read/table_read.py:L33-176

Signature

class TableRead:
    def to_arrow(self, splits: List[Split]) -> Optional[pyarrow.Table]:
    def to_pandas(self, splits: List[Split]) -> pandas.DataFrame:
    def to_ray(self, splits: List[Split], *, ray_remote_args=None,
               concurrency=None, override_num_blocks=None,
               **read_args) -> "ray.data.dataset.Dataset":
    def to_arrow_batch_reader(self, splits: List[Split]) -> pyarrow.ipc.RecordBatchReader:

Import

from pypaimon.read.table_read import TableRead

I/O Contract

Inputs

Name                | Type           | Required | Description
splits              | List[Split]    | Yes      | List of splits obtained from scan planning via scan.plan().splits()
ray_remote_args     | Optional[Dict] | No       | Additional arguments passed to Ray remote functions (to_ray only)
concurrency         | Optional[int]  | No       | Number of concurrent read tasks for Ray (to_ray only)
override_num_blocks | Optional[int]  | No       | Override the number of output blocks for Ray parallelism (to_ray only)

Outputs

Name                     | Type                         | Description
(to_arrow)               | Optional[pyarrow.Table]      | PyArrow Table containing all data from the splits, or None if no data
(to_pandas)              | pandas.DataFrame             | pandas DataFrame with all data from the splits
(to_ray)                 | ray.data.dataset.Dataset     | Ray Dataset for distributed processing of the split data
(to_arrow_batch_reader)  | pyarrow.ipc.RecordBatchReader | Streaming reader for memory-efficient iteration over RecordBatches

Usage Examples

Basic Usage

read_builder = table.new_read_builder()
scan = read_builder.new_scan()
splits = scan.plan().splits()
reader = read_builder.new_read()

# Read as PyArrow
arrow_table = reader.to_arrow(splits)

# Read as pandas
df = reader.to_pandas(splits)

# Read as Ray Dataset
ray_ds = reader.to_ray(splits)

Related Pages

Implements Principle

Requires Environment
