Implementation:Apache Paimon TableRead To Arrow

From Leeroopedia


Knowledge Sources
Domains Data_Lake, Table_Format
Last Updated 2026-02-07 00:00 GMT

Overview

Concrete tool for converting Paimon table splits into PyArrow Tables, pandas DataFrames, and streaming readers.

Description

TableRead provides to_arrow(), to_pandas(), to_arrow_batch_reader(), and to_iterator() methods for materializing scan plan splits into usable data structures. It creates the appropriate SplitRead implementation (RawFileSplitRead, MergeFileSplitRead, or DataEvolutionSplitRead) based on table type and configuration. Each split is processed independently, with results concatenated into the final output. The to_arrow() method returns None if no data matches the scan plan.

Usage

Use this implementation after obtaining splits from TableScan.plan().splits(). Create a TableRead via read_builder.new_read(), then call the appropriate output method based on your downstream data processing needs.
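The choice of output method described above can be sketched as a small dispatcher. This is an illustrative helper, not part of pypaimon: the `materialize` name and the `output` keywords are assumptions made for this example, and only the four method names come from the signature documented on this page.

```python
from typing import Any, List

# Hypothetical mapping from a caller-facing keyword to the TableRead
# method documented on this page. The keywords themselves are illustrative.
_OUTPUT_METHODS = {
    "arrow": "to_arrow",                 # whole result as one pyarrow.Table
    "pandas": "to_pandas",               # whole result as a pandas.DataFrame
    "batches": "to_arrow_batch_reader",  # streaming RecordBatches
    "rows": "to_iterator",               # row-level iterator
}


def materialize(table_read: Any, splits: List[Any], output: str = "arrow") -> Any:
    """Call the TableRead output method that matches `output`."""
    try:
        method_name = _OUTPUT_METHODS[output]
    except KeyError:
        raise ValueError(
            f"unknown output {output!r}; expected one of {sorted(_OUTPUT_METHODS)}"
        ) from None
    # Look the method up by name and pass the splits straight through.
    return getattr(table_read, method_name)(splits)
```

With a real reader this would be called as `materialize(read_builder.new_read(), splits, output="pandas")`. The two in-memory forms ("arrow", "pandas") suit results that fit in memory, while "batches" and "rows" suit incremental processing.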

Code Reference

Source Location

  • Repository: Apache Paimon
  • File: paimon-python/pypaimon/read/table_read.py
  • Lines: L33-219

Signature

class TableRead:
    def __init__(self, table, predicate: Optional[Predicate],
                 read_type: List[DataField]): ...

    def to_arrow(self, splits: List[Split]) -> Optional[pyarrow.Table]: ...
    def to_pandas(self, splits: List[Split]) -> pandas.DataFrame: ...
    def to_arrow_batch_reader(self, splits: List[Split]) -> pyarrow.ipc.RecordBatchReader: ...
    def to_iterator(self, splits: List[Split]) -> Iterator: ...

Import

from pypaimon.read.table_read import TableRead

I/O Contract

Inputs

Name | Type | Required | Description
splits | List[Split] | Yes | List of splits obtained from TableScan.plan().splits()

Outputs

Name | Type | Description
to_arrow return | Optional[pyarrow.Table] | PyArrow Table containing all matching rows, or None if no data matches
to_pandas return | pandas.DataFrame | pandas DataFrame containing all matching rows
to_arrow_batch_reader return | pyarrow.ipc.RecordBatchReader | Streaming reader that yields RecordBatches one at a time
to_iterator return | Iterator | Row-level iterator over matching data
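Because to_arrow() is documented to return None when no data matches, callers need a guard before touching the result. The following is a minimal sketch of that guard; `count_rows` is a hypothetical helper name, and `table_read` is assumed to be any object exposing the to_arrow(splits) method described above.

```python
from typing import Any, List


def count_rows(table_read: Any, splits: List[Any]) -> int:
    """Return the number of rows read, treating a None result as zero rows."""
    arrow_table = table_read.to_arrow(splits)
    if arrow_table is None:
        # Per the output contract, None means no data matched the scan plan.
        return 0
    # pyarrow.Table exposes its row count as the num_rows attribute.
    return arrow_table.num_rows
```

The same check applies before calling any other pyarrow.Table method on the result, such as to_pandas().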

Usage Examples

Basic Usage

# After scan planning (assumes `table` is an already loaded Paimon table)
read_builder = table.new_read_builder()
scan = read_builder.new_scan()
plan = scan.plan()
splits = plan.splits()

# Read as PyArrow Table; to_arrow() returns None when no data matches
reader = read_builder.new_read()
arrow_table = reader.to_arrow(splits)
if arrow_table is not None:
    print(arrow_table.to_pandas())

# Or read as pandas directly
df = reader.to_pandas(splits)

# Or stream as RecordBatches; the reader is iterable and stops when exhausted
batch_reader = reader.to_arrow_batch_reader(splits)
for batch in batch_reader:
    print(batch.num_rows)  # replace with your own per-batch processing
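The streaming path is useful when the full result should not be held in memory at once. As a rough sketch, an aggregate can be computed batch by batch; `total_rows` is a hypothetical helper, and it assumes only that the reader is iterable and that each batch has a num_rows attribute, both of which hold for pyarrow's RecordBatchReader and RecordBatch.

```python
from typing import Any, Iterable


def total_rows(batch_reader: Iterable[Any]) -> int:
    """Sum row counts across batches without keeping the batches around."""
    total = 0
    for batch in batch_reader:
        # Each batch can be released once its count has been added.
        total += batch.num_rows
    return total
```

It would be called as `total_rows(reader.to_arrow_batch_reader(splits))`. A batch reader can normally be consumed only once, so request a new one for each pass over the data.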
