
Principle:Apache Paimon Multi Format Reading

From Leeroopedia


Knowledge Sources
Domains Data_Lake, Columnar_Storage
Last Updated 2026-02-07 00:00 GMT

Overview

Mechanism for reading Paimon table data into multiple output formats (PyArrow, pandas, Ray) from a single scan plan.

Description

Paimon's read pipeline can materialize data in multiple output formats from the same set of splits. TableRead exposes to_arrow(), to_pandas(), to_ray(), to_arrow_batch_reader(), and to_duckdb() methods. For Lance-format tables, the underlying FormatLanceReader handles Lance-specific file reading, while the TableRead adapter converts the result into the requested output format. This flexibility lets users choose the best representation for each downstream processing stage.

The read pipeline follows a layered architecture:

  1. Scan planning: The scan produces a list of Split objects describing which data files to read
  2. Format-specific reading: FormatLanceReader (or other format readers) reads the raw data files into Arrow RecordBatches
  3. Format conversion: TableRead converts the Arrow RecordBatches into the requested output format
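The three layers above can be sketched in plain Python. This is an illustrative stand-in, not the real pypaimon classes: Split, FormatReader, and TableRead here are hypothetical minimal versions, and a dict of column lists plays the role of an Arrow RecordBatch.

```python
from dataclasses import dataclass
from typing import Iterator

@dataclass
class Split:
    """Layer 1 output: describes one data file produced by scan planning."""
    path: str
    rows: list  # stand-in for the file's on-disk contents

class FormatReader:
    """Layer 2: reads a raw file into a column batch (RecordBatch stand-in)."""
    def read(self, split: Split) -> dict:
        batch: dict = {}
        for row in split.rows:
            for key, value in row.items():
                batch.setdefault(key, []).append(value)
        return batch

class TableRead:
    """Layer 3: converts column batches into the requested output format."""
    def __init__(self, reader: FormatReader):
        self.reader = reader

    def to_batches(self, splits: list) -> Iterator[dict]:
        # Streaming, batch-at-a-time view of the data.
        for split in splits:
            yield self.reader.read(split)

    def to_rows(self, splits: list) -> list:
        # Fully materialized, row-oriented view built from the same batches.
        rows = []
        for batch in self.to_batches(splits):
            keys = list(batch)
            rows.extend(dict(zip(keys, vals)) for vals in zip(*batch.values()))
        return rows

splits = [Split("f0.lance", [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]),
          Split("f1.lance", [{"id": 3, "v": "c"}])]
read = TableRead(FormatReader())
print(read.to_rows(splits))  # original rows, reassembled via the columnar layer
```

Note that both output methods share one reader: adding another output format means adding another converter on top of to_batches(), not another file reader.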

Usage

Use when data needs to be consumed in different formats for different processing stages. Common scenarios include:

  • pandas for exploration: Interactive data analysis and visualization in notebooks
  • PyArrow for interop: Passing data to other Arrow-compatible systems (DuckDB, Polars)
  • Ray for distributed processing: Large-scale parallel analytics and ML training
  • Arrow batch reader for streaming: Processing data in a streaming fashion without loading all into memory
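The streaming scenario in the last bullet can be illustrated with a generator: aggregate over batches one at a time so that only a single batch is ever resident in memory. batch_reader() below is a hypothetical stand-in for a batch source like to_arrow_batch_reader(), not the real API.

```python
from typing import Iterator

def batch_reader(n_batches: int, batch_size: int) -> Iterator[dict]:
    """Hypothetical batch source yielding column batches of increasing values."""
    value = 0
    for _ in range(n_batches):
        yield {"amount": [value + i for i in range(batch_size)]}
        value += batch_size

def streaming_sum(batches: Iterator[dict]) -> int:
    """Reduce over batches without materializing the whole table."""
    total = 0
    for batch in batches:  # only one batch in memory at a time
        total += sum(batch["amount"])
    return total

print(streaming_sum(batch_reader(n_batches=4, batch_size=1000)))  # 7998000
```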

Theoretical Basis

The adapter pattern allows a single read implementation to produce multiple output representations. The internal Arrow RecordBatch format serves as the universal intermediate representation.

Apache Arrow's columnar in-memory format provides a zero-copy interoperability layer between different processing engines. By using Arrow as the canonical intermediate format, Paimon avoids redundant serialization and deserialization when converting between output formats:

  • to_pandas() delegates to Arrow's built-in to_pandas() conversion
  • to_ray() wraps Arrow data in Ray's distributed dataset abstraction
  • to_duckdb() leverages DuckDB's native Arrow integration for zero-copy queries

This design follows the universal intermediate representation pattern, where a single well-defined format serves as the bridge between multiple systems.
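A minimal sketch of the pattern, under the assumption that a dict of column lists stands in for Arrow as the canonical form: every source converts *to* it and every sink converts *from* it, so n sources and m sinks need n + m adapters instead of n × m pairwise converters. All function names here are illustrative, not part of any real API.

```python
def rows_to_canonical(rows):
    """Source adapter: row dicts -> canonical columnar form."""
    cols = {}
    for row in rows:
        for key, value in row.items():
            cols.setdefault(key, []).append(value)
    return cols

def csv_to_canonical(text):
    """Source adapter: CSV text -> canonical columnar form."""
    header, *lines = text.strip().splitlines()
    names = header.split(",")
    cols = {name: [] for name in names}
    for line in lines:
        for name, value in zip(names, line.split(",")):
            cols[name].append(value)
    return cols

def canonical_to_rows(cols):
    """Sink adapter: canonical form -> row dicts (pandas-style records)."""
    return [dict(zip(cols, vals)) for vals in zip(*cols.values())]

def canonical_to_csv(cols):
    """Sink adapter: canonical form -> CSV text."""
    lines = [",".join(cols)]
    lines += [",".join(str(v) for v in vals) for vals in zip(*cols.values())]
    return "\n".join(lines)

cols = csv_to_canonical("id,v\n1,a\n2,b")
print(canonical_to_rows(cols))  # [{'id': '1', 'v': 'a'}, {'id': '2', 'v': 'b'}]
```

Unlike this sketch, Arrow adds a further benefit: because consumers such as DuckDB and pandas understand Arrow's memory layout directly, many of the sink conversions are zero-copy rather than element-by-element rewrites.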
