
Principle:Apache Paimon Data Retrieval

From Leeroopedia


Knowledge Sources
Domains Data_Lake, Table_Format
Last Updated 2026-02-07 00:00 GMT

Overview

Mechanism for reading planned splits into in-memory data structures such as PyArrow Tables and pandas DataFrames.

Description

Data retrieval is the final step in the read pipeline where planned splits are materialized into usable data. TableRead converts splits into Arrow RecordBatches, which can be collected into PyArrow Tables, pandas DataFrames, streaming RecordBatchReaders, or row-level iterators. The read process handles data file format differences (Avro, Parquet, ORC, Lance), applies deletion vectors for merge-on-read semantics, and supports schema evolution (schema changes across snapshots).
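To make the schema-evolution point concrete, here is a minimal pure-Python sketch; the `evolve_row` helper and the schemas are hypothetical stand-ins, not pypaimon API, and only illustrate the idea of projecting rows written under an older schema onto the table's current schema.

```python
# Illustrative sketch only: pypaimon performs schema evolution internally.
# OLD_SCHEMA / NEW_SCHEMA and evolve_row are hypothetical stand-ins.

OLD_SCHEMA = ["id", "name"]            # schema at an earlier snapshot
NEW_SCHEMA = ["id", "name", "score"]   # current schema (one added column)

def evolve_row(row: dict, target_schema: list) -> dict:
    """Project a row onto the target schema, filling added columns with None."""
    return {col: row.get(col) for col in target_schema}

old_rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
evolved = [evolve_row(r, NEW_SCHEMA) for r in old_rows]
# evolved[0] == {"id": 1, "name": "a", "score": None}
```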

Each split is processed independently, meaning the read operation is inherently parallelizable. The choice of output format determines the memory characteristics of the read: to_arrow() and to_pandas() materialize all data into memory at once, while to_arrow_batch_reader() and to_iterator() provide streaming access suitable for large datasets that do not fit in memory.
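The memory tradeoff between eager and streaming reads can be sketched with plain Python lists and generators standing in for Arrow batches; `read_batches` below is a hypothetical stand-in, not the pypaimon API.

```python
# Sketch of eager vs streaming consumption, using a generator of small
# lists as a stand-in for an Arrow RecordBatch stream.

def read_batches():
    """Yield record batches one at a time (stand-in for to_arrow_batch_reader())."""
    for start in range(0, 10, 2):
        yield list(range(start, start + 2))   # one small "batch"

# Eager: materialize everything at once, like to_arrow() / to_pandas().
all_rows = [row for batch in read_batches() for row in batch]

# Streaming: process one batch at a time, like iterating a RecordBatchReader;
# only one batch is resident in memory per iteration.
total = 0
for batch in read_batches():
    total += sum(batch)
```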

Usage

Use this principle after scan planning to retrieve actual data. Choose the output format based on downstream processing needs: PyArrow for in-memory analytics, pandas for data science workflows, RecordBatchReader for streaming large datasets, and iterator for row-level processing. The typical workflow involves: (1) obtaining splits from scan planning, (2) creating a TableRead from the ReadBuilder, and (3) calling the appropriate conversion method.
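The three-step workflow above can be sketched as follows. The `ReadBuilder`, `TableScan`, and `TableRead` classes here are hypothetical stand-ins written for this illustration, not the real pypaimon classes; they only mirror the builder flow of plan splits, create a read, then convert.

```python
# Hypothetical stand-in classes illustrating the plan -> read -> convert flow.

class TableScan:
    def __init__(self, splits):
        self._splits = splits
    def plan(self):
        return self            # real planning would prune files; a no-op here
    def splits(self):
        return self._splits

class TableRead:
    def to_rows(self, splits):
        # Concatenate the rows of each split, mirroring how per-split
        # RecordBatches are concatenated into one result.
        return [row for split in splits for row in split]

class ReadBuilder:
    def __init__(self, splits):
        self._splits = splits
    def new_scan(self):
        return TableScan(self._splits)
    def new_read(self):
        return TableRead()

builder = ReadBuilder(splits=[[1, 2], [3, 4]])
splits = builder.new_scan().plan().splits()   # (1) obtain splits from scan planning
read = builder.new_read()                     # (2) create a TableRead
rows = read.to_rows(splits)                   # (3) call a conversion method
```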

Theoretical Basis

Implements materialization of logical scan plans into physical data. Key concepts include:

  • Format abstraction: The read layer abstracts over multiple underlying file formats (Parquet, ORC, Avro, Lance), presenting a uniform Arrow-based interface to consumers.
  • Multiple output adapters: Adapter methods (to_arrow(), to_pandas(), to_arrow_batch_reader(), to_iterator()) convert the internal representation to the format most suitable for the downstream consumer.
  • Split-level independence: Each split can be read independently, enabling parallel processing. Results from individual splits are concatenated into the final output.
  • Merge-on-read: For tables with primary keys, the read process applies deletion vectors and merges multiple versions of the same row to present a consistent, deduplicated view.
  • Schema evolution: The read process handles schema differences across snapshots, applying column additions, removals, and type promotions transparently.
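The merge-on-read behavior described above can be sketched in pure Python; `merge_on_read` and the data layout are hypothetical stand-ins for illustration only. Rows carry a primary key and a sequence number, the newest version per key wins, and positions flagged by a deletion vector are skipped.

```python
# Hypothetical sketch of merge-on-read with deletion vectors.

def merge_on_read(file_rows, deletion_vector):
    """file_rows: {file_id: [(key, seq, value), ...]};
    deletion_vector: set of (file_id, position) pairs to drop."""
    latest = {}
    for file_id, rows in file_rows.items():
        for pos, (key, seq, value) in enumerate(rows):
            if (file_id, pos) in deletion_vector:
                continue                      # row removed by a deletion vector
            if key not in latest or seq > latest[key][0]:
                latest[key] = (seq, value)    # keep the newest version per key
    return {k: v for k, (_, v) in latest.items()}

files = {
    "f0": [("a", 1, "old"), ("b", 1, "x")],
    "f1": [("a", 2, "new"), ("c", 1, "y")],
}
dv = {("f0", 1)}                              # position 1 of f0 is deleted
view = merge_on_read(files, dv)
# view == {"a": "new", "c": "y"}
```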

Related Pages

Implemented By
