

Implementation:Apache Paimon TableRead To Pandas Verification

From Leeroopedia


Knowledge Sources
Domains Data_Lake, Data_Ingestion
Last Updated 2026-02-07 00:00 GMT

Overview

Concrete tool for reading Paimon table data into a pandas DataFrame for post-ingestion verification.

Description

Uses the standard Paimon read pipeline: table.new_read_builder() creates a read builder, the builder's new_scan() creates a scanner, plan() generates a scan plan, and splits() returns the list of data splits. The builder's new_read() then creates a TableRead, and its to_pandas(splits) call reads all splits and returns a single pandas DataFrame.

In the verification context, reading the table back confirms that the data written through Ray is visible after the commit. The to_pandas() method:

  1. Reads each split from the file store.
  2. Converts the data from Paimon's internal format to Arrow record batches.
  3. Concatenates all record batches into a single Arrow Table.
  4. Converts the Arrow Table to a pandas DataFrame using the Arrow Table's own to_pandas() method.

The resulting DataFrame can be inspected for row counts, column types, and data values to validate the ingestion.

Usage

Use as the final step in a Ray-based Paimon ingestion pipeline to verify that data was correctly written and committed. Call after write_ray() completes successfully.

Code Reference

Source Location

  • Repository: Apache Paimon
  • File: paimon-python/pypaimon/read/table_read.py:L119-121

Signature

class TableRead:
    def to_pandas(self, splits: List[Split]) -> pandas.DataFrame:

Import

from pypaimon.read.table_read import TableRead

I/O Contract

Inputs

Name | Type | Required | Description
splits | List[Split] | Yes | List of data splits obtained from scan.plan().splits(). Each split represents a range of data in the file store to read.

Outputs

Name | Type | Description
df | pandas.DataFrame | A pandas DataFrame containing all rows from the specified splits. Columns correspond to the Paimon table's schema fields, with types converted from Paimon/Arrow to pandas types.

Usage Examples

Basic Usage

# Verify ingestion.
# Assumes `table` is a Paimon table handle that was loaded earlier (for example, from a catalog).
read_builder = table.new_read_builder()
scan = read_builder.new_scan()
plan = scan.plan()
splits = plan.splits()

reader = read_builder.new_read()
df = reader.to_pandas(splits)
print(f"Verified: {len(df)} rows ingested")
print(df.head())
print(df.dtypes)

Comprehensive Verification

# Full verification with assertions
read_builder = table.new_read_builder()
scan = read_builder.new_scan()
splits = scan.plan().splits()
reader = read_builder.new_read()
df = reader.to_pandas(splits)

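# Validate row count.
# `dataset` is assumed to be the Ray Dataset that was written with write_ray().
# The comparison only holds if the table was empty before this ingestion.
expected_rows = dataset.count()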
actual_rows = len(df)
assert actual_rows == expected_rows, (
    f"Row count mismatch: expected {expected_rows}, got {actual_rows}"
)

# Validate schema
expected_columns = ['user_id', 'event', 'timestamp', 'value']
assert list(df.columns) == expected_columns, (
    f"Column mismatch: expected {expected_columns}, got {list(df.columns)}"
)

# Validate no null primary keys
assert df['user_id'].notna().all(), "Found null values in primary key column"

print(f"Verification passed: {actual_rows} rows, {len(df.columns)} columns")
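The assertions above stop at the first failure. When a full report is more useful, the same three checks can be collected into a helper that returns every problem it finds. This is a minimal sketch using only pandas; find_ingestion_problems and its parameters are hypothetical names introduced for illustration, not part of pypaimon, and the sample DataFrame is a stand-in for the result of reader.to_pandas(splits).

```python
from typing import List

import pandas as pd


def find_ingestion_problems(
    df: pd.DataFrame,
    expected_rows: int,
    expected_columns: List[str],
    key_column: str,
) -> List[str]:
    """Return a description of every failed check (empty list means all passed)."""
    problems = []

    # Row count check.
    if len(df) != expected_rows:
        problems.append(
            f"Row count mismatch: expected {expected_rows}, got {len(df)}"
        )

    # Schema check: column names and their order.
    if list(df.columns) != list(expected_columns):
        problems.append(
            f"Column mismatch: expected {expected_columns}, got {list(df.columns)}"
        )

    # Null check on the key column, only if the column is present.
    if key_column in df.columns:
        null_keys = int(df[key_column].isna().sum())
        if null_keys > 0:
            problems.append(f"Found {null_keys} null value(s) in '{key_column}'")

    return problems


# Stand-in for the DataFrame returned by reader.to_pandas(splits).
sample = pd.DataFrame({"user_id": [1, None], "event": ["click", "view"]})
problems = find_ingestion_problems(
    sample,
    expected_rows=3,
    expected_columns=["user_id", "event"],
    key_column="user_id",
)
for problem in problems:
    print(problem)
```

Returning a list instead of raising lets a pipeline log all discrepancies in one pass and decide afterwards whether to fail the run.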
