Principle: Apache Paimon Distributed Dataset Creation
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Distributed_Computing |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Mechanism for converting Paimon table splits into distributed Ray Datasets for parallel processing.
Description
Distributed dataset creation bridges Paimon's scan-then-read pipeline with Ray's distributed data processing framework. After scan planning produces a list of splits, TableRead.to_ray() wraps each split as a Ray read task via RayDatasource. Ray then schedules these tasks across available workers, creating a distributed Dataset that can be processed in parallel. The RayDatasource adapter handles split distribution, Arrow schema propagation, and block management to ensure efficient parallelism.
The pipeline consists of:
- Scan planning - produces a list of splits describing the data to read
- Datasource creation - wraps splits into a RayDatasource adapter
- Task scheduling - Ray distributes read tasks across workers
- Block assembly - each worker produces Arrow RecordBatches that form Dataset blocks
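The four stages above can be sketched in plain Python. This is a toy simulation, not the pypaimon or Ray API: a thread pool stands in for Ray's worker scheduling, and lists of dicts stand in for Arrow RecordBatches; all function names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for scan planning: each split describes a slice of the table.
def plan_splits(rows, split_size):
    return [rows[i:i + split_size] for i in range(0, len(rows), split_size)]

# Stand-in for one read task: one split -> one block
# (a list of dicts in place of an Arrow RecordBatch).
def read_split(split):
    return [{"id": r, "value": r * 10} for r in split]

rows = list(range(10))
splits = plan_splits(rows, split_size=4)           # 1. scan planning
with ThreadPoolExecutor(max_workers=4) as pool:    # 3. task scheduling
    blocks = list(pool.map(read_split, splits))    # 4. block assembly
# Step 2 (datasource creation) corresponds to handing `splits` to the pool.
```

Because each split is read by an independent task, the blocks can be produced in any order by any worker, which is exactly what lets Ray parallelize the read.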
Usage
Use this principle when processing large Paimon tables that benefit from distributed execution. The resulting Ray Dataset supports map, filter, groupby, and other parallel transformations.
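Those parallel transformations all follow the same pattern: each operator runs per block, independently, and only aggregations reduce across blocks. A stdlib sketch of that pattern (lists of dicts again standing in for Arrow blocks; not the Ray Dataset API):

```python
from concurrent.futures import ThreadPoolExecutor

# Blocks as produced by the read stage.
blocks = [
    [{"user": "a", "amount": 5}, {"user": "b", "amount": 7}],
    [{"user": "a", "amount": 3}, {"user": "c", "amount": 9}],
]

# Map + filter applied per block, in parallel, mirroring how a
# distributed Dataset transforms data block by block.
def transform(block):
    return [{**row, "amount": row["amount"] * 2}
            for row in block if row["amount"] >= 5]

with ThreadPoolExecutor() as pool:
    transformed = list(pool.map(transform, blocks))

# A groupby-style aggregation then reduces across all blocks.
totals = {}
for block in transformed:
    for row in block:
        totals[row["user"]] = totals.get(row["user"], 0) + row["amount"]
```

The map and filter stages never communicate between blocks; only the final aggregation needs to see results from every worker.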
Theoretical Basis
Follows the data-parallel processing model where independent data partitions (splits) are processed concurrently across workers. The Datasource abstraction in Ray Data provides a pluggable interface for custom data sources. Paimon's split-based architecture maps naturally to Ray's block-based parallel execution model.
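The pluggable-datasource idea can be modeled as a small interface: a datasource turns its splits into zero-argument read tasks, each producing one block. This is a simplified toy loosely patterned on the concept, not Ray Data's actual Datasource/ReadTask classes:

```python
from abc import ABC, abstractmethod

# Toy model of a pluggable datasource interface (illustrative only).
class Datasource(ABC):
    @abstractmethod
    def get_read_tasks(self):
        """Return zero-argument callables, one per split; each yields a block."""

class SplitDatasource(Datasource):
    def __init__(self, splits):
        self.splits = splits

    def get_read_tasks(self):
        # Each task closes over exactly one split, so tasks are
        # independent by construction and can run on any worker.
        return [lambda s=s: [{"value": v} for v in s] for s in self.splits]

src = SplitDatasource([[1, 2], [3]])
tasks = src.get_read_tasks()
blocks = [task() for task in tasks]
```

Because the executor only sees opaque callables, any split-based source, Paimon included, can plug into the same block-parallel execution model.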
Key properties of the data-parallel model:
- Independence - each split can be read without coordination with other splits
- Locality - the Ray scheduler can co-locate read tasks with their data when possible
- Scalability - adding workers increases aggregate read throughput, roughly linearly until storage bandwidth saturates
- Fault tolerance - failed read tasks can be retried independently
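The fault-tolerance property follows directly from independence: a failed read task can be re-run on its own without touching any other split. A minimal sketch of per-task retry (all names illustrative; Ray performs this retry internally):

```python
# Attempt counter per split index, to simulate one transient failure.
attempts = {0: 0, 1: 0}

def flaky_read(idx, split):
    attempts[idx] += 1
    if idx == 0 and attempts[idx] == 1:   # split 0 fails on its first try
        raise IOError("transient read failure")
    return [{"value": v} for v in split]

def read_with_retry(idx, split, max_attempts=3):
    # Retry only this split; other splits are unaffected.
    for _ in range(max_attempts):
        try:
            return flaky_read(idx, split)
        except IOError:
            continue
    raise RuntimeError(f"split {idx} failed after {max_attempts} attempts")

splits = [[1, 2], [3, 4]]
blocks = [read_with_retry(i, s) for i, s in enumerate(splits)]
```

Note that split 0 is read twice while split 1 is read exactly once: the retry is scoped to the failing task, not the whole job.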