
Principle:Apache Paimon Ray Data Source Reading

From Leeroopedia


Knowledge Sources
Domains: Data_Lake, Data_Ingestion
Last Updated: 2026-02-07 00:00 GMT

Overview

Mechanism for reading external data sources (JSON, CSV, Parquet) into distributed Ray Datasets for ingestion into data lake tables.

Description

Before data can be written to Paimon tables via Ray, it must be loaded into a Ray Dataset from an external source. Ray Data provides read functions (read_json, read_csv, read_parquet) that parallelize data loading across workers. The loaded Dataset can then be transformed and written to Paimon via the write_ray() sink. This is the entry point for ETL pipelines that move data from external formats into Paimon's managed table format.

Ray Data's read functions handle:

  • Distributed file discovery: Glob patterns and directory scanning across local or cloud storage (S3, GCS, HDFS).
  • Parallel reading: Multiple Ray workers read file chunks concurrently, controlled by the concurrency parameter.
  • Schema inference: Automatic detection of column names and types from the source data format.
  • Lazy execution: Datasets are lazily evaluated, meaning data is not fully materialized until a consuming operation (such as write_datasink) triggers execution.

The resulting ray.data.Dataset is a distributed, immutable collection of Arrow-backed record batches that can be transformed (filtered, mapped, cast) before writing to a Paimon table.

Usage

Use when ingesting external data files into Paimon tables via distributed Ray processing. This is the first step in any Ray-based Paimon ingestion pipeline:

  1. Read external data into a Ray Dataset using ray.data.read_json(), ray.data.read_csv(), or ray.data.read_parquet().
  2. Optionally transform or align the schema of the Dataset.
  3. Write the Dataset to a Paimon table via write_ray().

Theoretical Basis

This principle follows the extract-transform-load (ETL) pattern where extraction is parallelized via distributed file reading. The extraction phase leverages Ray's distributed task scheduling to read data in parallel across multiple workers, maximizing I/O throughput. Each worker reads a subset of the input files (or file chunks), producing Arrow record batches that form the distributed Dataset.

Key theoretical properties:

  • Data parallelism: Input files are partitioned across workers, enabling linear scaling of read throughput with available resources.
  • Fault tolerance: Ray's task retry mechanism handles transient read failures without restarting the entire pipeline.
  • Format abstraction: The read functions abstract away format-specific parsing (JSON line splitting, CSV delimiter handling, Parquet column pruning), providing a uniform Dataset interface regardless of the source format.

Related Pages

Implemented By
