
Implementation:Apache Paimon Ray Data Read Json

From Leeroopedia


Knowledge Sources
Domains Data_Lake, Data_Ingestion
Last Updated 2026-02-07 00:00 GMT

Overview

An external Ray Data function that reads JSON files into a distributed Ray Dataset, serving as the loading step for Paimon ingestion.

Description

ray.data.read_json() reads JSON files or directories into a distributed Dataset. It supports glob patterns, concurrent reading, and schema inference. This function is used as the data source in Paimon ingestion pipelines before calling write_ray() to write data into a Paimon table.

Key capabilities:

  • Glob pattern support: Accepts wildcards (e.g., s3://bucket/data/*.json) to read multiple files at once.
  • Concurrent reading: The concurrency parameter controls how many Ray tasks read files in parallel.
  • Schema inference: Automatically infers column names and types from the JSON structure using Apache Arrow's JSON parser.
  • Arrow integration: Additional keyword arguments are forwarded to Arrow's JSON parser (arrow_json_args), enabling fine-grained control over parsing behavior such as explicit schema specification and block size.

This is an external tool provided by the Ray Data library, not part of the Paimon codebase itself. It serves as the entry point for loading data that will subsequently be written to Paimon tables.

Usage

Use ray.data.read_json() as the first step in a Ray-based Paimon ingestion pipeline to load JSON data from local or cloud storage into a distributed Ray Dataset.

Code Reference

Source Location

Signature

ray.data.read_json(
    paths: Union[str, List[str]],
    *,
    concurrency: Optional[int] = None,
    **arrow_json_args,
) -> ray.data.Dataset

Import

import ray.data

I/O Contract

Inputs

Name Type Required Description
paths Union[str, List[str]] Yes File path(s), directory path(s), or glob pattern(s) pointing to JSON files. Supports local paths, S3, GCS, and HDFS URIs.
concurrency Optional[int] No Number of parallel read tasks. Controls the degree of parallelism for reading files across Ray workers.
arrow_json_args **kwargs No Additional keyword arguments forwarded to Arrow's JSON parser (e.g., read_options, parse_options, block_size).

Outputs

Name Type Description
dataset ray.data.Dataset A distributed Ray Dataset containing the loaded JSON data as Arrow-backed record batches. Each row corresponds to a JSON object from the input files.

Usage Examples

Basic Usage

import ray

ray.init(ignore_reinit_error=True, num_cpus=4)
dataset = ray.data.read_json("s3://bucket/data/*.json", concurrency=4)
print(f"Loaded {dataset.count()} rows")

Reading from Local Directory

import ray

ray.init(ignore_reinit_error=True)

# Read all JSON files from a local directory
dataset = ray.data.read_json("/data/events/2026-02-07/")
print(dataset.schema())
print(f"Loaded {dataset.count()} rows from local directory")

Related Pages

Implements Principle

Requires Environment
