Implementation: Apache Paimon Ray Data Read JSON
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Data_Ingestion |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
External tool for reading JSON data files into distributed Ray Datasets for Paimon ingestion.
Description
ray.data.read_json() reads JSON files or directories into a distributed Dataset. It supports glob patterns, concurrent reading, and schema inference. This function is used as the data source in Paimon ingestion pipelines before calling write_ray() to write data into a Paimon table.
Key capabilities:
- Glob pattern support: Accepts wildcards (e.g., s3://bucket/data/*.json) to read multiple files at once.
- Concurrent reading: The concurrency parameter controls how many Ray tasks read files in parallel.
- Schema inference: Automatically infers column names and types from the JSON structure using Apache Arrow's JSON parser.
- Arrow integration: Additional keyword arguments are forwarded to Arrow's JSON parser (arrow_json_args), enabling fine-grained control over parsing behavior such as explicit schema specification and block size.
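The glob, concurrency, and schema-inference behaviors listed above can be sketched in plain Python. This is a conceptual stand-in using only the standard library, not Ray's actual implementation; the file names and thread-pool size are illustrative assumptions:

```python
import glob
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Create a few newline-delimited JSON files to read (illustrative data).
tmpdir = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(tmpdir, f"part-{i}.json"), "w") as f:
        f.write(json.dumps({"id": i, "value": i * 10}) + "\n")

def read_file(path):
    # One "read task": parse every JSON object in a single file.
    with open(path) as f:
        return [json.loads(line) for line in f]

# Glob pattern support: expand the wildcard into concrete file paths.
paths = sorted(glob.glob(os.path.join(tmpdir, "*.json")))

# Concurrent reading: a thread pool plays the role of parallel Ray tasks.
with ThreadPoolExecutor(max_workers=2) as pool:
    rows = [row for part in pool.map(read_file, paths) for row in part]

# Schema inference: derive column names from the first row's keys.
schema = sorted(rows[0].keys())
print(schema, len(rows))  # → ['id', 'value'] 3
```

In the real API, all three behaviors are handled inside ray.data.read_json() itself; the sketch only separates them to show what each capability contributes.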
This is an external tool provided by the Ray Data library, not part of the Paimon codebase itself. It serves as the entry point for loading data that will subsequently be written to Paimon tables.
Usage
Use ray.data.read_json() as the first step in a Ray-based Paimon ingestion pipeline to load JSON data from local or cloud storage into a distributed Ray Dataset.
Code Reference
Source Location
- Repository: Ray Project
- Module: ray.data
- External Reference: ray.data.read_json API Documentation
Signature
ray.data.read_json(
paths: Union[str, List[str]],
*,
concurrency: Optional[int] = None,
**arrow_json_args,
) -> ray.data.Dataset
Import
import ray.data
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| paths | Union[str, List[str]] | Yes | File path(s), directory path(s), or glob pattern(s) pointing to JSON files. Supports local paths, S3, GCS, and HDFS URIs. |
| concurrency | Optional[int] | No | Number of parallel read tasks. Controls the degree of parallelism for reading files across Ray workers. |
| arrow_json_args | **kwargs | No | Additional keyword arguments forwarded to Arrow's JSON parser (e.g., read_options, parse_options, block_size). |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | ray.data.Dataset | A distributed Ray Dataset containing the loaded JSON data as Arrow-backed record batches. Each row corresponds to a JSON object from the input files. |
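The mapping described in the Outputs row, where each JSON object becomes one row of a columnar batch, can be illustrated with a minimal standard-library sketch (not Ray's actual Arrow machinery; the sample records are assumptions):

```python
import json

# Each line is one JSON object; each object becomes one row of the dataset.
lines = [
    '{"user": "a", "clicks": 3}',
    '{"user": "b", "clicks": 7}',
]
rows = [json.loads(line) for line in lines]

# Arrow-style columnar layout: one list per column, aligned by row index.
columns = {key: [row[key] for row in rows] for key in rows[0]}
print(columns)  # → {'user': ['a', 'b'], 'clicks': [3, 7]}
```

In a real Ray Dataset the columns live in Arrow record batches distributed across workers, but the row-to-column correspondence is the same.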
Usage Examples
Basic Usage
import ray
ray.init(ignore_reinit_error=True, num_cpus=4)
dataset = ray.data.read_json("s3://bucket/data/*.json", concurrency=4)
print(f"Loaded {dataset.count()} rows")
Reading from Local Directory
import ray
ray.init(ignore_reinit_error=True)
# Read all JSON files from a local directory
dataset = ray.data.read_json("/data/events/2026-02-07/")
print(dataset.schema())
print(f"Loaded {dataset.count()} rows from local directory")