Implementation: Apache Paimon Ray Data Read JSON
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Data_Ingestion |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
External tool for reading JSON data files into distributed Ray Datasets for Paimon ingestion.
Description
ray.data.read_json() reads JSON files or directories into a distributed Dataset. It supports glob patterns, concurrent reading, and schema inference. This function is used as the data source in Paimon ingestion pipelines before calling write_ray() to write data into a Paimon table.
Key capabilities:
- Glob pattern support: Accepts wildcards (e.g., s3://bucket/data/*.json) to read multiple files at once.
- Concurrent reading: The concurrency parameter controls how many Ray tasks read files in parallel.
- Schema inference: Automatically infers column names and types from the JSON structure using Apache Arrow's JSON parser.
- Arrow integration: Additional keyword arguments are forwarded to Arrow's JSON parser (arrow_json_args), enabling fine-grained control over parsing behavior such as explicit schema specification and block size.
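The glob, concurrency, and schema-inference behaviors listed above can be sketched in plain Python. This is a conceptual stand-in using only the standard library, not Ray's actual implementation; the file names and thread-pool size are illustrative assumptions:

```python
import glob
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Create a few newline-delimited JSON files to read (illustrative data).
tmpdir = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(tmpdir, f"part-{i}.json"), "w") as f:
        f.write(json.dumps({"id": i, "value": i * 10}) + "\n")

def read_file(path):
    # One "read task": parse every JSON object in a single file.
    with open(path) as f:
        return [json.loads(line) for line in f]

# Glob pattern support: expand the wildcard into concrete file paths.
paths = sorted(glob.glob(os.path.join(tmpdir, "*.json")))

# Concurrent reading: a thread pool plays the role of parallel Ray tasks.
with ThreadPoolExecutor(max_workers=2) as pool:
    rows = [row for part in pool.map(read_file, paths) for row in part]

# Schema inference: derive column names from the first row's keys.
schema = sorted(rows[0].keys())
print(schema, len(rows))  # → ['id', 'value'] 3
```

In the real API, all three behaviors are handled inside ray.data.read_json() itself; the sketch only separates them to show what each capability contributes.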
This is an external tool provided by the Ray Data library, not part of the Paimon codebase itself. It serves as the entry point for loading data that will subsequently be written to Paimon tables.
Usage
Use ray.data.read_json() as the first step in a Ray-based Paimon ingestion pipeline to load JSON data from local or cloud storage into a distributed Ray Dataset.
Code Reference
Source Location
- Repository: Ray Project
- Module: ray.data
- External Reference: ray.data.read_json API Documentation
Signature
ray.data.read_json(
paths: Union[str, List[str]],
*,
concurrency: Optional[int] = None,
**arrow_json_args,
) -> ray.data.Dataset
Import
import ray.data
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| paths | Union[str, List[str]] | Yes | File path(s), directory path(s), or glob pattern(s) pointing to JSON files. Supports local paths, S3, GCS, and HDFS URIs. |
| concurrency | Optional[int] | No | Number of parallel read tasks. Controls the degree of parallelism for reading files across Ray workers. |
| arrow_json_args | **kwargs | No | Additional keyword arguments forwarded to Arrow's JSON parser (e.g., read_options, parse_options, block_size). |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | ray.data.Dataset | A distributed Ray Dataset containing the loaded JSON data as Arrow-backed record batches. Each row corresponds to a JSON object from the input files. |
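The mapping described in the Outputs row, where each JSON object becomes one row of a columnar batch, can be illustrated with a minimal standard-library sketch (not Ray's actual Arrow machinery; the sample records are assumptions):

```python
import json

# Each line is one JSON object; each object becomes one row of the dataset.
lines = [
    '{"user": "a", "clicks": 3}',
    '{"user": "b", "clicks": 7}',
]
rows = [json.loads(line) for line in lines]

# Arrow-style columnar layout: one list per column, aligned by row index.
columns = {key: [row[key] for row in rows] for key in rows[0]}
print(columns)  # → {'user': ['a', 'b'], 'clicks': [3, 7]}
```

In a real Ray Dataset the columns live in Arrow record batches distributed across workers, but the row-to-column correspondence is the same.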
Usage Examples
Basic Usage
import ray
ray.init(ignore_reinit_error=True, num_cpus=4)
dataset = ray.data.read_json("s3://bucket/data/*.json", concurrency=4)
print(f"Loaded {dataset.count()} rows")
Reading from Local Directory
import ray
ray.init(ignore_reinit_error=True)
# Read all JSON files from a local directory
dataset = ray.data.read_json("/data/events/2026-02-07/")
print(dataset.schema())
print(f"Loaded {dataset.count()} rows from local directory")