Principle:Datajuicer Data juicer Distributed Data Export

Knowledge Sources	Data-Juicer Ray Data
Domains	Distributed_Computing, Data_Engineering
Last Updated	2026-02-14 17:00 GMT

Overview

A distributed serialization pattern that exports Ray datasets to multiple sharded files in various formats using Ray's parallel write capabilities.

Description

Distributed Data Export writes processed Ray datasets to disk using Ray's native parallel write methods. Unlike the single-machine Exporter, which writes from one process, it uses Ray's distributed I/O to write multiple shards concurrently across cluster nodes. It also supports a wider range of formats than the single-machine Exporter (JSON, JSONL, Parquet, CSV, TFRecords, WebDataset, Lance) and can write to both local filesystems and cloud storage (S3).
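The format support described above amounts to routing a format name to the matching Ray Data writer method. The following is a minimal sketch of that routing, not the RayExporter code: `WRITER_METHODS`, `resolve_writer`, and `FakeDataset` are hypothetical names introduced for illustration, and mapping both "json" and "jsonl" to `write_json` is an assumption. The stub dataset lets the sketch run without a Ray cluster.

```python
# Hypothetical routing table: export format name -> Ray Data writer method name.
# The method names follow Ray Data's Dataset.write_* family; the table itself
# is an illustration, not the RayExporter implementation.
WRITER_METHODS = {
    "json": "write_json",
    "jsonl": "write_json",  # assumption: the JSON writer emits one record per line
    "parquet": "write_parquet",
    "csv": "write_csv",
    "tfrecords": "write_tfrecords",
    "webdataset": "write_webdataset",
    "lance": "write_lance",
}


def resolve_writer(dataset, export_format):
    """Return the bound writer method of `dataset` for `export_format`."""
    key = export_format.lower().lstrip(".")
    if key not in WRITER_METHODS:
        raise ValueError(f"Unsupported export format: {export_format!r}")
    return getattr(dataset, WRITER_METHODS[key])


class FakeDataset:
    """Stand-in for a Ray dataset so the sketch runs without a cluster."""

    def write_json(self, path, **kwargs):
        return ("json", path, kwargs)

    def write_parquet(self, path, **kwargs):
        return ("parquet", path, kwargs)


# Usage: resolve the writer once, then call it with the export path.
writer = resolve_writer(FakeDataset(), "jsonl")
result = writer("/tmp/out")
```

Rejecting unknown formats before any write starts keeps a misconfigured export from failing midway through a long distributed job.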

Usage

Use this principle as the final step in distributed Ray pipelines to persist processed results. It replaces the standard Exporter when operating in Ray mode.

Theoretical Basis

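# Abstract sketch of the Ray parallel write pattern (NOT the real implementation).
# `export_format` is a placeholder for a Ray Data writer suffix, e.g. "json", "parquet", "csv".
write_fn = getattr(ray_dataset, f"write_{export_format}")
write_fn(
    path=export_path,
    num_rows_per_file=shard_size,  # requested rows per output shard
    # Ray schedules the write tasks in parallel automatically
)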
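The sharding idea behind this pattern can be illustrated without a cluster. The sketch below uses only the Python standard library, with a thread pool standing in for Ray's write tasks, so it is an analogy rather than Ray's mechanism. All function names and the `shard-00000.jsonl` naming scheme are hypothetical, and Ray's actual file boundaries depend on how the dataset is partitioned into blocks.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def split_into_shards(rows, shard_size):
    """Split rows into consecutive chunks of at most shard_size rows."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    return [rows[i:i + shard_size] for i in range(0, len(rows), shard_size)]


def write_shard(export_path, index, shard):
    """Write one shard as a JSON Lines file and return its path."""
    file_path = os.path.join(export_path, f"shard-{index:05d}.jsonl")
    with open(file_path, "w", encoding="utf-8") as f:
        for row in shard:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return file_path


def export_sharded(rows, export_path, shard_size, max_workers=4):
    """Write all shards concurrently; each shard becomes its own file."""
    os.makedirs(export_path, exist_ok=True)
    shards = split_into_shards(rows, shard_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(write_shard, export_path, i, shard)
            for i, shard in enumerate(shards)
        ]
        # Collect results in submission order so paths stay sorted by shard index.
        return [future.result() for future in futures]


def demo():
    """Export ten rows in shards of four and report file names and row counts."""
    rows = [{"id": i, "text": f"sample {i}"} for i in range(10)]
    with tempfile.TemporaryDirectory() as tmp:
        paths = export_sharded(rows, tmp, shard_size=4)
        counts = []
        for path in paths:
            with open(path, encoding="utf-8") as f:
                counts.append(sum(1 for _ in f))
        return [os.path.basename(p) for p in paths], counts
```

Because each shard is written to its own file, the writers never contend for a shared file handle, which is what allows the export to scale with the number of workers.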

Related Pages

Implemented By

Implementation:Datajuicer_Data_juicer_RayExporter_Export
