Implementation:Datajuicer Data juicer RayExporter Export

From Leeroopedia
Knowledge Sources
Domains: Distributed_Computing, Data_Engineering
Last Updated: 2026-02-14 17:00 GMT

Overview

Concrete tool, provided by the Data-Juicer framework, for exporting Ray datasets to sharded files in multiple formats.

Description

The RayExporter class writes Ray datasets to disk using format-specific Ray write methods. It supports JSON, JSONL, Parquet, CSV, TFRecords, WebDataset, and Lance formats. It handles column filtering (removing internal stats/hash columns), shard size configuration, and both local and S3 output paths.
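
The format dispatch and column filtering described above can be sketched roughly as follows. This is a minimal illustration, not the RayExporter source: the `WRITE_METHODS` table, the helper names, and the `__dj__hash__` column name are assumptions made for the example. Only `__dj__stats__` and the list of formats come from this page, and the listed `ray.data.Dataset` writer methods are assumed to exist in the installed Ray version.

```python
# Hypothetical sketch; names below are illustrative, not Data-Juicer's API.

# Export type -> name of the ray.data.Dataset writer assumed to handle it.
WRITE_METHODS = {
    "json": "write_json",
    "jsonl": "write_json",
    "parquet": "write_parquet",
    "csv": "write_csv",
    "tfrecords": "write_tfrecords",
    "webdataset": "write_webdataset",
    "lance": "write_lance",
}

STATS_COLUMN = "__dj__stats__"      # named in the constructor docstring
HASH_COLUMNS = ("__dj__hash__",)    # placeholder name, an assumption


def write_method_name(export_type):
    """Return the writer method name for a supported export type."""
    try:
        return WRITE_METHODS[export_type.lower()]
    except KeyError:
        raise ValueError(f"unsupported export type: {export_type!r}") from None


def columns_to_drop(columns, keep_stats, keep_hashes, hash_columns=HASH_COLUMNS):
    """List the internal columns that should be removed before writing."""
    drop = []
    for name in columns:
        if name == STATS_COLUMN and not keep_stats:
            drop.append(name)
        elif name in hash_columns and not keep_hashes:
            drop.append(name)
    return drop


cols = ["text", "__dj__stats__", "__dj__hash__"]
print(write_method_name("parquet"))
print(columns_to_drop(cols, keep_stats=False, keep_hashes=False))
```

With a real Ray dataset `ds`, the result could be applied along the lines of `ds.drop_columns(to_drop)` followed by `getattr(ds, write_method_name(export_type))(export_path)`; how RayExporter actually wires these steps together is not shown on this page.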

Usage

RayExecutor and PartitionedRayExecutor invoke RayExporter automatically at the end of pipeline execution. It can also be used directly to export Ray datasets.

Code Reference

Source Location

  • Repository: data-juicer
  • File: data_juicer/core/ray_exporter.py
  • Lines: L12-274 (class), L187-195 (export method)

Signature

class RayExporter:
    def __init__(
        self,
        export_path,
        export_type=None,
        export_shard_size=0,
        keep_stats_in_res_ds=True,
        keep_hashes_in_res_ds=False,
        **kwargs
    ):
        """
        Args:
            export_path: Output directory path.
            export_type: Format (json, jsonl, parquet, csv, tfrecords, webdataset, lance).
            export_shard_size: Approximate bytes per shard (0 = Ray default).
            keep_stats_in_res_ds: Keep __dj__stats__ columns.
            keep_hashes_in_res_ds: Keep hash columns.
        """

    def export(self, dataset, columns=None) -> None:
        """
        Export a Ray dataset.

        Args:
            dataset: Ray dataset to export.
            columns: Specific columns to include (None = all).
        """

Import

from data_juicer.core.ray_exporter import RayExporter

I/O Contract

Inputs

Name | Type | Required | Description
dataset | ray.data.Dataset | Yes | Ray dataset to export
export_path | str | Yes (init) | Output directory or S3 path
export_type | str | No | Output format (auto-detected from path)
columns | list | No | Columns to include in export
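
The table notes that export_type is auto-detected from the path when it is not given. One plausible way to do this is to look at the path suffix, as in the sketch below. The function name `infer_export_type` and the choice to raise an error when nothing can be inferred are assumptions for illustration; the actual detection rule in RayExporter may differ.

```python
import os

# Formats listed on this page.
SUPPORTED_TYPES = {"json", "jsonl", "parquet", "csv",
                   "tfrecords", "webdataset", "lance"}


def infer_export_type(export_path, export_type=None):
    """Pick the export type: explicit argument first, else the path suffix.

    Hypothetical helper, not part of Data-Juicer.
    """
    if export_type is None:
        # "s3://bucket/out/data.parquet" -> "parquet"; "./output/" -> ""
        suffix = os.path.splitext(export_path.rstrip("/"))[1]
        export_type = suffix.lstrip(".")
    export_type = export_type.lower()
    if export_type not in SUPPORTED_TYPES:
        raise ValueError(
            f"cannot determine a supported export type for {export_path!r}"
        )
    return export_type


print(infer_export_type("s3://bucket/out/data.parquet"))
print(infer_export_type("./output/", export_type="csv"))
```

Under this rule a bare directory such as `./output/` carries no suffix, which is why passing export_type explicitly is the safer choice for directory targets.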

Outputs

Name | Type | Description
sharded files | Files on disk | Exported data in specified format, distributed across shards

Usage Examples

Export Ray Dataset

from data_juicer.core.ray_exporter import RayExporter

exporter = RayExporter(
    export_path='./output/',
    export_type='parquet',
    export_shard_size=256 * 1024 * 1024,  # 256MB per shard
    keep_stats_in_res_ds=False
)
exporter.export(processed_ray_dataset)
# Writes sharded Parquet files under ./output/ (exact file names are chosen by Ray)
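
To get a feel for what a byte budget such as export_shard_size=256 * 1024 * 1024 implies, the arithmetic below estimates a shard count and rows per shard from a dataset's total size and row count. It is a back-of-the-envelope sketch under the assumption that rows are roughly uniform in size; the helper `estimate_shards` is hypothetical and is not how RayExporter is documented to compute its shards.

```python
import math


def estimate_shards(dataset_bytes, num_rows, shard_size):
    """Estimate (num_shards, rows_per_shard) for an approximate byte budget.

    Returns None when shard_size is 0 (leave sharding to Ray's default)
    or when the dataset is empty.
    """
    if shard_size <= 0 or num_rows == 0:
        return None
    num_shards = max(1, math.ceil(dataset_bytes / shard_size))
    # A shard needs at least one row, so cap the count at the row count.
    num_shards = min(num_shards, num_rows)
    rows_per_shard = math.ceil(num_rows / num_shards)
    return num_shards, rows_per_shard


# 1 GiB of data, one million rows, 256 MiB per shard
print(estimate_shards(1024 ** 3, 1_000_000, 256 * 1024 * 1024))  # (4, 250000)
```

Because the shard size is described as approximate, actual file sizes and counts can deviate from this estimate.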
