Implementation:Datajuicer Data juicer RayExporter Export
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Data_Engineering |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
A concrete tool, provided by the Data-Juicer framework, for exporting Ray datasets to sharded files in multiple formats.
Description
The RayExporter class writes Ray datasets to disk using format-specific Ray write methods. It supports JSON, JSONL, Parquet, CSV, TFRecords, WebDataset, and Lance formats. It handles column filtering (removing internal stats/hash columns), shard size configuration, and both local and S3 output paths.
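The column filtering described above can be illustrated with a minimal, framework-free sketch. It is not the class's actual implementation: the helper name `select_export_columns` is hypothetical, and the hash column names in `HASH_COLUMNS` are assumed for illustration. Only `__dj__stats__` is named on this page.

```python
from typing import Iterable, List, Optional

STATS_COLUMN = "__dj__stats__"  # stats column named on this page
# Assumed hash column names, for illustration only.
HASH_COLUMNS = ("__dj__hash__", "__dj__minhash__", "__dj__simhash__")


def select_export_columns(
    all_columns: Iterable[str],
    keep_stats: bool = True,
    keep_hashes: bool = False,
    columns: Optional[List[str]] = None,
) -> List[str]:
    """Return the columns that survive filtering, in their original order."""
    selected = []
    for name in all_columns:
        # An explicit column list, when given, restricts the output first.
        if columns is not None and name not in columns:
            continue
        # Internal stats and hash columns are dropped unless explicitly kept.
        if name == STATS_COLUMN and not keep_stats:
            continue
        if name in HASH_COLUMNS and not keep_hashes:
            continue
        selected.append(name)
    return selected
```

With the defaults shown in the constructor signature (stats kept, hashes dropped), a dataset with columns `text`, `__dj__stats__`, and `__dj__minhash__` would keep only the first two under this sketch.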
Usage
RayExecutor and PartitionedRayExecutor invoke it automatically at the end of pipeline execution. It can also be used directly to export Ray datasets.
Code Reference
Source Location
- Repository: data-juicer
- File: data_juicer/core/ray_exporter.py
- Lines: L12-274 (class), L187-195 (export method)
Signature
class RayExporter:
    def __init__(
        self,
        export_path,
        export_type=None,
        export_shard_size=0,
        keep_stats_in_res_ds=True,
        keep_hashes_in_res_ds=False,
        **kwargs
    ):
        """
        Args:
            export_path: Output directory path.
            export_type: Format (json, jsonl, parquet, csv, tfrecords, webdataset, lance).
            export_shard_size: Approximate bytes per shard (0 = Ray default).
            keep_stats_in_res_ds: Keep __dj__stats__ columns.
            keep_hashes_in_res_ds: Keep hash columns.
        """

    def export(self, dataset, columns=None) -> None:
        """
        Export a Ray dataset.
        Args:
            dataset: Ray dataset to export.
            columns: Specific columns to include (None = all).
        """
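To give a feel for what an approximate byte budget per shard implies, the sketch below estimates how many shards a dataset of a given size would produce. This is plain arithmetic under stated assumptions, not the exporter's actual logic: the helper `estimate_num_shards` is hypothetical, and returning `None` for a non-positive size simply models "defer to Ray's default".

```python
import math
from typing import Optional


def estimate_num_shards(total_bytes: int, export_shard_size: int) -> Optional[int]:
    """Estimate the shard count for a dataset of total_bytes.

    Returns None when export_shard_size is 0 or negative, modelling the
    documented "0 = Ray default" behaviour (Ray decides the sharding).
    """
    if export_shard_size <= 0:
        return None
    # Round up so the last, partially filled shard is counted; never below 1.
    return max(1, math.ceil(total_bytes / export_shard_size))
```

Under this estimate, a 1 GiB dataset with `export_shard_size=256 * 1024 * 1024` would be split into roughly four shards. Because the shard size is documented as approximate, real shard counts may differ.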
Import
from data_juicer.core.ray_exporter import RayExporter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset | ray.data.Dataset | Yes | Ray dataset to export |
| export_path | str | Yes (init) | Output directory or S3 path |
| export_type | str | No | Output format; auto-detected from export_path when not given |
| columns | list | No | Columns to include in export |
Outputs
| Name | Type | Description |
|---|---|---|
| sharded files | Files on disk | Exported data in specified format, distributed across shards |
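The Inputs table notes that `export_type` is auto-detected from the path when it is not given. One plausible way to do that is sketched below. This is an assumption-laden illustration rather than the library's code: the helper `infer_export_type` and its `default` fallback of `jsonl` for extension-less paths are hypothetical, while the list of supported formats is taken from the Description above.

```python
import os
from typing import Optional

# Formats listed in the Description section of this page.
SUPPORTED_FORMATS = (
    "json", "jsonl", "parquet", "csv", "tfrecords", "webdataset", "lance",
)


def infer_export_type(
    export_path: str,
    export_type: Optional[str] = None,
    default: str = "jsonl",  # hypothetical fallback for extension-less paths
) -> str:
    """Resolve the export format from an explicit type or the path suffix."""
    if export_type is None:
        # Strip a trailing slash so directory paths like './output/' work.
        suffix = os.path.splitext(export_path.rstrip("/"))[1]
        export_type = suffix.lstrip(".") or default
    export_type = export_type.lower()
    if export_type not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported export type: {export_type!r}")
    return export_type
```

An explicit `export_type` always wins in this sketch, so the path suffix is consulted only as a fallback.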
Usage Examples
Export Ray Dataset
from data_juicer.core.ray_exporter import RayExporter

exporter = RayExporter(
    export_path='./output/',
    export_type='parquet',
    export_shard_size=256 * 1024 * 1024,  # about 256 MiB per shard
    keep_stats_in_res_ds=False,  # drop the __dj__stats__ columns
)
# processed_ray_dataset is an existing Ray dataset produced earlier in the pipeline.
exporter.export(processed_ray_dataset)
# Creates ./output/part-00000.parquet, part-00001.parquet, ...
Related Pages
Implements Principle
Requires Environment