Implementation:Datajuicer Data juicer Exporter Export
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, ETL |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Concrete tool, provided by the Data-Juicer framework, for exporting processed datasets to files in JSONL, JSON, or Parquet format.
Description
The Exporter class handles serialization of processed datasets to disk. It auto-detects the output format from the file extension or accepts an explicit export_type. It supports JSONL, JSON, and Parquet formats, optional sharding for large datasets, parallel multi-process export, and selective inclusion of statistics/hash columns. It can also export a separate stats file (_stats.jsonl).
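The extension-based detection described above can be pictured with a minimal sketch. The helper `infer_export_type` and the `SUPPORTED` set are hypothetical names used for illustration only, not Data-Juicer's actual implementation; the sketch assumes that an explicit `export_type` takes precedence over the file suffix.

```python
import os

# Hypothetical set of formats, taken from the list in the description above.
SUPPORTED = {"jsonl", "json", "parquet"}


def infer_export_type(export_path, export_type=None):
    """Return the explicit override if given, otherwise the file suffix."""
    if export_type is None:
        # os.path.splitext("a/b.jsonl") returns ("a/b", ".jsonl")
        export_type = os.path.splitext(export_path)[1].lstrip(".")
    export_type = export_type.lower()
    if export_type not in SUPPORTED:
        raise ValueError(f"Unsupported export type: {export_type!r}")
    return export_type


print(infer_export_type("./output/cleaned.jsonl"))                     # jsonl
print(infer_export_type("./output/cleaned.out", export_type="parquet"))  # parquet
```

Rejecting unknown suffixes early, as this sketch does, is one possible design choice; how the real class reacts to an unsupported suffix is not stated here.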
Usage
Use this class at the end of any pipeline to persist results. Instantiate with the export path and options from the config, then call export(dataset).
Code Reference
Source Location
- Repository: data-juicer
- File: data_juicer/core/exporter.py
- Lines: L10-349 (class), L264-271 (export method)
Signature
class Exporter:
    def __init__(
        self,
        export_path,
        export_type=None,
        export_shard_size=0,
        export_in_parallel=True,
        num_proc=1,
        keep_stats_in_res_ds=True,
        keep_hashes_in_res_ds=False,
        export_stats=True,
        **kwargs
    ):
        """
        Args:
            export_path: Output file path (.jsonl, .json, .parquet).
            export_type: Explicit format override.
            export_shard_size: Bytes per shard (0 = single file).
            export_in_parallel: Enable multi-process export.
            num_proc: Number of export workers.
            keep_stats_in_res_ds: Retain __dj__stats__ columns.
            keep_hashes_in_res_ds: Retain hash columns.
            export_stats: Export separate stats file.
        """

    def export(self, dataset) -> None:
        """
        Export the dataset to disk.

        Args:
            dataset: NestedDataset to export.
        """
Import
from data_juicer.core.exporter import Exporter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset | NestedDataset | Yes | Processed dataset to export |
| export_path | str | Yes (init) | Output file path |
| export_shard_size | int | No | Bytes per shard (0 = single file) |
| export_type | str | No | Format override (jsonl, json, parquet) |
Outputs
| Name | Type | Description |
|---|---|---|
| files | Files on disk | Exported dataset in specified format |
| stats_file | File (optional) | _stats.jsonl with per-sample statistics |
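As a rough illustration of where the optional stats file could end up, the sketch below derives a `_stats.jsonl` path from the export path. The helper `stats_path_for` is hypothetical, and the assumption that the stats file sits next to the main export with the same base name is not confirmed by this page, which only names the `_stats.jsonl` suffix.

```python
import os


def stats_path_for(export_path):
    """Hypothetical helper: put `<name>_stats.jsonl` next to the main export file."""
    root, _ext = os.path.splitext(export_path)
    return root + "_stats.jsonl"


print(stats_path_for("./output/cleaned.parquet"))  # ./output/cleaned_stats.jsonl
```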
Usage Examples
Basic Export
from data_juicer.core.exporter import Exporter

exporter = Exporter(
    export_path='./output/cleaned.jsonl',
    export_shard_size=0,
    keep_stats_in_res_ds=False
)
exporter.export(processed_dataset)
# Creates ./output/cleaned.jsonl
Sharded Parquet Export
from data_juicer.core.exporter import Exporter

exporter = Exporter(
    export_path='./output/cleaned.parquet',
    export_shard_size=500 * 1024 * 1024,  # 500 MiB per shard
    export_in_parallel=True,
    num_proc=4
)
exporter.export(processed_dataset)
# Creates ./output/cleaned_0.parquet, cleaned_1.parquet, ...
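To choose a sensible `export_shard_size`, it can help to estimate how many files a byte-based shard size implies. The sketch below is a back-of-the-envelope calculation and not the library's own logic: `estimate_num_shards` is a hypothetical helper, and it assumes the shard count is roughly the dataset size in bytes divided by the shard size, rounded up.

```python
import math


def estimate_num_shards(dataset_nbytes, export_shard_size):
    """Hypothetical helper: number of files implied by a byte-based shard size."""
    if export_shard_size <= 0:
        return 1  # 0 means a single output file
    # Round up so a partially filled final shard still counts as a file.
    return max(1, math.ceil(dataset_nbytes / export_shard_size))


# A 1.2 GiB dataset split into 500 MiB shards
print(estimate_num_shards(int(1.2 * 1024**3), 500 * 1024 * 1024))  # 3
```

The real exporter may size shards differently (for example, by in-memory rather than on-disk bytes), so treat the result as an estimate.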