
Implementation:Data-Juicer Exporter

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, ETL
Last Updated 2026-02-14 17:00 GMT

Overview

A concrete tool, provided by the Data-Juicer framework, for exporting processed datasets to files in a range of formats.

Description

The Exporter class handles serialization of processed datasets to disk. It auto-detects the output format from the file extension or accepts an explicit export_type. It supports JSONL, JSON, and Parquet formats, optional sharding for large datasets, parallel multi-process export, and selective inclusion of statistics/hash columns. It can also export a separate stats file (_stats.jsonl).
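The extension-based auto-detection described above can be sketched as a simple suffix lookup. This is an illustrative approximation, not Data-Juicer's internal code; the helper name `detect_export_type` is hypothetical:

```python
from pathlib import Path

# Formats listed in the description above.
SUPPORTED_FORMATS = {"jsonl", "json", "parquet"}

def detect_export_type(export_path, export_type=None):
    """Sketch: an explicit export_type override wins; otherwise
    the format is taken from the file extension."""
    if export_type is not None:
        return export_type
    suffix = Path(export_path).suffix.lstrip(".").lower()
    if suffix not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported export format: {suffix!r}")
    return suffix

print(detect_export_type("./output/cleaned.jsonl"))         # jsonl (from extension)
print(detect_export_type("./output/cleaned.txt", "jsonl"))  # jsonl (explicit override)
```

The real class performs this resolution internally at construction time; passing `export_type` is only needed when the extension is absent or misleading.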

Usage

Use this class at the end of any pipeline to persist results. Instantiate with the export path and options from the config, then call export(dataset).

Code Reference

Source Location

  • Repository: data-juicer
  • File: data_juicer/core/exporter.py
  • Lines: L10-349 (class), L264-271 (export method)

Signature

class Exporter:
    def __init__(
        self,
        export_path,
        export_type=None,
        export_shard_size=0,
        export_in_parallel=True,
        num_proc=1,
        keep_stats_in_res_ds=True,
        keep_hashes_in_res_ds=False,
        export_stats=True,
        **kwargs
    ):
        """
        Args:
            export_path: Output file path (.jsonl, .json, .parquet).
            export_type: Explicit format override.
            export_shard_size: Bytes per shard (0 = single file).
            export_in_parallel: Enable multi-process export.
            num_proc: Number of export workers.
            keep_stats_in_res_ds: Retain __dj__stats__ columns.
            keep_hashes_in_res_ds: Retain hash columns.
            export_stats: Export separate stats file.
        """

    def export(self, dataset) -> None:
        """
        Export the dataset to disk.

        Args:
            dataset: NestedDataset to export.
        """

Import

from data_juicer.core.exporter import Exporter

I/O Contract

Inputs

Name Type Required Description
dataset NestedDataset Yes Processed dataset to export
export_path str Yes (init) Output file path
export_shard_size int No Bytes per shard (0 = single file)
export_type str No Format override (jsonl, json, parquet)

Outputs

Name Type Description
files Files on disk Exported dataset in specified format
stats_file File (optional) _stats.jsonl with per-sample statistics
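The optional stats file is JSON Lines, one object per sample. A minimal sketch of writing and reading such a file, assuming the per-sample objects live under the `__dj__stats__` key named in the signature above (the `text_len` field is invented for illustration):

```python
import json
import os
import tempfile

# Stand-in stats rows: one JSON object per sample, JSONL layout.
sample_stats = [
    {"__dj__stats__": {"text_len": 120}},  # "text_len" is a made-up example stat
    {"__dj__stats__": {"text_len": 87}},
]

path = os.path.join(tempfile.mkdtemp(), "cleaned_stats.jsonl")
with open(path, "w") as f:
    for row in sample_stats:
        f.write(json.dumps(row) + "\n")

# Read it back line by line, as any JSONL consumer would.
with open(path) as f:
    rows = [json.loads(line) for line in f]
print(len(rows))  # 2
```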

Usage Examples

Basic Export

from data_juicer.core.exporter import Exporter

exporter = Exporter(
    export_path='./output/cleaned.jsonl',
    export_shard_size=0,
    keep_stats_in_res_ds=False
)
exporter.export(processed_dataset)
# Creates ./output/cleaned.jsonl

Sharded Parquet Export

from data_juicer.core.exporter import Exporter

exporter = Exporter(
    export_path='./output/cleaned.parquet',
    export_shard_size=500 * 1024 * 1024,  # 500MB per shard
    export_in_parallel=True,
    num_proc=4
)
exporter.export(processed_dataset)
# Creates ./output/cleaned_0.parquet, cleaned_1.parquet, ...
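With sharding enabled, the number of output files follows from simple arithmetic: roughly ceil(total_bytes / export_shard_size). A quick sanity check for the 500 MB configuration above (approximate, since actual shard boundaries are decided by the library):

```python
import math

total_bytes = 1_800 * 1024 * 1024  # e.g. a 1.8 GB processed dataset
shard_size = 500 * 1024 * 1024     # 500 MB per shard, as configured above

num_shards = math.ceil(total_bytes / shard_size)
print(num_shards)  # 4 -> cleaned_0.parquet ... cleaned_3.parquet
```

Setting `export_shard_size=0` (the default) skips this entirely and writes a single file.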

Related Pages

Implements Principle

Requires Environment
