
Implementation:Data-Juicer Exporter

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, ETL
Last Updated 2026-02-14 17:00 GMT

Overview

A concrete tool, provided by the Data-Juicer framework, for exporting processed datasets to files in a range of formats.

Description

The Exporter class handles serialization of processed datasets to disk. It auto-detects the output format from the file extension or accepts an explicit export_type. It supports JSONL, JSON, and Parquet formats, optional sharding for large datasets, parallel multi-process export, and selective inclusion of statistics/hash columns. It can also export a separate stats file (_stats.jsonl).
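The extension-based auto-detection described above can be sketched as a simple suffix lookup. This is an illustrative approximation, not Data-Juicer's internal code; the helper name `detect_export_type` is hypothetical:

```python
from pathlib import Path

# Formats listed in the description above.
SUPPORTED_FORMATS = {"jsonl", "json", "parquet"}

def detect_export_type(export_path, export_type=None):
    """Sketch: an explicit export_type override wins; otherwise
    the format is taken from the file extension."""
    if export_type is not None:
        return export_type
    suffix = Path(export_path).suffix.lstrip(".").lower()
    if suffix not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported export format: {suffix!r}")
    return suffix

print(detect_export_type("./output/cleaned.jsonl"))         # jsonl (from extension)
print(detect_export_type("./output/cleaned.txt", "jsonl"))  # jsonl (explicit override)
```

The real class performs this resolution internally at construction time; passing `export_type` is only needed when the extension is absent or misleading.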

Usage

Use this class at the end of any pipeline to persist results. Instantiate with the export path and options from the config, then call export(dataset).

Code Reference

Source Location

  • Repository: data-juicer
  • File: data_juicer/core/exporter.py
  • Lines: L10-349 (class), L264-271 (export method)

Signature

class Exporter:
    def __init__(
        self,
        export_path,
        export_type=None,
        export_shard_size=0,
        export_in_parallel=True,
        num_proc=1,
        keep_stats_in_res_ds=True,
        keep_hashes_in_res_ds=False,
        export_stats=True,
        **kwargs
    ):
        """
        Args:
            export_path: Output file path (.jsonl, .json, .parquet).
            export_type: Explicit format override.
            export_shard_size: Bytes per shard (0 = single file).
            export_in_parallel: Enable multi-process export.
            num_proc: Number of export workers.
            keep_stats_in_res_ds: Retain __dj__stats__ columns.
            keep_hashes_in_res_ds: Retain hash columns.
            export_stats: Export separate stats file.
        """

    def export(self, dataset) -> None:
        """
        Export the dataset to disk.

        Args:
            dataset: NestedDataset to export.
        """

Import

from data_juicer.core.exporter import Exporter

I/O Contract

Inputs

Name Type Required Description
dataset NestedDataset Yes Processed dataset to export
export_path str Yes (init) Output file path
export_shard_size int No Bytes per shard (0 = single file)
export_type str No Format override (jsonl, json, parquet)

Outputs

Name Type Description
files Files on disk Exported dataset in specified format
stats_file File (optional) _stats.jsonl with per-sample statistics
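The optional stats file is JSON Lines, one object per sample. A minimal sketch of writing and reading such a file, assuming the per-sample objects live under the `__dj__stats__` key named in the signature above (the `text_len` field is invented for illustration):

```python
import json
import os
import tempfile

# Stand-in stats rows: one JSON object per sample, JSONL layout.
sample_stats = [
    {"__dj__stats__": {"text_len": 120}},  # "text_len" is a made-up example stat
    {"__dj__stats__": {"text_len": 87}},
]

path = os.path.join(tempfile.mkdtemp(), "cleaned_stats.jsonl")
with open(path, "w") as f:
    for row in sample_stats:
        f.write(json.dumps(row) + "\n")

# Read it back line by line, as any JSONL consumer would.
with open(path) as f:
    rows = [json.loads(line) for line in f]
print(len(rows))  # 2
```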

Usage Examples

Basic Export

from data_juicer.core.exporter import Exporter

exporter = Exporter(
    export_path='./output/cleaned.jsonl',
    export_shard_size=0,
    keep_stats_in_res_ds=False
)
exporter.export(processed_dataset)
# Creates ./output/cleaned.jsonl

Sharded Parquet Export

from data_juicer.core.exporter import Exporter

exporter = Exporter(
    export_path='./output/cleaned.parquet',
    export_shard_size=500 * 1024 * 1024,  # 500MB per shard
    export_in_parallel=True,
    num_proc=4
)
exporter.export(processed_dataset)
# Creates ./output/cleaned_0.parquet, cleaned_1.parquet, ...
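With sharding enabled, the number of output files follows from simple arithmetic: roughly ceil(total_bytes / export_shard_size). A quick sanity check for the 500 MB configuration above (approximate, since actual shard boundaries are decided by the library):

```python
import math

total_bytes = 1_800 * 1024 * 1024  # e.g. a 1.8 GB processed dataset
shard_size = 500 * 1024 * 1024     # 500 MB per shard, as configured above

num_shards = math.ceil(total_bytes / shard_size)
print(num_shards)  # 4 -> cleaned_0.parquet ... cleaned_3.parquet
```

Setting `export_shard_size=0` (the default) skips this entirely and writes a single file.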

Related Pages

Implements Principle

Requires Environment
