Principle: Data-Juicer Data Export
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, ETL |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
A format-aware serialization pattern that writes processed datasets to disk in configurable output formats with optional sharding.
Description
Data Export converts an in-memory processed dataset into persistent file(s) on disk. It supports multiple output formats (JSONL, JSON, Parquet), optional sharding for large datasets, parallel export, and selective column inclusion/exclusion (e.g., stripping internal statistics columns). This is the final step in most pipelines, producing the output artifact that downstream consumers (model training, further processing) will use.
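The column-stripping and JSONL serialization steps can be sketched in plain Python. This is a minimal illustration, not Data-Juicer's actual implementation; the column names `__dj__stats__` and `__dj__hash__` are assumed placeholders for internal bookkeeping fields.

```python
import json

def strip_columns(rows, keep_stats=False, keep_hashes=False):
    """Drop internal bookkeeping columns unless the caller opts to keep them."""
    drop = set()
    if not keep_stats:
        drop.add("__dj__stats__")   # assumed name of the internal stats column
    if not keep_hashes:
        drop.add("__dj__hash__")    # assumed name of the internal hash column
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]

def export_jsonl(rows, path):
    """Write one JSON object per line (the JSONL format)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

Downstream consumers then see only the payload columns (e.g. `text`), with processing metadata removed.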
Usage
Use this principle as the final step of any Data-Juicer pipeline to persist processed results. It applies both to standard single-machine pipelines and as the stats export step in analysis workflows.
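In configuration-driven pipelines this step is typically controlled by a few export settings. The YAML below is an illustrative sketch only; the key names are assumptions, not verified Data-Juicer configuration options.

```yaml
# Hypothetical export settings (key names are illustrative)
export_path: ./outputs/processed.jsonl   # extension selects the format
export_shard_size: 100000                # rows per shard; 0 disables sharding
keep_stats: false                        # drop internal stats columns on export
keep_hashes: false                       # drop internal hash columns on export
```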
Theoretical Basis
```python
# Abstract algorithm (NOT real implementation)
# 1. Determine output format from path extension or config
format = detect_format(export_path)
# 2. Optionally strip internal columns
dataset = strip_columns(dataset, keep_stats, keep_hashes)
# 3. Export with optional sharding
if shard_size > 0:
    for shard in split(dataset, shard_size):
        write(shard, format, export_path)
else:
    write(dataset, format, export_path)
```
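The abstract algorithm above can be realized as a runnable sketch using only the standard library. This is a simplified stand-in, not Data-Juicer's exporter: the shard-file naming scheme is an assumption, and the Parquet branch is omitted since it would require an external library such as pyarrow.

```python
import json
import os

def detect_format(export_path):
    """Step 1: infer the output format from the file extension."""
    ext = os.path.splitext(export_path)[1].lstrip(".").lower()
    if ext not in {"jsonl", "json", "parquet"}:
        raise ValueError(f"unsupported export format: {ext!r}")
    return ext

def split(rows, shard_size):
    """Step 3 helper: yield consecutive shards of at most shard_size rows."""
    for i in range(0, len(rows), shard_size):
        yield rows[i:i + shard_size]

def write(rows, fmt, path):
    """Write one shard to disk (Parquet omitted; it needs pyarrow)."""
    with open(path, "w", encoding="utf-8") as f:
        if fmt == "jsonl":
            for row in rows:
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
        elif fmt == "json":
            json.dump(rows, f, ensure_ascii=False)
        else:
            raise NotImplementedError(f"format not handled in this sketch: {fmt}")

def export(rows, export_path, shard_size=0):
    """Drive the whole pattern: detect format, then write sharded or whole."""
    fmt = detect_format(export_path)
    if shard_size > 0:
        base, ext = os.path.splitext(export_path)
        for idx, shard in enumerate(split(rows, shard_size)):
            write(shard, fmt, f"{base}-{idx:05d}{ext}")  # numbered shard files
    else:
        write(rows, fmt, export_path)
```

With `shard_size=0` the whole dataset lands in a single file; a positive shard size produces a numbered series of files, which keeps individual outputs small enough for parallel downstream loading.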