Principle: Data-Juicer Data Export
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, ETL |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
A format-aware serialization pattern that writes processed datasets to disk in configurable output formats with optional sharding.
Description
Data Export converts an in-memory processed dataset into persistent file(s) on disk. It supports multiple output formats (JSONL, JSON, Parquet), optional sharding for large datasets, parallel export, and selective column inclusion/exclusion (e.g., stripping internal statistics columns). This is the final step in most pipelines, producing the output artifact that downstream consumers (model training, further processing) will use.
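The column-stripping and JSONL serialization steps can be sketched in plain Python. This is a minimal illustration, not Data-Juicer's actual implementation; the column names `__dj__stats__` and `__dj__hash__` are assumed placeholders for internal bookkeeping fields.

```python
import json

def strip_columns(rows, keep_stats=False, keep_hashes=False):
    """Drop internal bookkeeping columns unless the caller opts to keep them."""
    drop = set()
    if not keep_stats:
        drop.add("__dj__stats__")   # assumed name of the internal stats column
    if not keep_hashes:
        drop.add("__dj__hash__")    # assumed name of the internal hash column
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]

def export_jsonl(rows, path):
    """Write one JSON object per line (the JSONL format)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

Downstream consumers then see only the payload columns (e.g. `text`), with processing metadata removed.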
Usage
Use this principle as the final step of any Data-Juicer pipeline to persist processed results. It applies both to standard single-machine pipelines and as the stats export step in analysis workflows.
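In configuration-driven pipelines this step is typically controlled by a few export settings. The YAML below is an illustrative sketch only; the key names are assumptions, not verified Data-Juicer configuration options.

```yaml
# Hypothetical export settings (key names are illustrative)
export_path: ./outputs/processed.jsonl   # extension selects the format
export_shard_size: 100000                # rows per shard; 0 disables sharding
keep_stats: false                        # drop internal stats columns on export
keep_hashes: false                       # drop internal hash columns on export
```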
Theoretical Basis
```python
# Abstract algorithm (NOT real implementation)
# 1. Determine output format from path extension or config
format = detect_format(export_path)
# 2. Optionally strip internal columns
dataset = strip_columns(dataset, keep_stats, keep_hashes)
# 3. Export with optional sharding
if shard_size > 0:
    for shard in split(dataset, shard_size):
        write(shard, format, export_path)
else:
    write(dataset, format, export_path)
```
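The abstract algorithm above can be realized as a runnable sketch using only the standard library. This is a simplified stand-in, not Data-Juicer's exporter: the shard-file naming scheme is an assumption, and the Parquet branch is omitted since it would require an external library such as pyarrow.

```python
import json
import os

def detect_format(export_path):
    """Step 1: infer the output format from the file extension."""
    ext = os.path.splitext(export_path)[1].lstrip(".").lower()
    if ext not in {"jsonl", "json", "parquet"}:
        raise ValueError(f"unsupported export format: {ext!r}")
    return ext

def split(rows, shard_size):
    """Step 3 helper: yield consecutive shards of at most shard_size rows."""
    for i in range(0, len(rows), shard_size):
        yield rows[i:i + shard_size]

def write(rows, fmt, path):
    """Write one shard to disk (Parquet omitted; it needs pyarrow)."""
    with open(path, "w", encoding="utf-8") as f:
        if fmt == "jsonl":
            for row in rows:
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
        elif fmt == "json":
            json.dump(rows, f, ensure_ascii=False)
        else:
            raise NotImplementedError(f"format not handled in this sketch: {fmt}")

def export(rows, export_path, shard_size=0):
    """Drive the whole pattern: detect format, then write sharded or whole."""
    fmt = detect_format(export_path)
    if shard_size > 0:
        base, ext = os.path.splitext(export_path)
        for idx, shard in enumerate(split(rows, shard_size)):
            write(shard, fmt, f"{base}-{idx:05d}{ext}")  # numbered shard files
    else:
        write(rows, fmt, export_path)
```

With `shard_size=0` the whole dataset lands in a single file; a positive shard size produces a numbered series of files, which keeps individual outputs small enough for parallel downstream loading.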