Principle: NVIDIA NeMo Curator Data Export
| Knowledge Sources | |
|---|---|
| Domains | Data_Curation, NLP, Data_Engineering |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
A technique for serializing curated text datasets into storage-efficient formats optimized for consumption by downstream model training.
Description
Data Export is the final step in a text curation pipeline where processed, filtered, and deduplicated documents are written to persistent storage in formats suitable for training. NeMo Curator supports three export formats: Parquet (columnar storage, efficient for analytical queries), JSONL (line-delimited JSON, human-readable), and Megatron Tokenizer (pre-tokenized binary format with .bin/.idx files for direct consumption by Megatron-LM). The choice of format depends on the downstream training framework and storage/compute constraints.
Usage
Use Parquet export for general-purpose storage and interchange. Use JSONL for human-readable inspection and debugging. Use Megatron Tokenizer format when training with Megatron-LM to avoid tokenization overhead at training time.
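The idea behind the pre-tokenized format, namely storing token IDs in a flat binary file plus an offsets index, then memory-mapping both at training time, can be sketched with numpy. Everything below (file names, dtypes, the index layout, the sample token IDs) is a simplified illustration of the concept, not the actual Megatron-LM `.bin`/`.idx` on-disk format:

```python
import tempfile
from pathlib import Path

import numpy as np

out = Path(tempfile.mkdtemp())
bin_path, idx_path = out / "corpus.bin", out / "corpus.idx"

# Hypothetical pre-tokenized documents (token IDs produced offline at export time).
docs = [np.array([101, 7592, 102], dtype=np.int32),
        np.array([101, 2088, 999, 102], dtype=np.int32)]

# .bin: all token IDs concatenated; .idx: cumulative end offset of each document.
np.concatenate(docs).tofile(bin_path)
offsets = np.cumsum([len(d) for d in docs]).astype(np.int64)
offsets.tofile(idx_path)

# At training time, memory-map instead of loading the corpus or re-tokenizing.
tokens = np.memmap(bin_path, dtype=np.int32, mode="r")
idx = np.memmap(idx_path, dtype=np.int64, mode="r")

def get_doc(i: int) -> np.ndarray:
    """Slice document i out of the flat token stream via the index."""
    start = 0 if i == 0 else int(idx[i - 1])
    return tokens[start:int(idx[i])]

assert get_doc(1).tolist() == [101, 2088, 999, 102]
```

Because the arrays are memory-mapped, random access to any document costs two index lookups and a slice, with no tokenizer on the training hot path.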
Theoretical Basis
Export format selection criteria:
- Parquet: Best for columnar queries, compression, and schema evolution. Supports predicate pushdown for efficient reading.
- JSONL: Best for streaming, human readability, and tool interoperability. One JSON object per line.
- Megatron Binary: Pre-tokenized format with memory-mapped index files. Eliminates tokenization CPU overhead during training.