Principle: NVIDIA NeMo Curator Data Export
| Knowledge Sources | |
|---|---|
| Domains | Data_Curation, NLP, Data_Engineering |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
A technique for serializing curated text datasets into storage-efficient formats optimized for consumption by downstream model training.
Description
Data Export is the final step in a text curation pipeline where processed, filtered, and deduplicated documents are written to persistent storage in formats suitable for training. NeMo Curator supports three export formats: Parquet (columnar storage, efficient for analytical queries), JSONL (line-delimited JSON, human-readable), and Megatron Tokenizer (pre-tokenized binary format with .bin/.idx files for direct consumption by Megatron-LM). The choice of format depends on the downstream training framework and storage/compute constraints.
Usage
Use Parquet export for general-purpose storage and interchange. Use JSONL for human-readable inspection and debugging. Use Megatron Tokenizer format when training with Megatron-LM to avoid tokenization overhead at training time.
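The idea behind the pre-tokenized format, namely storing token IDs in a flat binary file plus an offsets index, then memory-mapping both at training time, can be sketched with numpy. Everything below (file names, dtypes, the index layout, the sample token IDs) is a simplified illustration of the concept, not the actual Megatron-LM `.bin`/`.idx` on-disk format:

```python
import tempfile
from pathlib import Path

import numpy as np

out = Path(tempfile.mkdtemp())
bin_path, idx_path = out / "corpus.bin", out / "corpus.idx"

# Hypothetical pre-tokenized documents (token IDs produced offline at export time).
docs = [np.array([101, 7592, 102], dtype=np.int32),
        np.array([101, 2088, 999, 102], dtype=np.int32)]

# .bin: all token IDs concatenated; .idx: cumulative end offset of each document.
np.concatenate(docs).tofile(bin_path)
offsets = np.cumsum([len(d) for d in docs]).astype(np.int64)
offsets.tofile(idx_path)

# At training time, memory-map instead of loading the corpus or re-tokenizing.
tokens = np.memmap(bin_path, dtype=np.int32, mode="r")
idx = np.memmap(idx_path, dtype=np.int64, mode="r")

def get_doc(i: int) -> np.ndarray:
    """Slice document i out of the flat token stream via the index."""
    start = 0 if i == 0 else int(idx[i - 1])
    return tokens[start:int(idx[i])]

assert get_doc(1).tolist() == [101, 2088, 999, 102]
```

Because the arrays are memory-mapped, random access to any document costs two index lookups and a slice, with no tokenizer on the training hot path.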
Theoretical Basis
Export format selection criteria:
- Parquet: Best for columnar queries, compression, and schema evolution. Supports predicate pushdown for efficient reading.
- JSONL: Best for streaming, human readability, and tool interoperability. One JSON object per line.
- Megatron Binary: Pre-tokenized format with memory-mapped index files. Eliminates tokenization CPU overhead during training.