Principle:NVIDIA NeMo Curator Data Export

From Leeroopedia
Knowledge Sources
Domains: Data_Curation, NLP, Data_Engineering
Last Updated: 2026-02-14 17:00 GMT

Overview

Data Export is a technique for serializing curated text datasets into on-disk formats that downstream model training can consume efficiently.

Description

Data Export is the final step in a text curation pipeline: documents that have been processed, filtered, and deduplicated are written to persistent storage in a format suitable for training. NeMo Curator supports three export formats: Parquet (columnar storage, efficient for analytical queries), JSONL (line-delimited JSON, human-readable), and Megatron Tokenizer (a pre-tokenized binary format whose .bin/.idx files Megatron-LM consumes directly). The right format depends on the downstream training framework and on storage and compute constraints.

Usage

Use Parquet export for general-purpose storage and interchange. Use JSONL for human-readable inspection and debugging. Use Megatron Tokenizer format when training with Megatron-LM to avoid tokenization overhead at training time.
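As a rough illustration of the JSONL case, the sketch below writes and re-reads a small dataset using only the Python standard library. It does not use NeMo Curator's own writer classes, whose API is not shown on this page. The helper names `write_jsonl` and `read_jsonl`, the record fields, and the output path are all hypothetical.

```python
import json
import os
import tempfile


def write_jsonl(docs, path):
    """Write one JSON object per line; return the number of records written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for doc in docs:
            # json.dumps escapes embedded newlines, so each record stays on one line.
            # ensure_ascii=False keeps non-ASCII text readable in the file.
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl(path):
    """Stream records back one line at a time without loading the whole file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Hypothetical curated records; real pipelines would carry more metadata fields.
docs = [
    {"id": "doc-0", "text": "First curated document."},
    {"id": "doc-1", "text": "Second document, with a newline\ninside."},
]

out_path = os.path.join(tempfile.mkdtemp(), "curated.jsonl")
written = write_jsonl(docs, out_path)
restored = list(read_jsonl(out_path))
```

Because every record occupies exactly one line, the file can be inspected with ordinary line-oriented tools, which is what makes this format convenient for debugging.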

Theoretical Basis

Export format selection criteria:

  • Parquet: Best for columnar queries, compression, and schema evolution. Supports predicate pushdown for efficient reading.
  • JSONL: Best for streaming, human readability, and tool interoperability. One JSON object per line.
  • Megatron Tokenizer (binary): Pre-tokenized .bin/.idx format with memory-mapped index files. Tokenization is done once at export time, which eliminates tokenization CPU overhead during training.
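To make the pre-tokenized idea concrete, here is a deliberately simplified sketch of a token file paired with an offset index. It is not the actual Megatron-LM .bin/.idx layout: the on-disk layout (little-endian uint32 tokens, uint64 offsets), the toy whitespace tokenizer, and the function names are all assumptions chosen for illustration.

```python
import mmap
import os
import struct
import tempfile


def toy_tokenize(text, vocab):
    """Toy whitespace tokenizer (hypothetical): assigns IDs in order of first appearance."""
    return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]


def export_tokenized(docs_token_ids, prefix):
    """Write token IDs to <prefix>.bin and document boundaries to <prefix>.idx."""
    offsets = [0]  # offsets[i] is the position of the first token of document i
    with open(prefix + ".bin", "wb") as bin_f:
        for ids in docs_token_ids:
            # Assumed layout: each token is a little-endian unsigned 32-bit integer.
            bin_f.write(struct.pack(f"<{len(ids)}I", *ids))
            offsets.append(offsets[-1] + len(ids))
    with open(prefix + ".idx", "wb") as idx_f:
        # Assumed layout: offsets stored as little-endian unsigned 64-bit integers.
        idx_f.write(struct.pack(f"<{len(offsets)}Q", *offsets))


def read_document(prefix, doc_index):
    """Return the token IDs of one document by memory-mapping the .bin file."""
    with open(prefix + ".idx", "rb") as idx_f:
        raw = idx_f.read()
    offsets = struct.unpack(f"<{len(raw) // 8}Q", raw)
    start, end = offsets[doc_index], offsets[doc_index + 1]
    with open(prefix + ".bin", "rb") as bin_f:
        with mmap.mmap(bin_f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Only the requested slice is decoded; no tokenizer runs at read time.
            return list(struct.unpack_from(f"<{end - start}I", mm, start * 4))


vocab = {}
texts = ["the cat sat", "the dog sat down"]
token_ids = [toy_tokenize(t, vocab) for t in texts]

prefix = os.path.join(tempfile.mkdtemp(), "curated_text")
export_tokenized(token_ids, prefix)
second_doc = read_document(prefix, 1)
```

The point of the design is that the reader only needs the index and a byte range: the cost of tokenization is paid once at export time, and a training loop can fetch any document without parsing text.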
