Principle:Huggingface Datatrove Megatron Format Tokenization
| Knowledge Sources | |
|---|---|
| Domains | Tokenization, Deep Learning Infrastructure |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Megatron format tokenization is the process of converting text documents into NVIDIA Megatron-LM's binary indexed dataset format, consisting of paired `.bin` and `.idx` files that enable efficient random-access data loading during large-scale language model training.
Description
NVIDIA's Megatron-LM framework uses a specialized binary dataset format for efficient data loading during distributed training. This format separates the token data (stored in a `.bin` file as a flat array of token IDs) from the metadata (stored in a `.idx` file with structured binary headers). The index file enables the training framework to quickly locate any document or sequence within the binary data file without scanning the entire corpus.
The core idea is to separate data from metadata for performance. Token IDs are stored contiguously in the `.bin` file, and the index holds a precomputed byte offset for every sequence. The training dataloader can therefore jump straight to any sequence without reading its neighbors, which is critical for efficient shuffling and sharding across thousands of GPUs.
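The offset-based lookup described above can be sketched with the standard library alone. This is an illustrative model, not Datatrove's or Megatron-LM's implementation: the helper names `pack_sequences` and `read_sequence` are hypothetical, and the sketch assumes little-endian uint16 token IDs held in an in-memory buffer rather than a memory-mapped file.

```python
import struct

TOKEN_SIZE = 2  # bytes per token; assumes uint16 token IDs


def pack_sequences(sequences):
    """Concatenate sequences into one flat buffer and record where each starts."""
    data = bytearray()
    lengths, offsets = [], []
    for seq in sequences:
        offsets.append(len(data))   # byte offset of this sequence in the buffer
        lengths.append(len(seq))    # sequence length, counted in tokens
        data += struct.pack(f"<{len(seq)}H", *seq)
    return bytes(data), lengths, offsets


def read_sequence(data, lengths, offsets, i):
    """Fetch sequence i from its offset and length, without scanning the rest."""
    start = offsets[i]
    end = start + lengths[i] * TOKEN_SIZE
    return list(struct.unpack(f"<{lengths[i]}H", data[start:end]))


sequences = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
data, lengths, offsets = pack_sequences(sequences)
third = read_sequence(data, lengths, offsets, 2)
```

The `data` buffer plays the role of the `.bin` file, while `lengths` and `offsets` play the role of the arrays stored in the `.idx` file.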
Usage
Apply Megatron format tokenization when preparing training data for NVIDIA Megatron-LM or compatible frameworks (such as NeMo). This format is the standard for large-scale distributed language model pre-training on NVIDIA hardware.
Theoretical Basis
The Megatron indexed dataset format is designed around several key principles:
Binary efficiency: Token IDs are stored as fixed-width integers (typically 2-byte uint16 or 4-byte int32), enabling direct memory mapping and eliminating parsing overhead. The dtype is encoded in the index file so the reader knows how to interpret the binary data.
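As a rough illustration of the dtype choice, the sketch below picks the narrowest integer type that can hold every token ID and encodes tokens at that width. The function names are hypothetical, the cutoff is a simple assumption based on the uint16 range, and the numeric dtype codes are what Megatron-LM's index reader is believed to use; check them against the version you target.

```python
import struct

# Assumed dtype codes (believed to match Megatron-LM; verify for your version).
DTYPE_CODES = {"uint16": 8, "int32": 4}
# struct format characters for each dtype, used with "<" (little-endian, no padding).
STRUCT_FORMATS = {"uint16": "H", "int32": "i"}


def pick_dtype(vocab_size):
    """Choose the narrowest dtype that can hold IDs 0 .. vocab_size - 1."""
    # uint16 covers 65536 distinct values; this cutoff is an assumption.
    return "uint16" if vocab_size <= 65536 else "int32"


def encode_tokens(token_ids, dtype):
    """Serialize token IDs as fixed-width little-endian integers."""
    return struct.pack(f"<{len(token_ids)}{STRUCT_FORMATS[dtype]}", *token_ids)


small = encode_tokens([1, 2, 3], pick_dtype(50257))    # 2 bytes per token
large = encode_tokens([1, 2, 3], pick_dtype(128256))   # 4 bytes per token
```

Halving the token width halves the size of the `.bin` file, which is why a smaller dtype is preferred whenever the vocabulary allows it.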
Structured index layout: The `.idx` file follows a fixed binary structure. It begins with a 34-byte header: a 9-byte magic string (`MMIDIDX\x00\x00`), an 8-byte version number, a 1-byte dtype code, an 8-byte sequence count, and an 8-byte document count. Three arrays follow: per-sequence lengths (4 bytes each), per-sequence byte offsets into the `.bin` file (8 bytes each), and per-document sequence indices (8 bytes each). Because every field and array element has a fixed width, the position of any sequence's entry can be computed directly, giving O(1) access to that sequence's location.
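The layout can be made concrete with a small round-trip sketch using Python's `struct` module. The functions `build_index` and `parse_index` are hypothetical helpers written for this page. The sketch assumes little-endian fields, a version value of 1, and a document index array that starts at 0 and ends at the total sequence count; treat these as assumptions to check against the reader you use.

```python
import struct

MAGIC = b"MMIDIDX\x00\x00"   # 9-byte magic header
HEADER_FMT = "<QBQQ"         # version, dtype code, sequence count, document count


def build_index(dtype_code, seq_lengths, token_size, doc_indices):
    """Serialize an index buffer following the layout described above."""
    offsets, pos = [], 0
    for n in seq_lengths:
        offsets.append(pos)          # byte offset of each sequence in the .bin file
        pos += n * token_size
    out = MAGIC
    out += struct.pack(HEADER_FMT, 1, dtype_code, len(seq_lengths), len(doc_indices))
    out += struct.pack(f"<{len(seq_lengths)}i", *seq_lengths)   # 4 bytes each
    out += struct.pack(f"<{len(offsets)}q", *offsets)           # 8 bytes each
    out += struct.pack(f"<{len(doc_indices)}q", *doc_indices)   # 8 bytes each
    return out


def parse_index(buf):
    """Read the header and the three arrays back out of an index buffer."""
    if buf[:len(MAGIC)] != MAGIC:
        raise ValueError("not a Megatron index buffer")
    pos = len(MAGIC)
    version, dtype_code, n_seq, n_doc = struct.unpack_from(HEADER_FMT, buf, pos)
    pos += struct.calcsize(HEADER_FMT)
    lengths = list(struct.unpack_from(f"<{n_seq}i", buf, pos))
    pos += 4 * n_seq
    offsets = list(struct.unpack_from(f"<{n_seq}q", buf, pos))
    pos += 8 * n_seq
    doc_indices = list(struct.unpack_from(f"<{n_doc}q", buf, pos))
    return {"version": version, "dtype_code": dtype_code, "lengths": lengths,
            "offsets": offsets, "doc_indices": doc_indices}


idx = build_index(dtype_code=8, seq_lengths=[3, 2, 4], token_size=2,
                  doc_indices=[0, 2, 3])
parsed = parse_index(idx)
```

With three sequences and three document entries, the buffer is 34 header bytes plus 12, 24, and 24 bytes for the three arrays.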
Document boundary tracking: The format maintains explicit document indices that map sequences back to their source documents. This is essential for training techniques that respect document boundaries, such as preventing attention from crossing document boundaries.
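A short sketch shows how document indices can be used in both directions. It assumes the convention that entry `d` of the array holds the index of the first sequence of document `d`, with one trailing entry equal to the total number of sequences; both function names are hypothetical.

```python
from bisect import bisect_right


def sequences_in_document(doc_indices, d):
    """Return the range of sequence indices that belong to document d."""
    return range(doc_indices[d], doc_indices[d + 1])


def document_of_sequence(doc_indices, i):
    """Find the document containing sequence i by binary search."""
    return bisect_right(doc_indices, i) - 1


# Three documents holding 2, 1, and 3 sequences respectively.
doc_indices = [0, 2, 3, 6]
first_doc = list(sequences_in_document(doc_indices, 0))
owner_of_seq_4 = document_of_sequence(doc_indices, 4)
```

A dataloader that must not mix documents can use the forward lookup to gather one document's sequences, or the reverse lookup to detect when two adjacent sequences come from different documents.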
Scalability: The format supports very large corpora (billions of tokens) across multiple sharded files, with each shard independently addressable. Because byte offsets and counts are 64-bit, the format does not impose a practical size limit on individual files. The per-sequence length is the only 32-bit field, so a single sequence is limited to what 4 bytes can represent.
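The size arithmetic behind this point can be checked with a few lines of Python. The shard size below is a made-up example and the function names are hypothetical; the field widths are those listed in the index layout.

```python
HEADER_BYTES = 9 + 8 + 1 + 8 + 8   # magic + version + dtype code + two counts


def index_size_bytes(num_sequences, num_doc_entries):
    """Size of the .idx file: header, lengths and offsets arrays, document array."""
    return HEADER_BYTES + num_sequences * (4 + 8) + num_doc_entries * 8


def data_size_bytes(num_tokens, token_size):
    """Size of the .bin file: one fixed-width integer per token."""
    return num_tokens * token_size


# Hypothetical shard of 10 billion uint16 tokens.
bin_bytes = data_size_bytes(10_000_000_000, 2)

# Offsets into a file this large do not fit in 32 bits, hence 64-bit offsets.
exceeds_32bit = bin_bytes > 2**32 - 1
```

Even at this scale the index stays small relative to the data, since it grows with the number of sequences and documents rather than with the number of tokens.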