Implementation:Huggingface Datatrove MegatronTokenizer
| Knowledge Sources | |
|---|---|
| Domains | Tokenization, Data Processing |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
MegatronDocumentTokenizer is a pipeline step that tokenizes documents using HuggingFace fast tokenizers and writes the output in NVIDIA Megatron-LM's binary indexed format (`.bin` and `.idx` files).
Description
This module provides two main classes: MegatronTokenizedFile and MegatronDocumentTokenizer. The MegatronTokenizedFile class manages writing tokenized data into Megatron-LM's binary format, producing a `.bin` file containing raw token data and a `.idx` file containing structured metadata including sequence lengths, byte offsets, and document indices. The index file follows a specific binary layout with a 9-byte header (`MMIDIDX\x00\x00`), version information, dtype code, sequence count, document count, and per-sequence metadata.
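The index layout described above can be sketched with the standard library alone. This is an illustration, not the code in `MegatronTokenizedFile`: the function name `build_index_bytes`, the field widths (little-endian 64-bit counts, int32 lengths, int64 offsets and document indices), and the dtype codes are assumptions based on the format Megatron-LM's indexed dataset reader is understood to expect, and should be checked against the Megatron-LM source before use.

```python
import struct

INDEX_HEADER = b"MMIDIDX\x00\x00"  # 9-byte magic string named in the description

# Assumed dtype codes and token widths; verify against Megatron-LM before relying on them.
DTYPE_CODES = {"int32": 4, "uint16": 8}
TOKEN_SIZES = {"int32": 4, "uint16": 2}


def build_index_bytes(sequence_lengths, document_indices, dtype_name="uint16"):
    """Serialize index metadata for sequences stored back to back in a .bin file."""
    token_size = TOKEN_SIZES[dtype_name]

    # Byte offset of each sequence: running total of the sizes of the sequences before it.
    pointers, offset = [], 0
    for length in sequence_lengths:
        pointers.append(offset)
        offset += length * token_size

    n_seq, n_doc = len(sequence_lengths), len(document_indices)
    return b"".join([
        INDEX_HEADER,
        struct.pack("<Q", 1),                          # format version
        struct.pack("<B", DTYPE_CODES[dtype_name]),    # dtype code
        struct.pack("<Q", n_seq),                      # sequence count
        struct.pack("<Q", n_doc),                      # document count
        struct.pack(f"<{n_seq}i", *sequence_lengths),  # int32 sequence lengths
        struct.pack(f"<{n_seq}q", *pointers),          # int64 byte offsets
        struct.pack(f"<{n_doc}q", *document_indices),  # int64 document indices
    ])
```

Under these assumptions, two sequences of 3 and 2 uint16 tokens start at byte offsets 0 and 6 in the `.bin` file.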
The MegatronDocumentTokenizer class extends PipelineStepWithTokenizer and orchestrates the end-to-end tokenization process. It reads documents from the pipeline, tokenizes their text content in configurable batches using HuggingFace's fast tokenizer library, and writes the resulting token IDs through a MegatronTokenizedFile instance. The class supports both 2-byte (uint16) and 4-byte (int32) token representations, configurable batch sizes for efficient tokenization, and remote storage via fsspec with adjustable upload block sizes.
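The batching and token-width behaviour described above can be illustrated without any tokenizer dependency. The helpers `batched` and `pack_tokens` below are hypothetical names introduced for this sketch and are not part of datatrove; the sketch only shows how documents might be grouped into batches and how a list of token IDs maps to 2-byte or 4-byte little-endian values.

```python
import struct
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of at most batch_size items, preserving order."""
    it = iter(iterable)
    while batch := list(islice(it, batch_size)):
        yield batch


def pack_tokens(token_ids, token_size=2):
    """Pack token IDs as little-endian uint16 (2 bytes) or int32 (4 bytes)."""
    fmt = {2: "H", 4: "i"}[token_size]
    # struct raises struct.error if an ID does not fit the chosen width,
    # e.g. an ID above 65535 with token_size=2.
    return struct.pack(f"<{len(token_ids)}{fmt}", *token_ids)
```

A 2-byte representation can only hold IDs up to 65535, so tokenizers with larger vocabularies need the 4-byte representation.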
A helper function get_output_filename constructs deterministic output filenames based on an optional save filename prefix, the worker rank, and a descriptive name, ensuring unique file names across parallel workers.
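A naming scheme of this kind could look like the sketch below. It is not the actual `get_output_filename` implementation: the function name `sketch_output_filename`, the underscore separator, and the five-digit zero padding of the rank are all assumptions chosen for illustration.

```python
def sketch_output_filename(save_filename, rank, name):
    """Join an optional prefix, a zero-padded rank, and a descriptive name."""
    # Assumed format: "<prefix>_<rank padded to 5 digits>_<name>"; the prefix is skipped when None.
    parts = [save_filename, f"{rank:05d}", name]
    return "_".join(part for part in parts if part is not None)
```

Because the rank is part of the name, two workers writing to the same folder with the same prefix still produce distinct files.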
Usage
Use this step when you need to prepare tokenized data specifically for training with NVIDIA's Megatron-LM framework. The output format is directly compatible with Megatron-LM's data loading utilities, eliminating the need for format conversion before training.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/tokens/megatron_tokenizer.py
- Lines: 1-225
Signature
class MegatronTokenizedFile:
    def __init__(
        self,
        output_folder: DataFolderLike,
        filename: str,
        upload_block_size: int | None = None,
        token_size: int = 2,
    ):

class MegatronDocumentTokenizer(PipelineStepWithTokenizer):
    def __init__(
        self,
        output_folder: DataFolderLike,
        save_filename: str = None,
        tokenizer_name_or_path: str = "gpt2",
        eos_token: str = "<|endoftext|>",
        batch_size: int = 10000,
        upload_block_size: int | None = None,
    ):
Import
from datatrove.pipeline.tokens.megatron_tokenizer import MegatronDocumentTokenizer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| output_folder | DataFolderLike | Yes | The output folder where tokenized files are saved |
| save_filename | str | No | Custom filename prefix for output files (default: None) |
| tokenizer_name_or_path | str | No | HuggingFace tokenizer name or local path (default: "gpt2") |
| eos_token | str | No | End-of-sequence token (default: "<\|endoftext\|>") |
| batch_size | int | No | Number of documents to tokenize per batch (default: 10000) |
| upload_block_size | int or None | No | Block size for S3/remote uploads (default: None) |
Outputs
| Name | Type | Description |
|---|---|---|
| .bin file | Binary | Raw token data in Megatron binary format |
| .idx file | Binary | Index file with header, sequence lengths, byte offsets, and document indices |
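One way to sanity-check a produced `.idx` file is to parse its fixed-size header. The sketch below is not part of datatrove; the function name `parse_index_header` is hypothetical, and the field widths (64-bit version, 1-byte dtype code, 64-bit sequence and document counts, all little-endian) are assumptions about the Megatron-LM index layout that should be confirmed against the Megatron-LM source.

```python
import struct

INDEX_HEADER = b"MMIDIDX\x00\x00"  # 9-byte magic string
HEADER_FIELDS = "<QBQQ"            # assumed: version, dtype code, sequence count, document count


def parse_index_header(data: bytes) -> dict:
    """Read the fixed-size fields at the start of a Megatron-style index file."""
    if data[: len(INDEX_HEADER)] != INDEX_HEADER:
        raise ValueError("missing MMIDIDX magic header")
    version, dtype_code, sequence_count, document_count = struct.unpack_from(
        HEADER_FIELDS, data, len(INDEX_HEADER)
    )
    return {
        "version": version,
        "dtype_code": dtype_code,
        "sequence_count": sequence_count,
        "document_count": document_count,
    }
```

Under these assumptions the fixed part of the header is 34 bytes long, so reading the first 34 bytes of an `.idx` file is enough to call this function.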
Usage Examples
Basic Usage
from datatrove.pipeline.tokens.megatron_tokenizer import MegatronDocumentTokenizer

tokenizer = MegatronDocumentTokenizer(
    output_folder="output/megatron_data/",
    tokenizer_name_or_path="gpt2",
    eos_token="<|endoftext|>",
    batch_size=10000,
)