Implementation:Huggingface Datatrove MegatronTokenizer

Knowledge Sources	Huggingface_Datatrove
Domains	Tokenization, Data Processing
Last Updated	2026-02-14 17:00 GMT

Overview

MegatronDocumentTokenizer is a pipeline step that tokenizes documents using HuggingFace fast tokenizers and writes the output in NVIDIA Megatron-LM's binary indexed format (`.bin` and `.idx` files).

Description

This module provides two main classes: MegatronTokenizedFile and MegatronDocumentTokenizer. The MegatronTokenizedFile class manages writing tokenized data into Megatron-LM's binary format, producing a `.bin` file containing raw token data and a `.idx` file containing structured metadata including sequence lengths, byte offsets, and document indices. The index file follows a specific binary layout with a 9-byte header (`MMIDIDX\x00\x00`), version information, dtype code, sequence count, document count, and per-sequence metadata.
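The fixed-size part of the index layout described above can be sketched with Python's `struct` module. This is a minimal illustration, not the datatrove code: the field widths (uint64 version, uint8 dtype code, uint64 counts, little-endian) are assumptions taken from Megatron-LM's indexed dataset format, and the function names `write_index_header` and `read_index_header` are hypothetical.

```python
import io
import struct

# 9-byte magic header stated in the description above.
MAGIC = b"MMIDIDX\x00\x00"

# Assumed field widths, little-endian with no padding:
# uint64 version, uint8 dtype code, uint64 sequence count, uint64 document count.
FIXED_FIELDS = struct.Struct("<QBQQ")


def write_index_header(buf, dtype_code, sequence_count, document_count, version=1):
    """Write the magic bytes followed by the fixed-size header fields."""
    buf.write(MAGIC)
    buf.write(FIXED_FIELDS.pack(version, dtype_code, sequence_count, document_count))


def read_index_header(buf):
    """Parse the header back and fail loudly if the magic bytes do not match."""
    magic = buf.read(len(MAGIC))
    if magic != MAGIC:
        raise ValueError(f"unexpected magic bytes: {magic!r}")
    version, dtype_code, sequence_count, document_count = FIXED_FIELDS.unpack(
        buf.read(FIXED_FIELDS.size)
    )
    return {
        "version": version,
        "dtype_code": dtype_code,
        "sequence_count": sequence_count,
        "document_count": document_count,
    }


# Round trip through an in-memory buffer. The dtype code 8 is an assumed
# value for uint16, following Megatron-LM's dtype table.
buffer = io.BytesIO()
write_index_header(buffer, dtype_code=8, sequence_count=3, document_count=4)
buffer.seek(0)
header = read_index_header(buffer)
```

In the real file, the per-sequence metadata (sequence lengths, byte offsets, document indices) follows this header; the sketch stops before those arrays.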

The MegatronDocumentTokenizer class extends PipelineStepWithTokenizer and orchestrates the end-to-end tokenization process. It reads documents from the pipeline, tokenizes their text content in configurable batches using HuggingFace's fast tokenizer library, and writes the resulting token IDs through a MegatronTokenizedFile instance. The class supports both 2-byte (uint16) and 4-byte (int32) token representations, configurable batch sizes for efficient tokenization, and remote storage via fsspec with adjustable upload block sizes.
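The batching and token-width choices described above can be illustrated without the tokenizer itself. The sketch below is not the datatrove implementation: `batched` and `pack_token_ids` are hypothetical helpers, and the little-endian byte order is an assumption.

```python
import struct
from itertools import islice


def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def pack_token_ids(token_ids, token_size=2):
    """Serialize token IDs as raw bytes.

    Assumption: 2 bytes per token means little-endian uint16 and
    4 bytes per token means little-endian int32.
    """
    fmt = {2: "H", 4: "i"}[token_size]
    return struct.pack(f"<{len(token_ids)}{fmt}", *token_ids)


# Placeholder token IDs standing in for tokenizer output.
doc_batches = list(batched(range(5), 2))
two_byte = pack_token_ids([1, 50256], token_size=2)
four_byte = pack_token_ids([1, 50256], token_size=4)
```

A 2-byte representation halves the size of the `.bin` file but only fits vocabularies with IDs up to 65535; in this sketch, `struct.pack` raises `struct.error` for a larger ID rather than silently truncating it.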

A helper function get_output_filename constructs deterministic output filenames based on an optional save filename prefix, the worker rank, and a descriptive name, ensuring unique file names across parallel workers.
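One way such a naming scheme could look is sketched below. The function `sketch_output_filename` is hypothetical, and both the five-digit zero-padding of the rank and the underscore separator are assumptions rather than details confirmed by this page.

```python
def sketch_output_filename(save_filename, rank, name):
    """Join an optional prefix, a zero-padded rank, and a descriptive name.

    Assumption: the rank is padded to five digits and parts are joined
    with underscores; empty or missing parts are skipped.
    """
    parts = [save_filename, f"{rank:05d}", name]
    return "_".join(part for part in parts if part)


without_prefix = sketch_output_filename(None, 3, "tokens")
with_prefix = sketch_output_filename("run", 12, "tokens")
```

Because the rank is part of every name, two workers writing to the same output folder never produce the same filename, even when they share a prefix.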

Usage

Use this step when you need to prepare tokenized data specifically for training with NVIDIA's Megatron-LM framework. The output format is directly compatible with Megatron-LM's data loading utilities, eliminating the need for format conversion before training.

Code Reference

Source Location

Repository: Huggingface_Datatrove
File: src/datatrove/pipeline/tokens/megatron_tokenizer.py
Lines: 1-225

Signature

class MegatronTokenizedFile:
    def __init__(
        self,
        output_folder: DataFolderLike,
        filename: str,
        upload_block_size: int | None = None,
        token_size: int = 2,
    ):

class MegatronDocumentTokenizer(PipelineStepWithTokenizer):
    def __init__(
        self,
        output_folder: DataFolderLike,
        save_filename: str = None,
        tokenizer_name_or_path: str = "gpt2",
        eos_token: str = "<|endoftext|>",
        batch_size: int = 10000,
        upload_block_size: int | None = None,
    ):

Import

from datatrove.pipeline.tokens.megatron_tokenizer import MegatronDocumentTokenizer

I/O Contract

Inputs

Name	Type	Required	Description
output_folder	DataFolderLike	Yes	The output folder where tokenized files are saved
save_filename	str	No	Custom filename prefix for output files (default: None)
tokenizer_name_or_path	str	No	HuggingFace tokenizer name or local path (default: "gpt2")
eos_token	str	No	End-of-sequence token (default: "<|endoftext|>")
batch_size	int	No	Number of documents to tokenize per batch (default: 10000)
upload_block_size	int or None	No	Block size for S3/remote uploads (default: None)

Outputs

Name	Type	Description
.bin file	Binary	Raw token data in Megatron binary format
.idx file	Binary	Index file with header, sequence lengths, byte offsets, and document indices

Usage Examples

Basic Usage

from datatrove.pipeline.tokens.megatron_tokenizer import MegatronDocumentTokenizer

tokenizer = MegatronDocumentTokenizer(
    output_folder="output/megatron_data/",
    tokenizer_name_or_path="gpt2",
    eos_token="<|endoftext|>",
    batch_size=10000,
)

Related Pages

Principle:Huggingface_Datatrove_Megatron_Format_Tokenization
