Workflow:Huggingface Datatrove Dataset Tokenization
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP, LLM_Training |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Two-stage pipeline for tokenizing text datasets into binary token files and merging them into a consolidated training-ready format.
Description
This workflow converts cleaned text data into tokenized binary files suitable for LLM training. It operates in two phases. In the first, a distributed tokenization step, each parallel task reads a shard of input documents, tokenizes them with a HuggingFace tokenizer, shuffles the results, and writes per-task binary output files. In the second, a single-task merge step reads all per-task tokenized files and combines them into a final consolidated binary dataset, with optional context-window-level shuffling. The binary output format (.ds files) supports efficient random-access reading during training.
Usage
Execute this workflow after data cleaning and deduplication when you need to prepare a text corpus for LLM pretraining. The input can be any readable data source (JSONL, Parquet, HuggingFace Hub datasets). The output is a binary tokenized dataset ready for consumption by training frameworks.
Execution Steps
Step 1: Read Source Data
Load the input dataset from local storage, S3, or directly from the HuggingFace Hub. The reader distributes input files across parallel tasks so that each worker tokenizes a non-overlapping shard. Supported input formats are JSONL, Parquet, CSV, and HuggingFace datasets.
Key considerations:
- HuggingFace datasets can be read directly from the Hub using the hf:// protocol
- Configure the text_key to match the column containing the text to tokenize
- The number of tasks should not exceed the number of input files: files are distributed whole, so any extra task would have nothing to read
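The file-to-task assignment described above can be sketched in a few lines. This is a minimal illustration, not the reader's actual code: the round-robin rule, the `shard_files` helper, and the file names are all assumptions made for the example.

```python
def shard_files(files, rank, world_size):
    """Return the files assigned to task `rank` out of `world_size` tasks.

    Assumes a simple round-robin split over the sorted file list, so that
    every task sees a disjoint subset and all files are covered.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return sorted(files)[rank::world_size]


# Hypothetical input: five files spread over eight tasks.
files = [f"data/part-{i:03d}.jsonl" for i in range(5)]
assignments = {rank: shard_files(files, rank, 8) for rank in range(8)}

# Tasks that received no files at all.
idle = [rank for rank, shard in assignments.items() if not shard]
```

With five files and eight tasks, three tasks end up in `idle`, which is why the task count is best kept at or below the file count.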
Step 2: Distributed Tokenization
Each task tokenizes its shard of documents using the specified HuggingFace tokenizer. Documents are encoded into token IDs, separated by EOS tokens, and written to binary .ds files. Each task also produces a local index mapping document boundaries within the binary file. Within-task shuffling is applied to randomize document order.
Key considerations:
- Uses a local working directory for fast I/O during tokenization, then uploads to final storage
- EOS tokens mark document boundaries in the binary stream
- Each task produces its own binary file, indexed for random access
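To make the per-task output concrete, the sketch below writes a token stream with EOS separators plus an index of document boundaries, then reads one document back by position. The byte layout (little-endian 16-bit token IDs, 64-bit cumulative end offsets), the `toy_tokenize` stand-in tokenizer, and the `EOS_ID` value are assumptions chosen for illustration; they are not a specification of the .ds format.

```python
import io
import random
import struct

EOS_ID = 0  # assumed EOS token id for this toy vocabulary


def toy_tokenize(text, vocab):
    """Stand-in for a real tokenizer: one id per whitespace-separated word."""
    return [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]


def write_shard(docs, tokens_out, index_out, seed=0):
    """Tokenize, shuffle, and write documents; return the total token count."""
    vocab = {}
    encoded = [toy_tokenize(doc, vocab) + [EOS_ID] for doc in docs]
    random.Random(seed).shuffle(encoded)  # within-task document shuffle
    end = 0
    for ids in encoded:
        tokens_out.write(struct.pack(f"<{len(ids)}H", *ids))
        end += len(ids)
        index_out.write(struct.pack("<Q", end))  # cumulative end offset
    return end


def read_document(i, tokens_buf, index_buf):
    """Random access: fetch document `i` using only the index offsets."""
    ends = struct.unpack(f"<{len(index_buf) // 8}Q", index_buf)
    start = ends[i - 1] if i > 0 else 0
    count = ends[i] - start
    return list(struct.unpack_from(f"<{count}H", tokens_buf, start * 2))


docs = ["the cat sat", "hello world", "a b c d"]
tokens_out, index_out = io.BytesIO(), io.BytesIO()
total = write_shard(docs, tokens_out, index_out)
recovered = [
    read_document(i, tokens_out.getvalue(), index_out.getvalue())
    for i in range(len(docs))
]
```

Because the index stores cumulative end offsets, locating any document takes two index lookups and one contiguous read, without scanning the stream for EOS tokens.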
Step 3: Merge and Shuffle
A single merge task reads all per-task tokenized files and combines them into one or more final binary files with configurable maximum token limits per output file. The merger applies context-window-level shuffling by randomly sampling from different source files, producing a well-shuffled training dataset.
Key considerations:
- Must run as a single task to produce a coherent merged output
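- Context-window shuffling interleaves windows drawn from different source files, so consecutive training samples are less likely to come from the same shard
- The maximum token limit per output file keeps each file at a size suited to efficient data loading during training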
- This stage depends on the distributed tokenization completing first
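The merge logic can be illustrated with in-memory lists standing in for the per-task binary files. This is a sketch under stated assumptions, not the merger's actual algorithm: `merge_shards`, its parameters, and the sampling rule (pick a source at random, weighted by how many windows it has left) are hypothetical.

```python
import random


def merge_shards(shards, window, max_tokens_per_file, seed=0):
    """Interleave fixed-size windows from several shards into output files.

    Each output file holds at most `max_tokens_per_file` tokens, provided
    `window` does not exceed that limit.
    """
    rng = random.Random(seed)
    # Split each shard into consecutive windows; the last one may be shorter.
    queues = [
        [shard[i:i + window] for i in range(0, len(shard), window)]
        for shard in shards
    ]
    positions = [0] * len(queues)
    outputs = [[]]
    while True:
        remaining = [len(q) - p for q, p in zip(queues, positions)]
        if not any(remaining):
            break
        # Sample a source file, weighted by its number of unread windows.
        src = rng.choices(range(len(queues)), weights=remaining)[0]
        chunk = queues[src][positions[src]]
        positions[src] += 1
        # Start a new output file once the token limit would be exceeded.
        if outputs[-1] and len(outputs[-1]) + len(chunk) > max_tokens_per_file:
            outputs.append([])
        outputs[-1].extend(chunk)
    return outputs


# Three hypothetical per-task token streams of different lengths.
shards = [list(range(0, 10)), list(range(100, 106)), list(range(200, 204))]
merged = merge_shards(shards, window=2, max_tokens_per_file=8)
```

Weighting by remaining windows keeps larger shards from being exhausted last, which would otherwise leave the tail of the merged output dominated by a single source. In this sketch the windows of one source keep their relative order; only the interleaving between sources is random.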