Workflow:Huggingface Datatrove Dataset Tokenization

From Leeroopedia
Revision as of 11:03, 16 February 2026 by Admin (talk | contribs) (Auto-imported from workflows/Huggingface_Datatrove_Dataset_Tokenization.md)
Knowledge Sources
Domains: Data_Engineering, NLP, LLM_Training
Last Updated: 2026-02-14 17:00 GMT

Overview

Two-stage pipeline for tokenizing text datasets into binary token files and merging them into a consolidated training-ready format.

Description

This workflow converts cleaned text data into tokenized binary files suitable for LLM training. It operates in two phases. First, a distributed tokenization step runs many parallel tasks; each task reads a shard of the input documents, tokenizes them with a HuggingFace tokenizer, shuffles the results, and writes its own binary output file. Second, a single-task merge step reads all per-task tokenized files and combines them into a final consolidated binary dataset, optionally shuffling at the context-window level. The binary output format (.ds files) supports efficient random-access reading during training.

Usage

Run this workflow after data cleaning and deduplication, when you need to prepare a text corpus for LLM pretraining. The input can be any readable data source (for example JSONL, Parquet, CSV, or HuggingFace Hub datasets). The output is a binary tokenized dataset ready for consumption by training frameworks.

Execution Steps

Step 1: Read Source Data

Load the input dataset from local storage, S3, or directly from the HuggingFace Hub. The reader distributes input files across parallel tasks so that each worker tokenizes a non-overlapping shard. It supports JSONL, Parquet, CSV, and HuggingFace dataset formats.

Key considerations:

  • HuggingFace datasets can be read directly from the Hub using the hf:// protocol
  • Configure the text_key to match the column containing the text to tokenize
  • The number of tasks should not exceed the number of input files: work is distributed per file, so any extra tasks have nothing to read
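
The file-to-task assignment described in this step can be illustrated with a minimal sketch. It assumes a simple strided split over a sorted file list; the function name `shard_for_task` and the file names are hypothetical, and the actual reader may assign files differently.

```python
from typing import List


def shard_for_task(files: List[str], rank: int, world_size: int) -> List[str]:
    """Return the input files assigned to task `rank` out of `world_size` tasks.

    Assumption: a strided split. This illustrates non-overlapping sharding,
    not necessarily the exact rule the reader uses.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    # Sort first so every task sees the same ordering, then take every
    # world_size-th file starting at this task's rank.
    return sorted(files)[rank::world_size]


# Five hypothetical input files spread over eight tasks.
files = [f"part-{i:03d}.jsonl" for i in range(5)]
shards = [shard_for_task(files, rank, 8) for rank in range(8)]
```

With five files and eight tasks, ranks 5 to 7 receive an empty list, which is why running more tasks than input files wastes workers.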

Step 2: Distributed Tokenization

Each task tokenizes its shard of documents using the specified HuggingFace tokenizer. Documents are encoded into token IDs, separated by EOS tokens, and written to binary .ds files. Each task also produces a local index mapping document boundaries within the binary file. Within-task shuffling is applied to randomize document order.

Key considerations:

  • Uses a local working directory for fast I/O during tokenization, then uploads to final storage
  • EOS tokens mark document boundaries in the binary stream
  • Each task produces its own binary file, indexed for random access
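
To make the binary layout concrete, here is a minimal sketch of a token stream with EOS separators and a document-boundary index. The 2-byte little-endian token width, the EOS id of 0, and the helper names are assumptions for illustration only; the real .ds format and its index file may differ, and within-task shuffling is omitted.

```python
import io
import struct
from typing import List, Tuple

EOS_ID = 0        # hypothetical EOS token id
TOKEN_FMT = "<H"  # assumption: 2 bytes per token (vocabulary below 65536)
TOKEN_SIZE = struct.calcsize(TOKEN_FMT)


def write_tokens(docs: List[List[int]]) -> Tuple[bytes, List[int]]:
    """Serialize documents into one token stream plus an index of end offsets."""
    stream = io.BytesIO()
    doc_ends: List[int] = []
    total = 0
    for token_ids in docs:
        # Append EOS so the document boundary is visible in the stream itself.
        for tok in token_ids + [EOS_ID]:
            stream.write(struct.pack(TOKEN_FMT, tok))
        total += len(token_ids) + 1
        doc_ends.append(total)  # cumulative token count, EOS included
    return stream.getvalue(), doc_ends


def read_document(data: bytes, doc_ends: List[int], i: int) -> List[int]:
    """Random access: decode only the bytes that belong to document i."""
    start = doc_ends[i - 1] if i > 0 else 0
    end = doc_ends[i]
    chunk = data[start * TOKEN_SIZE : end * TOKEN_SIZE]
    return [t[0] for t in struct.iter_unpack(TOKEN_FMT, chunk)]


docs = [[11, 12, 13], [21, 22], [31]]
data, doc_ends = write_tokens(docs)
second_doc = read_document(data, doc_ends, 1)
```

Because the index stores cumulative end positions, a reader can seek straight to any document without scanning the stream for EOS tokens.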

Step 3: Merge and Shuffle

A single merge task reads all per-task tokenized files and combines them into one or more final binary files with configurable maximum token limits per output file. The merger applies context-window-level shuffling by randomly sampling from different source files, producing a well-shuffled training dataset.

Key considerations:

  • Must run as a single task to produce a coherent merged output
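  • Context-window shuffling mixes samples from different source files, so consecutive training sequences do not all come from the same shard
  • Output file size is bounded by the configurable maximum token limit per file, which keeps files manageable for data loading during training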
  • This stage depends on the distributed tokenization completing first
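
The merge stage can be sketched in a simplified, in-memory form. It assumes fixed-size context windows, drops any trailing partial window, and shuffles with a seeded generator; `merge_and_shuffle` and its parameters are hypothetical names, and the actual merger works on files on disk and may sample differently.

```python
import random
from typing import List


def merge_and_shuffle(
    task_streams: List[List[int]],
    window_size: int,
    max_tokens_per_file: int,
    seed: int = 0,
) -> List[List[int]]:
    """Cut each per-task stream into fixed-size windows, shuffle the windows
    across all tasks, and pack them into size-capped output files."""
    if max_tokens_per_file < window_size:
        raise ValueError("max_tokens_per_file must hold at least one window")
    windows: List[List[int]] = []
    for stream in task_streams:
        # Assumption: drop a trailing partial window so all samples have equal length.
        usable = len(stream) - len(stream) % window_size
        for start in range(0, usable, window_size):
            windows.append(stream[start:start + window_size])
    # Seeded shuffle keeps the merged order reproducible.
    random.Random(seed).shuffle(windows)

    outputs: List[List[int]] = [[]]
    for window in windows:
        # Start a new output file once the next window would exceed the cap.
        if len(outputs[-1]) + window_size > max_tokens_per_file:
            outputs.append([])
        outputs[-1].extend(window)
    return outputs


# Two hypothetical per-task token streams.
streams = [list(range(0, 10)), list(range(100, 108))]
merged = merge_and_shuffle(streams, window_size=4, max_tokens_per_file=8, seed=42)
```

Windows stay intact while their order is randomized across source streams, and each output file holds at most `max_tokens_per_file` tokens.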

GitHub URL

Workflow Repository