
Principle:Huggingface Datatrove Tokenized Dataset Loading

From Leeroopedia
Knowledge Sources
Domains Data Loading, Machine Learning Training
Last Updated 2026-02-14 17:00 GMT

Overview

Tokenized Dataset Loading is the principle of efficiently reading pre-tokenized binary data files for language model training, with support for document-aware position tracking and multi-file aggregation.

Description

Language model training requires feeding sequences of token IDs to the model at high throughput. The tokenized dataset loading principle addresses this by reading pre-processed binary files that contain packed token sequences, avoiding the overhead of on-the-fly tokenization during training. The binary format stores tokens as fixed-size unsigned integers in a flat file: 2 bytes when every token ID fits in 16 bits (vocabularies of up to 65,536 entries), 4 bytes for larger vocabularies. Each training sample is a contiguous window of seq_len + 1 tokens.
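The layout described above can be sketched with the standard library alone. This is a minimal illustration, not datatrove's implementation: the helper names (token_size_for_vocab, num_windows, read_window, write_demo_file) are hypothetical, and little-endian byte order is an assumption that the text does not state.

```python
import os
import struct
import tempfile


def token_size_for_vocab(vocab_size: int) -> int:
    """Return 2 if every token ID fits in an unsigned 16-bit integer, else 4."""
    return 2 if vocab_size <= 2**16 else 4


def num_windows(file_size: int, seq_len: int, token_size: int) -> int:
    """Number of complete windows; a trailing partial window is dropped."""
    return file_size // token_size // (seq_len + 1)


def read_window(path: str, index: int, seq_len: int, token_size: int = 2):
    """Read window `index` and split it into inputs and shifted targets."""
    window_bytes = (seq_len + 1) * token_size
    code = "H" if token_size == 2 else "I"  # unsigned 16-bit or 32-bit
    with open(path, "rb") as f:
        f.seek(index * window_bytes)
        data = f.read(window_bytes)
    if len(data) != window_bytes:
        raise IndexError(f"window {index} is out of range")
    # Assumption: tokens are stored little-endian.
    tokens = list(struct.unpack(f"<{seq_len + 1}{code}", data))
    return tokens[:-1], tokens[1:]


def write_demo_file(tokens, token_size: int = 2) -> str:
    """Write tokens to a temporary flat binary file and return its path."""
    tokens = list(tokens)
    code = "H" if token_size == 2 else "I"
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        f.write(struct.pack(f"<{len(tokens)}{code}", *tokens))
    return path


if __name__ == "__main__":
    demo_path = write_demo_file(range(11))  # 11 tokens, 22 bytes
    print(num_windows(os.path.getsize(demo_path), seq_len=4, token_size=2))  # 2
    print(read_window(demo_path, 1, seq_len=4))  # ([5, 6, 7, 8], [6, 7, 8, 9])
    os.remove(demo_path)
```

With seq_len = 4, each window holds 5 tokens, so the 11-token demo file yields two complete windows and the last token is left over. The inputs and targets overlap by seq_len - 1 tokens because the targets are the inputs shifted by one.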

A key feature is document-aware position tracking, which enables models to learn correct position embeddings even when training sequences span multiple documents. Positions reset to 0 at document boundaries. The boundaries are determined either from a companion .index file (containing document end byte offsets) or by detecting end-of-sequence tokens within the token stream.
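A minimal sketch of position resetting, assuming boundaries are found from end-of-sequence tokens. It uses a cumulative sum over per-token increments. The function names are hypothetical, and two simplifications are assumptions rather than documented behavior: the window's first token is given position 0 (a document continuing from the previous window is not tracked), and the end-of-sequence token is counted as the last token of the document it closes.

```python
from itertools import accumulate


def doc_starts_from_eos(tokens, eos_token_id):
    """Offsets in the window where a new document begins (token after each EOS)."""
    return [i + 1 for i, tok in enumerate(tokens[:-1]) if tok == eos_token_id]


def positions_from_doc_starts(length, doc_starts):
    """Position IDs that restart at 0 at every offset in `doc_starts`.

    `doc_starts` must be strictly increasing offsets in the range 1..length-1.
    """
    if length == 0:
        return []
    increments = [1] * length  # by default each token is one past the previous
    increments[0] = 0          # assumption: the window begins at position 0
    prev_start = 0
    for start in doc_starts:
        # Cancel everything accumulated since the previous boundary so that
        # the running sum is exactly 0 at `start`.
        increments[start] = prev_start - start + 1
        prev_start = start
    return list(accumulate(increments))


if __name__ == "__main__":
    window = [11, 12, 0, 21, 22, 23, 0, 31]  # 0 plays the role of EOS here
    starts = doc_starts_from_eos(window, eos_token_id=0)
    print(starts)                                           # [3, 7]
    print(positions_from_doc_starts(len(window), starts))   # [0, 1, 2, 0, 1, 2, 3, 0]
```

The same positions_from_doc_starts helper would apply to boundaries read from an index file, once those boundaries are converted to token offsets relative to the start of the window.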

Usage

Apply this principle when building training data pipelines where tokenization is performed as a preprocessing step (via datatrove's tokenization pipeline) and the training loop reads from the resulting binary files.

Theoretical Basis

The tokenized dataset loading approach is built on several key concepts:

  • Fixed-Size Token Windows: The dataset divides the binary file into non-overlapping windows of (seq_len + 1) * token_size bytes. The extra token provides the target for next-token prediction. The total number of windows is floor(file_size / token_size / (seq_len + 1)), so any tokens left over after the last complete window are not used.
  • Sequential Access Optimization: The data is pre-shuffled during tokenization, so the dataset is optimized for sequential reads. File handles are kept open between accesses, and the file pointer is advanced by seeking to the correct position. Random access works but incurs seek overhead.
  • Document-Aware Position Computation: From the .index file, document end positions within each window are extracted. Positions are computed using a cumulative sum trick:
    • Initialize a vector of ones (each position increments by 1)
    • At each document boundary, set the increment to prev_end - current_end + 1 (this cancels the count accumulated since the previous boundary, so the running sum resets to 0)
    • Take the cumulative sum to produce the final position sequence
  • Multi-File Aggregation: For datasets spanning multiple files, cumulative length arrays enable O(log n) index-to-file mapping via binary search (bisect). A current_file pointer is maintained for fast sequential access within the same file.
  • Distributed-Friendly File Caching: A JSON paths file can cache the list of discovered files and their sizes, with file-level locking (via fasteners) to prevent race conditions when multiple training workers simultaneously discover and cache file lists.
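The multi-file index mapping from the list above can be sketched as follows. This is an illustrative sketch, not datatrove's code: the class name MultiFileIndex and the method locate are hypothetical, and the per-file window counts are assumed to be known in advance (for example from cached file sizes).

```python
from bisect import bisect_right
from itertools import accumulate


class MultiFileIndex:
    """Map a global window index to (file number, window index within that file)."""

    def __init__(self, windows_per_file):
        # cumulative[i] is the total number of windows in files 0..i.
        self.cumulative = list(accumulate(windows_per_file))
        self.current_file = 0  # file that served the previous lookup

    def __len__(self):
        return self.cumulative[-1] if self.cumulative else 0

    def _file_start(self, file_no):
        return self.cumulative[file_no - 1] if file_no > 0 else 0

    def locate(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Fast path: sequential reads usually stay inside the current file.
        in_current = (
            self._file_start(self.current_file)
            <= index
            < self.cumulative[self.current_file]
        )
        if not in_current:
            # O(log n) lookup: the first file whose cumulative count exceeds index.
            self.current_file = bisect_right(self.cumulative, index)
        return self.current_file, index - self._file_start(self.current_file)


if __name__ == "__main__":
    idx = MultiFileIndex([3, 2, 4])  # three files with 3, 2 and 4 windows
    print(len(idx))       # 9
    print(idx.locate(0))  # (0, 0)
    print(idx.locate(3))  # (1, 0)
    print(idx.locate(8))  # (2, 3)
```

Using bisect_right rather than bisect_left means an index equal to a cumulative total maps to the next file, and files that contribute zero windows are skipped automatically.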
