Principle:Huggingface Datatrove Tokenized Dataset Loading
| Knowledge Sources | |
|---|---|
| Domains | Data Loading, Machine Learning Training |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Tokenized Dataset Loading is the principle of efficiently reading pre-tokenized binary data files for language model training, with support for document-aware position tracking and multi-file aggregation.
Description
Language model training requires feeding sequences of token IDs to the model at high throughput. The tokenized dataset loading principle addresses this by reading pre-processed binary files that contain packed token sequences, avoiding the overhead of on-the-fly tokenization during training. The binary format stores tokens as fixed-size integers (2 bytes for vocabularies under 65k, 4 bytes for larger ones) in a flat file, with each training sample being a contiguous window of seq_len + 1 tokens.
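The window layout described above can be sketched in a few lines of NumPy. This is an illustrative reader, not datatrove's own implementation: the helper names `num_windows` and `read_window` are hypothetical, the little-endian byte order is an assumption, and the split into inputs and targets is just one way to use the extra token.

```python
import os

import numpy as np


def num_windows(path, seq_len, token_size=2):
    """Number of complete (seq_len + 1)-token windows in a flat token file."""
    return os.path.getsize(path) // token_size // (seq_len + 1)


def read_window(path, index, seq_len, token_size=2):
    """Read window `index` and split it into (inputs, targets).

    Assumption: tokens are stored as little-endian unsigned integers.
    """
    if not 0 <= index < num_windows(path, seq_len, token_size):
        raise IndexError(index)
    dtype = np.dtype("<u2") if token_size == 2 else np.dtype("<u4")
    window_bytes = (seq_len + 1) * token_size
    with open(path, "rb") as f:
        f.seek(index * window_bytes)  # windows do not overlap
        raw = f.read(window_bytes)
    tokens = np.frombuffer(raw, dtype=dtype)
    # The extra token shifts the targets one step ahead of the inputs.
    return tokens[:-1], tokens[1:]
```

A real loader would keep the file handle open between calls rather than reopening it per sample; the sketch reopens the file only to stay short.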
A key feature is document-aware position tracking, which lets a model receive correct position information even when a training sequence spans multiple documents. Positions reset to 0 at document boundaries, which can be determined either from a companion .index file (containing document end byte offsets) or by detecting end-of-sequence tokens within the token stream.
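As a rough illustration of the end-of-sequence variant, the sketch below derives per-token positions for a single window using a cumulative sum. The function name `positions_from_eos` is hypothetical, and two conventions are assumptions rather than confirmed behaviour: the EOS token is counted as the last token of its document, and every window starts at position 0 even if it begins mid-document.

```python
import numpy as np


def positions_from_eos(tokens, eos_token_id):
    """Positions that restart at 0 on the token following each EOS."""
    tokens = np.asarray(tokens)
    n = len(tokens)
    if n == 0:
        return np.zeros(0, dtype=np.int64)
    # Index of the first token of every document that starts inside the window.
    starts = np.flatnonzero(tokens == eos_token_id) + 1
    starts = starts[starts < n]  # an EOS in the last slot starts nothing here
    increments = np.ones(n, dtype=np.int64)
    increments[0] = 0  # assumption: the window itself begins at position 0
    if starts.size:
        prev_starts = np.concatenate(([0], starts[:-1]))
        # Cancel the count accumulated since the previous start.
        increments[starts] = prev_starts - starts + 1
    return np.cumsum(increments)
```

For example, with `eos_token_id=0` the window `[5, 6, 0, 7, 8, 0, 9, 4]` contains three document segments, and the positions restart after each EOS.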
Usage
Apply this principle when building training data pipelines where tokenization is performed as a preprocessing step (via datatrove's tokenization pipeline) and the training loop reads from the resulting binary files.
Theoretical Basis
The tokenized dataset loading approach is built on several key concepts:
- Fixed-Size Token Windows: The dataset divides the binary file into non-overlapping windows of (seq_len + 1) * token_size bytes. The extra token provides the target for next-token prediction. The total number of windows is file_size / token_size / (seq_len + 1), rounded down, so trailing tokens that do not fill a complete window are not used.
- Sequential Access Optimization: The data is pre-shuffled during tokenization, so the dataset is optimized for sequential reads. File handles are kept open between accesses, and the file pointer is advanced by seeking to the correct position. Random access works but incurs seek overhead.
- Document-Aware Position Computation: From the .index file, document end positions within each window are extracted. Positions are computed using a cumulative sum trick:
  - Initialize a vector of ones (each position increments by 1)
  - At each document boundary, set the increment to prev_end - current_end + 1 (which creates a reset to 0)
  - Take the cumulative sum to produce the final position sequence
- Multi-File Aggregation: For datasets spanning multiple files, cumulative length arrays enable O(log n) index-to-file mapping via binary search (bisect). A current_file pointer is maintained for fast sequential access within the same file.
- Distributed-Friendly File Caching: A JSON paths file can cache the list of discovered files and their sizes, with file-level locking (via fasteners) to prevent race conditions when multiple training workers simultaneously discover and cache file lists.
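The multi-file lookup described in the list above can be sketched with the standard-library `bisect` module. The class name `MultiFileIndex` and its `locate` method are hypothetical names chosen for illustration; only the cumulative-length array, the binary search, and the `current_file` pointer come from the description.

```python
from bisect import bisect_right
from itertools import accumulate


class MultiFileIndex:
    """Map a global sample index to (file number, index within that file)."""

    def __init__(self, file_lengths):
        # cumulative[i] = number of samples in files 0..i inclusive
        self.cumulative = list(accumulate(file_lengths))
        self.current_file = 0

    def __len__(self):
        return self.cumulative[-1] if self.cumulative else 0

    def _start(self, file_no):
        return self.cumulative[file_no - 1] if file_no > 0 else 0

    def locate(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Fast path: sequential reads usually stay in the current file.
        in_current = (
            self._start(self.current_file)
            <= index
            < self.cumulative[self.current_file]
        )
        if not in_current:
            # O(log n) search for the first file whose cumulative end exceeds index.
            self.current_file = bisect_right(self.cumulative, index)
        return self.current_file, index - self._start(self.current_file)
```

With file lengths `[3, 2, 4]`, global index 3 falls at the start of the second file and global index 8 is the last sample of the third.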