Implementation:Huggingface Datatrove TokenDataset

Knowledge Sources	Huggingface_Datatrove
Domains	Data Loading, Machine Learning Training
Last Updated	2026-02-14 17:00 GMT

Overview

DatatroveFileDataset and DatatroveFolderDataset are PyTorch Dataset implementations that read tokenized data files (.ds format) produced by datatrove's tokenization pipeline. The first reads a single file; the second reads every matching file in a folder.

Description

DatatroveFileDataset wraps a single .ds binary file containing pre-tokenized training data. It derives the number of full sequence windows from the file size and the token size, then reads each window by seeking directly to its byte offset. Positions can optionally be returned in one of two ways: by reading document boundaries from a companion .index file (through a deque-based sequential reader that tracks document end positions), or by computing them from an end-of-sequence token ID found in the token stream itself. Position tracking enables document-aware training, in which position embeddings reset at document boundaries.
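
The window arithmetic and seek-based read described above can be sketched in plain Python. This is an illustrative sketch, not the library's code: the helper names num_windows and read_window are hypothetical, little-endian byte order is an assumption, and the window width of seq_len + 1 tokens is inferred from the 2049-wide tensors shown for seq_len=2048 in the usage example.

```python
import os
import struct
import tempfile


def num_windows(fsize: int, seq_len: int, token_size: int = 2) -> int:
    """Number of full (seq_len + 1)-token windows in a file of fsize bytes."""
    num_tokens = fsize // token_size
    return num_tokens // (seq_len + 1)


def read_window(f, index: int, seq_len: int, token_size: int = 2) -> list[int]:
    """Seek to window `index` and decode its tokens as unsigned integers."""
    chunk_size = token_size * (seq_len + 1)
    f.seek(index * chunk_size)
    raw = f.read(chunk_size)
    fmt = "H" if token_size == 2 else "I"  # uint16 or uint32
    # Little-endian byte order is an assumption of this sketch.
    return list(struct.unpack(f"<{len(raw) // token_size}{fmt}", raw))


# Write tokens 0..9 as uint16 into a throwaway file, then read windows of 3 + 1 tokens.
with tempfile.NamedTemporaryFile(suffix=".ds", delete=False) as tmp:
    tmp.write(struct.pack("<10H", *range(10)))
    path = tmp.name

with open(path, "rb") as f:
    total = num_windows(os.path.getsize(path), seq_len=3)
    windows = [read_window(f, i, seq_len=3) for i in range(total)]
os.remove(path)
```

The two trailing tokens (8 and 9) do not fill a whole window, so they are dropped; only full windows count toward the dataset length.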

DatatroveFolderDataset aggregates multiple DatatroveFileDataset instances, one per .ds file in a folder. It maps a global sample index to the file that holds it using an array of cumulative lengths and Python's bisect module, which costs O(log n) in the number of files, and it keeps a current_file pointer so that sequential access does not repeat the search. It supports file discovery via glob patterns, optional shuffling of the file order, and caching of file paths and sizes in a JSON file (with file locking via fasteners) so that multiple workers do not repeat the same filesystem operations.
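
The cumulative-length lookup can be illustrated with a short standalone sketch. The function names build_cumulative and locate are hypothetical, and the sketch leaves out the current_file pointer that the class keeps for sequential access.

```python
from bisect import bisect_right
from itertools import accumulate


def build_cumulative(lengths: list[int]) -> list[int]:
    """Running totals: cumulative[i] is the number of samples in files 0..i."""
    return list(accumulate(lengths))


def locate(cumulative: list[int], index: int) -> tuple[int, int]:
    """Map a global sample index to (file_number, index_within_file)."""
    if not 0 <= index < cumulative[-1]:
        raise IndexError(index)
    # The first file whose running total exceeds `index` contains the sample.
    file_no = bisect_right(cumulative, index)
    start = cumulative[file_no - 1] if file_no > 0 else 0
    return file_no, index - start


cumulative = build_cumulative([3, 5, 2])  # three hypothetical files
```

With lengths 3, 5 and 2, global index 3 is the first sample of the second file, so locate(cumulative, 3) returns (1, 0).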

Both classes are conditionally defined only when PyTorch is available and are optimized for sequential reads (the data is pre-shuffled during tokenization), though random access is supported at reduced performance.

Usage

Use these datasets when training language models with PyTorch on data that has been tokenized and saved in datatrove's .ds binary format. They serve as the bridge between datatrove's data processing pipeline and PyTorch DataLoader.

Code Reference

Source Location

Repository: Huggingface_Datatrove
File: src/datatrove/utils/dataset.py
Lines: 1-352

Signature

class DatatroveFileDataset(Dataset):
    def __init__(
        self,
        file_path: DataFileLike,
        seq_len: int,
        token_size: int = 2,
        return_positions: bool = False,
        positions_from_eos_token_id: int | None = None,
        fsize: int | None = None,
    ):

class DatatroveFolderDataset(Dataset):
    def __init__(
        self,
        data_folder: DataFolderLike,
        seq_len: int,
        filename_pattern: str = ".ds",
        recursive: bool = True,
        token_size: int = 2,
        shuffle: bool = False,
        seed: int = 42,
        return_positions: bool = False,
        positions_from_eos_token_id: int | None = None,
        paths_file: str | None = None,
    ):

Import

from datatrove.utils.dataset import DatatroveFileDataset, DatatroveFolderDataset

I/O Contract

Inputs

Name	Type	Required	Description
file_path	DataFileLike	Yes (FileDataset)	Path to a single .ds file (local, S3, or fsspec-supported)
data_folder	DataFolderLike	Yes (FolderDataset)	Path to folder containing .ds files
seq_len	int	Yes	Sequence length for each training sample
token_size	int	No	Bytes per token: 2 for vocab <65k, 4 for larger (default: 2)
return_positions	bool	No	Whether to return document-aware positions (default: False)
positions_from_eos_token_id	int | None	No	EOS token ID used to compute positions (default: None, which reads the .index file instead)
filename_pattern	str	No	Glob pattern for matching files in folder (default: ".ds")
shuffle	bool	No	Shuffle file order within folder (default: False)
seed	int	No	Random seed for shuffling (default: 42)
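paths_file	str | None	No	JSON cache file for file paths and sizes (default: None)
recursive	bool	No	Whether to search subfolders for matching files (default: True)
fsize	int | None	No	Size of the file in bytes, if already known (FileDataset only; default: None)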
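
The idea behind paths_file can be sketched as a scan-once, reuse-later cache. Everything here is an assumption made for illustration: the helper name load_or_scan, the suffix-based file matching, and the cache layout (a JSON list of [path, size] pairs) may all differ from what the library writes, and the file locking that the class delegates to fasteners is left out.

```python
import json
import os
import tempfile
from pathlib import Path


def load_or_scan(folder: str, paths_file: str, suffix: str = ".ds") -> list[tuple[str, int]]:
    """Return (path, size_in_bytes) pairs, reusing the JSON cache when it exists."""
    if os.path.exists(paths_file):
        with open(paths_file) as f:
            return [(p, s) for p, s in json.load(f)]
    # Cache miss: walk the folder once and record every matching file with its size.
    entries = sorted((str(p), p.stat().st_size) for p in Path(folder).rglob(f"*{suffix}"))
    with open(paths_file, "w") as f:
        json.dump(entries, f)
    return entries


with tempfile.TemporaryDirectory() as root:
    data_dir = Path(root) / "data"
    data_dir.mkdir()
    (data_dir / "a.ds").write_bytes(b"\x00" * 8)
    (data_dir / "b.ds").write_bytes(b"\x00" * 4)
    cache = str(Path(root) / "paths.json")
    first = load_or_scan(str(data_dir), cache)   # scans the folder and writes the cache
    (data_dir / "c.ds").write_bytes(b"\x00" * 2)
    second = load_or_scan(str(data_dir), cache)  # served from the cache, so c.ds is not seen
    sizes = [size for _, size in first]
```

The second call shows the trade-off of any such cache: it saves a folder listing and one size lookup per file, but it goes stale if files are added after it is written.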

Outputs

Name	Type	Description
__getitem__ result	dict	Dictionary with 'input_ids' (torch.Tensor) and optionally 'positions' (torch.Tensor)
__len__ result	int	Total number of full sequence windows in the dataset
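
To make the optional 'positions' output concrete, the following sketch derives positions from an EOS token ID. It rests on stated assumptions rather than on the library's exact behaviour: the EOS token is treated as the last token of its document, the token after it restarts at position 0, and a window that begins mid-document also starts counting from 0. The function name positions_from_eos is hypothetical.

```python
def positions_from_eos(input_ids: list[int], eos_token_id: int) -> list[int]:
    """Count positions upward and restart at 0 after each EOS token."""
    positions = []
    pos = 0
    for token in input_ids:
        positions.append(pos)
        # The EOS token keeps the last position of its document;
        # the token after it starts a new document at position 0.
        pos = 0 if token == eos_token_id else pos + 1
    return positions


# Token ID 0 plays the role of EOS in this toy window.
example = positions_from_eos([5, 6, 0, 7, 8, 9, 0, 4], eos_token_id=0)
```

For the toy window above the result is [0, 1, 2, 0, 1, 2, 3, 0]: three documents, each counted from zero.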

Usage Examples

Basic Usage

from datatrove.utils.dataset import DatatroveFolderDataset
from torch.utils.data import DataLoader

# Create dataset from a folder of .ds files
dataset = DatatroveFolderDataset(
    data_folder="s3://my-bucket/tokenized-data/",
    seq_len=2048,
    token_size=2,
    return_positions=True,  # read from .index files, since positions_from_eos_token_id is not set
    shuffle=True,
    seed=42,
)

# Use with PyTorch DataLoader
dataloader = DataLoader(dataset, batch_size=8, num_workers=4)

for batch in dataloader:
    input_ids = batch["input_ids"]      # shape: (8, 2049), i.e. (batch_size, seq_len + 1)
    positions = batch["positions"]      # shape: (8, 2049), same layout as input_ids
    # ... training step ...

Related Pages

Principle:Huggingface_Datatrove_Tokenized_Dataset_Loading
