Implementation:Huggingface Datatrove TokenDataset

From Leeroopedia
Knowledge Sources
Domains: Data Loading, Machine Learning Training
Last Updated: 2026-02-14 17:00 GMT

Overview

DatatroveFileDataset and DatatroveFolderDataset are PyTorch Dataset implementations for reading tokenized data files (.ds format) produced by datatrove's tokenization pipeline, supporting both single-file and multi-file folder access.

Description

DatatroveFileDataset wraps a single .ds binary file containing pre-tokenized training data. It computes the number of full sequence windows from the file size, the token size, and the sequence length, then reads each window of tokens by seeking directly to its byte offset in the file. The class can optionally return positions, obtained in one of two ways: by reading document boundaries from a companion .index file (using a deque-based sequential reader that tracks document end positions), or by deriving them from occurrences of an end-of-sequence token ID within the token stream itself. Position tracking enables document-aware training, where position embeddings reset at document boundaries.
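The window arithmetic and seek-based read described above can be sketched with the standard library alone. This is an illustration, not the library's code: the helper names `num_windows` and `read_window` are hypothetical, the little-endian unsigned layout is an assumption, and the window width of seq_len + 1 tokens is inferred from the (8, 2049) shapes in the usage example on this page.

```python
import struct


def num_windows(fsize: int, seq_len: int, token_size: int = 2) -> int:
    """Count the complete windows in a file of `fsize` bytes.

    Assumption: each window holds seq_len + 1 tokens.
    """
    return fsize // (token_size * (seq_len + 1))


def read_window(path: str, index: int, seq_len: int, token_size: int = 2) -> list[int]:
    """Read window `index` by seeking straight to its byte offset."""
    # Assumption: tokens are little-endian unsigned ints (2 or 4 bytes each).
    fmt_char = "H" if token_size == 2 else "I"
    chunk_bytes = token_size * (seq_len + 1)
    with open(path, "rb") as f:
        f.seek(index * chunk_bytes)
        raw = f.read(chunk_bytes)
    n_tokens = len(raw) // token_size
    return list(struct.unpack(f"<{n_tokens}{fmt_char}", raw[: n_tokens * token_size]))
```

Because every window has the same byte width, locating a sample needs no index structure; a trailing partial window is simply excluded by the integer division.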

DatatroveFolderDataset aggregates multiple DatatroveFileDataset instances from a folder of .ds files. It uses cumulative length arrays and Python's bisect module for O(log n) index-to-file mapping, maintaining a current_file pointer optimized for sequential access. It supports file discovery via glob patterns, optional shuffling of file order, and caching of file paths and sizes to a JSON file (with file-locking support via fasteners) to avoid redundant filesystem operations across multiple workers.
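As a rough sketch of the index-to-file mapping, the class below keeps a cumulative array of per-file lengths, checks the current file first (the fast path for sequential reads), and falls back to a binary search otherwise. The class name `FolderIndex` and its attributes are hypothetical and chosen for illustration; only the use of cumulative lengths, `bisect`, and a current-file pointer comes from the description above.

```python
from bisect import bisect_right
from itertools import accumulate


class FolderIndex:
    """Map a global sample index to (file index, index within that file)."""

    def __init__(self, lengths: list[int]):
        # ends[i] is the number of samples in files 0..i combined.
        self.ends = list(accumulate(lengths))
        self.current_file = 0

    def locate(self, index: int) -> tuple[int, int]:
        if not self.ends or not 0 <= index < self.ends[-1]:
            raise IndexError(index)
        start = self.ends[self.current_file - 1] if self.current_file else 0
        # Fast path: sequential access usually stays within the current file.
        if not (start <= index < self.ends[self.current_file]):
            # Slow path: O(log n) binary search over the cumulative lengths.
            self.current_file = bisect_right(self.ends, index)
            start = self.ends[self.current_file - 1] if self.current_file else 0
        return self.current_file, index - start
```

Using `bisect_right` rather than `bisect_left` sends an index equal to a file's cumulative end to the next file, and it skips over zero-length files without extra handling.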

Both classes are conditionally defined only when PyTorch is available and are optimized for sequential reads (the data is pre-shuffled during tokenization), though random access is supported at reduced performance.

Usage

Use these datasets when training language models with PyTorch on data that has been tokenized and saved in datatrove's .ds binary format. They serve as the bridge between datatrove's data processing pipeline and PyTorch DataLoader.

Code Reference

Source Location

Signature

class DatatroveFileDataset(Dataset):
    def __init__(
        self,
        file_path: DataFileLike,
        seq_len: int,
        token_size: int = 2,
        return_positions: bool = False,
        positions_from_eos_token_id: int | None = None,
        fsize: int | None = None,
    ):

class DatatroveFolderDataset(Dataset):
    def __init__(
        self,
        data_folder: DataFolderLike,
        seq_len: int,
        filename_pattern: str = ".ds",
        recursive: bool = True,
        token_size: int = 2,
        shuffle: bool = False,
        seed: int = 42,
        return_positions: bool = False,
        positions_from_eos_token_id: int | None = None,
        paths_file: str | None = None,
    ):

Import

from datatrove.utils.dataset import DatatroveFileDataset, DatatroveFolderDataset

I/O Contract

Inputs

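| Name | Type | Required | Description |
|---|---|---|---|
| file_path | DataFileLike | Yes (FileDataset) | Path to a single .ds file (local, S3, or fsspec-supported) |
| data_folder | DataFolderLike | Yes (FolderDataset) | Path to folder containing .ds files |
| seq_len | int | Yes | Sequence length for each training sample |
| token_size | int | No | Bytes per token: 2 for vocabularies under 65k tokens, 4 for larger ones (default: 2) |
| return_positions | bool | No | Whether to return document-aware positions (default: False) |
| positions_from_eos_token_id | int \| None | No | EOS token ID used to compute positions; None reads boundaries from the .index file (default: None) |
| fsize | int \| None | No (FileDataset) | File size in bytes, if already known (default: None) |
| filename_pattern | str | No (FolderDataset) | Glob pattern for matching files in folder (default: ".ds") |
| recursive | bool | No (FolderDataset) | Whether to search subfolders recursively (default: True) |
| shuffle | bool | No (FolderDataset) | Shuffle file order within folder (default: False) |
| seed | int | No (FolderDataset) | Random seed for shuffling (default: 42) |
| paths_file | str \| None | No (FolderDataset) | JSON cache file for file paths and sizes (default: None) |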
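The idea behind paths_file can be illustrated with a small standard-library sketch: list the files and their sizes once, store them as JSON, and reuse that listing on later runs. The function name `load_or_build_paths`, the `"*.ds"` pattern, and the JSON layout (a list of [path, size] pairs) are assumptions made for this example, not the library's actual format. The file locking that the real implementation gets from fasteners is left out here.

```python
import json
import os
from pathlib import Path


def load_or_build_paths(folder, paths_file, pattern="*.ds"):
    """Return [(path, size_in_bytes), ...], reusing a JSON cache when present."""
    if os.path.exists(paths_file):
        # Cache hit: skip the filesystem walk entirely.
        with open(paths_file) as f:
            return [(path, size) for path, size in json.load(f)]
    # Cache miss: walk the folder recursively and record each file's size.
    entries = sorted((str(p), p.stat().st_size) for p in Path(folder).rglob(pattern))
    with open(paths_file, "w") as f:
        json.dump(entries, f)
    return entries
```

A cache like this goes stale when files are added or removed, so it needs to be deleted or rebuilt whenever the folder contents change.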

Outputs

| Name | Type | Description |
|---|---|---|
| __getitem__ result | dict | Dictionary with 'input_ids' (torch.Tensor) and optionally 'positions' (torch.Tensor) |
| __len__ result | int | Total number of full sequence windows in the dataset |
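To make the optional 'positions' output more concrete, here is a minimal sketch of deriving positions from an EOS token ID. It assumes the counter restarts at 0 on the token that follows an EOS and that each window starts at position 0. The real dataset may differ on both points, for example by carrying a document's position across window boundaries. The function name `positions_from_eos` is hypothetical.

```python
def positions_from_eos(input_ids: list[int], eos_token_id: int) -> list[int]:
    """Assign each token its offset within the current document."""
    positions = []
    pos = 0
    for token in input_ids:
        positions.append(pos)
        # Assumption: the EOS token closes its document, so the next token starts at 0.
        pos = 0 if token == eos_token_id else pos + 1
    return positions


# Two documents separated by EOS (id 0), followed by the start of a third.
print(positions_from_eos([5, 6, 0, 7, 8, 9, 0, 1], eos_token_id=0))
# prints [0, 1, 2, 0, 1, 2, 3, 0]
```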

Usage Examples

Basic Usage

from datatrove.utils.dataset import DatatroveFolderDataset
from torch.utils.data import DataLoader

# Create dataset from a folder of .ds files
dataset = DatatroveFolderDataset(
    data_folder="s3://my-bucket/tokenized-data/",
    seq_len=2048,
    token_size=2,
    return_positions=True,
    shuffle=True,
    seed=42,
)

# Use with PyTorch DataLoader
dataloader = DataLoader(dataset, batch_size=8, num_workers=4)

for batch in dataloader:
    input_ids = batch["input_ids"]      # shape: (8, 2049) = (batch_size, seq_len + 1)
    positions = batch["positions"]      # shape: (8, 2049), same layout as input_ids
    # ... training step ...
