Implementation:Huggingface Datatrove TokenDataset
| Knowledge Sources | |
|---|---|
| Domains | Data Loading, Machine Learning Training |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
DatatroveFileDataset and DatatroveFolderDataset are PyTorch Dataset implementations for reading tokenized data files (.ds format) produced by datatrove's tokenization pipeline, supporting both single-file and multi-file folder access.
Description
DatatroveFileDataset wraps a single .ds binary file containing pre-tokenized training data. It computes the number of full sequence windows from the file size and token size, then reads token chunks via direct file seeking. The class supports optional position tracking through two methods: reading document boundaries from a companion .index file (using a deque-based sequential reader that tracks document end positions), or computing positions from an end-of-sequence token ID within the token stream itself. Position tracking enables document-aware training where position embeddings reset at document boundaries.
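The window arithmetic and seek-based read described above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not datatrove's own code: the helper names `count_windows` and `read_window` are hypothetical, tokens are assumed to be little-endian unsigned integers, and each window is assumed to hold `seq_len + 1` tokens (which matches the 2049-token shapes shown in the usage example for `seq_len=2048`).

```python
import os
import struct
import tempfile


def count_windows(fsize: int, seq_len: int, token_size: int = 2) -> int:
    """Hypothetical helper: number of full (seq_len + 1)-token windows in fsize bytes."""
    num_tokens = fsize // token_size
    return num_tokens // (seq_len + 1)


def read_window(path: str, index: int, seq_len: int, token_size: int = 2) -> list[int]:
    """Hypothetical helper: seek straight to window `index` and decode its tokens."""
    fmt = "H" if token_size == 2 else "I"  # assumed uint16 or uint32 storage
    chunk_size = token_size * (seq_len + 1)
    with open(path, "rb") as f:
        f.seek(index * chunk_size)
        data = f.read(chunk_size)
    return list(struct.unpack(f"<{seq_len + 1}{fmt}", data))


# Demo: write ten 2-byte tokens (0..9) to a temporary file, then read window 1.
with tempfile.NamedTemporaryFile(suffix=".ds", delete=False) as tmp:
    tmp.write(struct.pack("<10H", *range(10)))
    demo_path = tmp.name

n_windows = count_windows(os.path.getsize(demo_path), seq_len=3)
second_window = read_window(demo_path, index=1, seq_len=3)
os.remove(demo_path)
```

Because every window has the same byte length, the offset of any window is a single multiplication, which is what makes both sequential and random access possible without an index over the token stream.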
DatatroveFolderDataset aggregates multiple DatatroveFileDataset instances from a folder of .ds files. It uses cumulative length arrays and Python's bisect module to map a global sample index to a file in O(log n) time, where n is the number of files, and it maintains a current_file pointer so that sequential access does not repeat the search. It supports file discovery via glob patterns, optional shuffling of file order, and caching of file paths and sizes to a JSON file (with file-locking support via fasteners) to avoid redundant filesystem operations across multiple workers.
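The index-to-file mapping can be illustrated with `bisect` over a running total of per-file lengths. This is only a sketch of the general technique; the function names `build_cumulative` and `locate` are hypothetical and do not come from datatrove.

```python
from bisect import bisect_right
from itertools import accumulate


def build_cumulative(lengths: list[int]) -> list[int]:
    """Hypothetical helper: running total of samples per file, e.g. [3, 5, 2] -> [3, 8, 10]."""
    return list(accumulate(lengths))


def locate(index: int, cumulative: list[int]) -> tuple[int, int]:
    """Hypothetical helper: map a global sample index to (file_number, index_within_file)."""
    if not 0 <= index < cumulative[-1]:
        raise IndexError(index)
    # First file whose cumulative length is strictly greater than `index`.
    file_no = bisect_right(cumulative, index)
    start = cumulative[file_no - 1] if file_no > 0 else 0
    return file_no, index - start


cumulative = build_cumulative([3, 5, 2])
first = locate(0, cumulative)
boundary = locate(3, cumulative)
last = locate(9, cumulative)
```

A sequential reader can skip the binary search most of the time by remembering the file that served the previous index and checking that file's range first.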
Both classes are conditionally defined only when PyTorch is available and are optimized for sequential reads (the data is pre-shuffled during tokenization), though random access is supported at reduced performance.
Usage
Use these datasets when training language models with PyTorch on data that has been tokenized and saved in datatrove's .ds binary format. They serve as the bridge between datatrove's data processing pipeline and PyTorch DataLoader.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/utils/dataset.py
- Lines: 1-352
Signature
class DatatroveFileDataset(Dataset):
    def __init__(
        self,
        file_path: DataFileLike,
        seq_len: int,
        token_size: int = 2,
        return_positions: bool = False,
        positions_from_eos_token_id: int | None = None,
        fsize: int | None = None,
    ):

class DatatroveFolderDataset(Dataset):
    def __init__(
        self,
        data_folder: DataFolderLike,
        seq_len: int,
        filename_pattern: str = ".ds",
        recursive: bool = True,
        token_size: int = 2,
        shuffle: bool = False,
        seed: int = 42,
        return_positions: bool = False,
        positions_from_eos_token_id: int | None = None,
        paths_file: str | None = None,
    ):
Import
from datatrove.utils.dataset import DatatroveFileDataset, DatatroveFolderDataset
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| file_path | DataFileLike | Yes (FileDataset) | Path to a single .ds file (local, S3, or fsspec-supported) |
| data_folder | DataFolderLike | Yes (FolderDataset) | Path to folder containing .ds files |
| seq_len | int | Yes | Sequence length for each training sample |
| token_size | int | No | Bytes per token: 2 when every token ID fits in an unsigned 16-bit integer (vocabulary of at most 65,536 entries), 4 for larger vocabularies (default: 2) |
| return_positions | bool | No | Whether to return document-aware positions (default: False) |
| positions_from_eos_token_id | int or None | No | EOS token ID used to compute positions from the token stream; if None, document boundaries are read from the companion .index file (default: None) |
| filename_pattern | str | No | Glob pattern for matching files in folder (default: ".ds") |
| shuffle | bool | No | Shuffle file order within folder (default: False) |
| seed | int | No | Random seed for shuffling (default: 42) |
| paths_file | str | No | JSON cache file for file paths and sizes |
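Choosing `token_size` comes down to whether the largest token ID fits in two bytes. The helper below is a hypothetical illustration, not part of datatrove, and assumes token IDs are stored as unsigned integers starting at 0.

```python
def token_size_for_vocab(vocab_size: int) -> int:
    """Hypothetical helper: smallest supported byte width that holds every token ID."""
    if vocab_size <= 0:
        raise ValueError("vocab_size must be positive")
    max_id = vocab_size - 1
    if max_id <= 0xFFFF:        # fits in an unsigned 16-bit integer
        return 2
    if max_id <= 0xFFFFFFFF:    # fits in an unsigned 32-bit integer
        return 4
    raise ValueError("vocabulary too large for 4-byte tokens")


small_vocab = token_size_for_vocab(50_257)
large_vocab = token_size_for_vocab(128_256)
```

The value passed to the dataset has to match the width used when the files were written; reading 4-byte tokens with `token_size=2` would silently decode garbage rather than fail.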
Outputs
| Name | Type | Description |
|---|---|---|
| __getitem__ result | dict | Dictionary with 'input_ids' (torch.Tensor) and, when return_positions is True, 'positions' (torch.Tensor) |
| __len__ result | int | Total number of full sequence windows in the dataset |
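To make the 'positions' output concrete, the sketch below derives per-token positions from an EOS token ID. It is an illustration of the idea only: the function name is hypothetical, it assumes the EOS token belongs to the document it terminates, and it assumes counting starts at 0 at the beginning of the window. The library's handling of windows that begin mid-document may differ.

```python
def positions_from_eos(input_ids: list[int], eos_token_id: int) -> list[int]:
    """Hypothetical helper: position of each token within its document."""
    positions = []
    pos = 0
    for tok in input_ids:
        positions.append(pos)
        # After an EOS token the next token starts a new document at position 0.
        pos = 0 if tok == eos_token_id else pos + 1
    return positions


# Token 0 plays the role of EOS in this toy window.
example = positions_from_eos([5, 6, 0, 7, 8, 9, 0, 4], eos_token_id=0)
```

Feeding these values to the model as position IDs restarts the position embedding at each document boundary instead of counting across the whole packed window.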
Usage Examples
Basic Usage
from datatrove.utils.dataset import DatatroveFolderDataset
from torch.utils.data import DataLoader

# Create dataset from a folder of .ds files
dataset = DatatroveFolderDataset(
    data_folder="s3://my-bucket/tokenized-data/",
    seq_len=2048,
    token_size=2,
    return_positions=True,
    shuffle=True,  # shuffles the order of files, not individual samples
    seed=42,
)

# Use with PyTorch DataLoader
dataloader = DataLoader(dataset, batch_size=8, num_workers=4)

for batch in dataloader:
    input_ids = batch["input_ids"]  # shape: (8, 2049), i.e. seq_len + 1 tokens per sample
    positions = batch["positions"]  # shape: (8, 2049)
    # ... training step ...
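Each sample in the example carries `seq_len + 1` tokens. A common way to use the extra token, assumed here rather than stated by the source, is to shift the window by one to obtain next-token prediction targets. The sketch uses plain lists and a hypothetical function name so that it runs without PyTorch.

```python
def split_inputs_labels(window: list[int]) -> tuple[list[int], list[int]]:
    """Hypothetical helper: split a (seq_len + 1)-token window into inputs and shifted labels."""
    if len(window) < 2:
        raise ValueError("window needs at least two tokens")
    inputs = window[:-1]   # tokens 0 .. seq_len - 1
    labels = window[1:]    # tokens 1 .. seq_len, the next token for each input
    return inputs, labels


inputs, labels = split_inputs_labels([10, 11, 12, 13, 14])
```

On a batched tensor the same split is the usual slicing along the sequence dimension, for example `input_ids[:, :-1]` for the inputs and `input_ids[:, 1:]` for the labels.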