Implementation:Huggingface Datatrove CheckDataset
| Knowledge Sources | |
|---|---|
| Domains | Data Validation, Tokenization |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
check_dataset is a command-line tool and library function that validates the integrity of tokenized datasets by verifying that EOS tokens appear at document boundaries and that `.ds`, `.ds.index`, and `.ds.loss` files are consistent.
Description
The check_dataset module provides a validation tool for tokenized datasets stored in Datatrove's binary format. It performs two primary checks: first, it verifies that the end-of-sequence (EOS) token is present at the end of each document as indicated by the `.ds.index` file; second, it verifies that the file sizes are consistent across the `.ds` data file, the `.ds.index` index file, and the optional `.ds.loss` loss mask file.
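The size-consistency check can be pictured with a small sketch. It is not the module's code: the helper names `expected_sizes` and `find_size_mismatches` are hypothetical, and the 2-bytes-per-token and 1-loss-byte-per-token widths are assumptions. Only the uint64 (8-byte) index entries come from the description on this page.

```python
def expected_sizes(doc_ends, bytes_per_token=2, has_loss=False):
    """Return the byte sizes implied by a list of document-end positions."""
    total_tokens = doc_ends[-1] if doc_ends else 0
    sizes = {
        ".ds": total_tokens * bytes_per_token,  # assumed token width
        ".ds.index": len(doc_ends) * 8,         # one uint64 per document
    }
    if has_loss:
        sizes[".ds.loss"] = total_tokens        # assumed: one byte per token
    return sizes


def find_size_mismatches(actual_sizes, doc_ends, bytes_per_token=2):
    """List the extensions whose actual size differs from the implied size."""
    expected = expected_sizes(
        doc_ends, bytes_per_token, has_loss=".ds.loss" in actual_sizes
    )
    return [ext for ext, size in expected.items() if actual_sizes.get(ext) != size]


doc_ends = [3, 5]  # two documents, five tokens in total
print(find_size_mismatches({".ds": 10, ".ds.index": 16, ".ds.loss": 5}, doc_ends))
print(find_size_mismatches({".ds": 8, ".ds.index": 16}, doc_ends))
```

In this sketch a truncated `.ds` file shows up as a mismatch because the last document end in the index implies more tokens than the data file holds.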
The module includes two helper functions: load_doc_ends, which reads uint64 document-end positions from an index file, and load_dataset_bytes, which reads tokens one document at a time using those positions. The main check_dataset function iterates over each `.ds` file together with its `.ds.index` file and, when present, its `.ds.loss` file. For each group it loads the document ends from the index, reads the binary data file document by document, and asserts that the last token of each document matches the expected EOS token ID (resolved via the specified tokenizer). It also verifies that the data file is fully consumed after all documents have been read.
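The document-by-document EOS check described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the module's implementation: `read_doc_ends`, `iter_documents`, and `check_eos` are hypothetical names, tokens are assumed to be little-endian unsigned integers of 2 bytes each, and in-memory streams stand in for the real files.

```python
import io
import struct


def read_doc_ends(index_file):
    """Read every uint64 document-end position from a binary index stream."""
    raw = index_file.read()
    count = len(raw) // 8
    return list(struct.unpack(f"<{count}Q", raw[: count * 8]))


def iter_documents(data_file, doc_ends, bytes_per_token=2):
    """Yield the raw bytes of one document at a time, then require EOF."""
    start = 0
    for end in doc_ends:
        n_bytes = (end - start) * bytes_per_token
        chunk = data_file.read(n_bytes)
        assert len(chunk) == n_bytes, "data file is shorter than the index claims"
        yield chunk
        start = end
    assert data_file.read(1) == b"", "data file has bytes beyond the last document"


def check_eos(data_file, index_file, eos_id, bytes_per_token=2):
    """Assert that the last token of every document equals eos_id."""
    fmt = "<H" if bytes_per_token == 2 else "<I"  # assumed token encodings
    doc_ends = read_doc_ends(index_file)
    n_docs = 0
    for doc in iter_documents(data_file, doc_ends, bytes_per_token):
        (last_token,) = struct.unpack(fmt, doc[-bytes_per_token:])
        assert last_token == eos_id, f"document {n_docs} does not end with EOS"
        n_docs += 1
    return n_docs


# Two tiny documents, each ending with 50256 (the GPT-2 <|endoftext|> id).
data = io.BytesIO(struct.pack("<5H", 11, 22, 50256, 33, 50256))
index = io.BytesIO(struct.pack("<2Q", 3, 5))
print(check_eos(data, index, eos_id=50256))
```

The final read of one extra byte is what enforces the "fully consumed" condition: any trailing data after the last indexed document fails the check.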
The tool can be invoked from the command line with a path to the dataset folder, an optional tokenizer name, and an optional EOS token string. It supports an optional chunk_size parameter for datasets that are chunked at fixed intervals rather than at document boundaries.
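One way to read the chunk_size relaxation is shown below. The function `eos_check_passes` is hypothetical, and the rule that a chunk boundary is any document end that is a multiple of chunk_size is an assumption made for illustration; the actual condition in the module may differ.

```python
def eos_check_passes(last_token, doc_end, eos_id, chunk_size=None):
    """Return True if a document end is acceptable under the assumed rule."""
    if last_token == eos_id:
        return True
    # Assumed: an end position on a multiple of chunk_size is a chunk cut,
    # so the missing EOS is tolerated there.
    return chunk_size is not None and doc_end % chunk_size == 0


print(eos_check_passes(11, 8, eos_id=50256, chunk_size=4))  # end on a boundary
print(eos_check_passes(11, 7, eos_id=50256, chunk_size=4))  # end inside a chunk
```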
Usage
Use this tool after tokenization to validate that the generated binary dataset files are well-formed before using them for training. This catches corruption, truncation, and misalignment issues early.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/tools/check_dataset.py
- Lines: 1-109
Signature
def check_dataset(
    input_folder: DataFolder,
    tokenizer: str = "gpt2",
    eos_token: str = "<|endoftext|>",
    chunk_size: int | None = None,
):
Import
from datatrove.tools.check_dataset import check_dataset
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input_folder | DataFolder | Yes | Folder containing `.ds`, `.ds.index`, and optionally `.ds.loss` files |
| tokenizer | str | No | Tokenizer name or path for resolving the EOS token ID (default: "gpt2") |
| eos_token | str | No | EOS token string expected at the end of each document (default: "<\|endoftext\|>") |
| chunk_size | int or None | No | If set, allows documents at chunk boundaries to skip the EOS check (default: None) |
Outputs
| Name | Type | Description |
|---|---|---|
| Validation result | None or AssertionError | Completes silently on success; raises AssertionError on any integrity violation |
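Since failures surface as AssertionError, a caller that wants a result instead of an exception can wrap the call. The sketch below uses a hypothetical `run_check` helper and a stand-in `fake_check` function so that it runs without datatrove installed.

```python
def run_check(check_fn, *args, **kwargs):
    """Call a validation function and turn AssertionError into a result tuple."""
    try:
        check_fn(*args, **kwargs)
    except AssertionError as err:
        return False, str(err) or "assertion failed without a message"
    return True, ""


def fake_check(ok):
    # Stand-in for check_dataset; raises the same exception type on failure.
    assert ok, "document 0 does not end with EOS"


print(run_check(fake_check, True))
print(run_check(fake_check, False))
```

With datatrove available, the equivalent call would be `run_check(check_dataset, input_folder, tokenizer="gpt2")`. If the checks are plain `assert` statements, running Python with the `-O` flag would disable them, so validation should be run without it.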
Usage Examples
Basic Usage
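from datatrove.io import get_datafolder
from datatrove.tools.check_dataset import check_dataset

# Wrap the dataset path in a DataFolder.
input_folder = get_datafolder("path/to/tokenized_data/")
# Completes silently on success; raises AssertionError on an integrity violation.
check_dataset(input_folder, tokenizer="gpt2", eos_token="<|endoftext|>")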
Command Line Usage
python -m datatrove.tools.check_dataset /path/to/tokenized_data -t gpt2 --eos "<|endoftext|>"