Implementation:Huggingface Datatrove CheckDataset

Knowledge Sources	Huggingface_Datatrove
Domains	Data Validation, Tokenization
Last Updated	2026-02-14 17:00 GMT

Overview

check_dataset is a command-line tool and library function that validates the integrity of tokenized datasets by verifying that EOS tokens appear at document boundaries and that `.ds`, `.ds.index`, and `.ds.loss` files are consistent.

Description

The check_dataset module provides a validation tool for tokenized datasets stored in Datatrove's binary format. It performs two primary checks: first, it verifies that the end-of-sequence (EOS) token is present at the end of each document as indicated by the `.ds.index` file; second, it verifies that the file sizes are consistent across the `.ds` data file, the `.ds.index` index file, and the optional `.ds.loss` loss mask file.
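The size relationship behind the second check can be sketched with the standard library alone. This is an illustrative sketch, not the module's code: `sizes_consistent` is a hypothetical helper, and it assumes little-endian uint64 index entries holding cumulative token counts, 2-byte tokens, and a loss mask of one byte per token.

```python
import struct

INDEX_ENTRY_BYTES = 8  # one uint64 document-end position per document


def sizes_consistent(ds_size, index_bytes, loss_size=None, bytes_per_token=2):
    """Hypothetical helper: do the file sizes describe the same number of tokens?"""
    # A valid index is a whole number of uint64 entries.
    if len(index_bytes) % INDEX_ENTRY_BYTES != 0:
        return False
    count = len(index_bytes) // INDEX_ENTRY_BYTES
    doc_ends = struct.unpack(f"<{count}Q", index_bytes)
    # Assumption: the last document end equals the total token count.
    total_tokens = doc_ends[-1] if doc_ends else 0
    if ds_size != total_tokens * bytes_per_token:
        return False
    # Assumption: the optional loss mask stores one byte per token.
    if loss_size is not None and loss_size != total_tokens:
        return False
    return True


# Two documents ending at token positions 3 and 5, so 5 tokens in total.
index = struct.pack("<2Q", 3, 5)
ok = sizes_consistent(ds_size=10, index_bytes=index, loss_size=5)
truncated = sizes_consistent(ds_size=8, index_bytes=index, loss_size=5)
```

Here `ok` is true because 5 tokens at 2 bytes each occupy 10 bytes, while `truncated` is false because the data file is one token short.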

The module includes helper functions load_doc_ends (which reads uint64 document-end positions from an index file) and load_dataset_bytes (which reads tokens one document at a time using the document-end positions). The main check_dataset function iterates through all matching file triples, loads the document ends from the index, reads tokens document-by-document from the binary data file, and asserts that the last token in each document matches the expected EOS token ID (resolved via the specified tokenizer). It also verifies that the data file is fully consumed after reading all documents.
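The document-by-document EOS check described above can be sketched as follows. This is a simplified illustration rather than the module's implementation: `iter_documents` and `docs_missing_eos` are hypothetical names, tokens are assumed to be little-endian 2-byte integers, and the sketch returns the offending document numbers instead of asserting as the real tool does.

```python
import struct

TOKEN_BYTES = 2  # assumption: tokens are stored as little-endian uint16


def iter_documents(data, doc_ends):
    """Yield each document's raw bytes, given cumulative end positions in tokens."""
    start = 0
    for end in doc_ends:
        chunk = data[start * TOKEN_BYTES : end * TOKEN_BYTES]
        if len(chunk) != (end - start) * TOKEN_BYTES:
            raise ValueError("data is shorter than the index claims")
        yield chunk
        start = end
    # After the last document, the data must be fully consumed.
    if len(data) != start * TOKEN_BYTES:
        raise ValueError("data has trailing bytes not covered by the index")


def docs_missing_eos(data, index_bytes, eos_id):
    """Return the numbers of documents whose last token is not eos_id."""
    count = len(index_bytes) // 8
    doc_ends = struct.unpack(f"<{count}Q", index_bytes[: count * 8])
    missing = []
    for doc_number, doc in enumerate(iter_documents(data, doc_ends)):
        # An empty document has no last token, so it counts as missing EOS.
        last = struct.unpack("<H", doc[-TOKEN_BYTES:])[0] if doc else None
        if last != eos_id:
            missing.append(doc_number)
    return missing


index = struct.pack("<2Q", 3, 5)             # documents end at tokens 3 and 5
good = struct.pack("<5H", 10, 11, 0, 12, 0)  # token 0 plays the role of EOS
bad = struct.pack("<5H", 10, 11, 0, 12, 13)  # second document lacks EOS
```

With these inputs, `docs_missing_eos(good, index, 0)` returns an empty list and `docs_missing_eos(bad, index, 0)` returns `[1]`.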

The tool can be invoked from the command line with a path to the dataset folder, an optional tokenizer name, and an optional EOS token string. It supports an optional chunk_size parameter for datasets that are chunked at fixed intervals rather than at document boundaries.
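One plausible reading of the chunk_size behaviour is sketched below. The exact rule is an assumption, not taken from the source: `needs_eos_check` is a hypothetical helper that treats any document end landing on a multiple of chunk_size as a chunk boundary, where a document may have been cut before its EOS token.

```python
def needs_eos_check(doc_end, chunk_size=None):
    """Hypothetical rule: skip the EOS check for ends that fall on a chunk boundary."""
    if chunk_size is None:
        # Without chunking, every document is expected to end with EOS.
        return True
    # Assumption: a document end that is a multiple of chunk_size is a chunk cut.
    return doc_end % chunk_size != 0


doc_ends = [1500, 2048, 3000, 4096]
checked = [end for end in doc_ends if needs_eos_check(end, chunk_size=2048)]
```

In this example only the ends at 1500 and 3000 would be checked for EOS; the ends at 2048 and 4096 coincide with chunk boundaries and are exempt.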

Usage

Use this tool after tokenization to validate that the generated binary dataset files are well-formed before using them for training. This catches corruption, truncation, and misalignment issues early.

Code Reference

Source Location

Repository: Huggingface_Datatrove
File: src/datatrove/tools/check_dataset.py
Lines: 1-109

Signature

def check_dataset(
    input_folder: DataFolder,
    tokenizer: str = "gpt2",
    eos_token: str = "<|endoftext|>",
    chunk_size: int | None = None,
):

Import

from datatrove.tools.check_dataset import check_dataset

I/O Contract

Inputs

Name	Type	Required	Description
input_folder	DataFolder	Yes	Folder containing `.ds`, `.ds.index`, and optionally `.ds.loss` files
tokenizer	str	No	Tokenizer name or path for resolving the EOS token ID (default: "gpt2")
eos_token	str	No	EOS token string expected at the end of each document (default: "<|endoftext|>")
chunk_size	int or None	No	If set, allows documents at chunk boundaries to skip the EOS check (default: None)

Outputs

Name	Type	Description
Validation result	None or AssertionError	Completes silently on success; raises AssertionError on any integrity violation

Usage Examples

Basic Usage

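from datatrove.io import get_datafolder
from datatrove.tools.check_dataset import check_dataset

# Wrap the dataset path in a DataFolder.
input_folder = get_datafolder("path/to/tokenized_data/")
# Completes silently on success; raises AssertionError on any integrity violation.
check_dataset(input_folder, tokenizer="gpt2", eos_token="<|endoftext|>")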

Command Line Usage

python -m datatrove.tools.check_dataset /path/to/tokenized_data -t gpt2 --eos "<|endoftext|>"

Related Pages

Principle:Huggingface_Datatrove_Dataset_Integrity_Validation
