Implementation:Huggingface Datatrove CheckDataset

From Leeroopedia
Knowledge Sources
Domains Data Validation, Tokenization
Last Updated 2026-02-14 17:00 GMT

Overview

check_dataset is a command-line tool and library function that validates the integrity of tokenized datasets by verifying that EOS tokens appear at document boundaries and that `.ds`, `.ds.index`, and `.ds.loss` files are consistent.

Description

The check_dataset module provides a validation tool for tokenized datasets stored in Datatrove's binary format. It performs two primary checks: first, it verifies that the end-of-sequence (EOS) token is present at the end of each document as indicated by the `.ds.index` file; second, it verifies that the file sizes are consistent across the `.ds` data file, the `.ds.index` index file, and the optional `.ds.loss` loss mask file.
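The size-consistency check can be illustrated with a small sketch. It is not the library's code: the helper names `expected_sizes` and `check_sizes` are hypothetical, and the layout is assumed (2-byte tokens in `.ds`, one uint64 per document in `.ds.index` holding the cumulative token count, and one mask byte per token in `.ds.loss`).

```python
def expected_sizes(doc_ends: list[int], token_bytes: int = 2) -> dict[str, int]:
    """Byte size each file should have, given cumulative document-end positions.

    Assumed layout (not confirmed by this page): tokens are fixed-width,
    the index stores one uint64 per document, the loss mask one byte per token.
    """
    total_tokens = doc_ends[-1] if doc_ends else 0
    return {
        ".ds": total_tokens * token_bytes,
        ".ds.index": len(doc_ends) * 8,
        ".ds.loss": total_tokens,
    }


def check_sizes(actual: dict[str, int], doc_ends: list[int], token_bytes: int = 2) -> None:
    """Assert that every file size present in `actual` matches the expectation.

    A suffix missing from `actual` (for example an absent `.ds.loss`) is skipped.
    """
    expected = expected_sizes(doc_ends, token_bytes)
    for suffix, size in actual.items():
        assert size == expected[suffix], (
            f"{suffix}: expected {expected[suffix]} bytes, found {size}"
        )
```

For two documents ending at token positions 3 and 5, this sketch expects a 10-byte `.ds` file, a 16-byte `.ds.index` file, and, if present, a 5-byte `.ds.loss` file.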

The module includes helper functions load_doc_ends (which reads uint64 document-end positions from an index file) and load_dataset_bytes (which reads tokens one document at a time using the document-end positions). The main check_dataset function iterates through all matching file triples, loads the document ends from the index, reads tokens document-by-document from the binary data file, and asserts that the last token in each document matches the expected EOS token ID (resolved via the specified tokenizer). It also verifies that the data file is fully consumed after reading all documents.
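The document-by-document read described above can be sketched with the standard library alone. This is a simplified illustration working on in-memory bytes, not the module's implementation: the names `read_doc_ends`, `iter_documents`, and `count_valid_documents` are hypothetical, and little-endian uint16 tokens with token-count (not byte) offsets in the index are assumptions.

```python
import io
import struct

TOKEN_BYTES = 2  # assumption: each token is stored as a little-endian uint16


def read_doc_ends(index_bytes: bytes) -> list[int]:
    """Decode an index buffer into cumulative document-end positions (uint64 each)."""
    assert len(index_bytes) % 8 == 0, "index size is not a multiple of 8 bytes"
    count = len(index_bytes) // 8
    return list(struct.unpack(f"<{count}Q", index_bytes))


def iter_documents(data: io.BufferedIOBase, doc_ends: list[int]):
    """Yield the raw bytes of one document at a time, then require end of file."""
    start = 0
    for end in doc_ends:
        wanted = (end - start) * TOKEN_BYTES
        chunk = data.read(wanted)
        assert len(chunk) == wanted, "data file is truncated"
        yield chunk
        start = end
    assert data.read(1) == b"", "data file has bytes beyond the last document"


def count_valid_documents(data_bytes: bytes, index_bytes: bytes, eos_id: int) -> int:
    """Assert that every document ends with eos_id; return the document count."""
    checked = 0
    doc_ends = read_doc_ends(index_bytes)
    for doc in iter_documents(io.BytesIO(data_bytes), doc_ends):
        assert doc, "empty document"
        (last_token,) = struct.unpack("<H", doc[-TOKEN_BYTES:])
        assert last_token == eos_id, f"document {checked} does not end with EOS"
        checked += 1
    return checked
```

Reading one document at a time keeps memory use proportional to the longest document rather than to the whole file, which matters for multi-gigabyte token files.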

The tool can be invoked from the command line with a path to the dataset folder, an optional tokenizer name, and an optional EOS token string. It supports an optional chunk_size parameter for datasets that are chunked at fixed intervals rather than at document boundaries.

Usage

Use this tool after tokenization to validate that the generated binary dataset files are well-formed before using them for training. This catches corruption, truncation, and misalignment issues early.

Code Reference

Source Location

Signature

def check_dataset(
    input_folder: DataFolder,
    tokenizer: str = "gpt2",
    eos_token: str = "<|endoftext|>",
    chunk_size: int | None = None,
):

Import

from datatrove.tools.check_dataset import check_dataset

I/O Contract

Inputs

Name Type Required Description
input_folder DataFolder Yes Folder containing `.ds`, `.ds.index`, and optionally `.ds.loss` files
tokenizer str No Tokenizer name or path for resolving the EOS token ID (default: "gpt2")
eos_token str No EOS token string expected at the end of each document (default: "<|endoftext|>")
chunk_size int or None No If set, allows documents at chunk boundaries to skip the EOS check (default: None)

Outputs

Name Type Description
Validation result None or AssertionError Completes silently on success; raises AssertionError on any integrity violation

Usage Examples

Basic Usage

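from datatrove.io import get_datafolder
from datatrove.tools.check_dataset import check_dataset

# Wrap the dataset path in a DataFolder.
input_folder = get_datafolder("path/to/tokenized_data/")
# Completes silently on success; raises AssertionError on any integrity violation.
check_dataset(input_folder, tokenizer="gpt2", eos_token="<|endoftext|>")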

Command Line Usage

python -m datatrove.tools.check_dataset /path/to/tokenized_data -t gpt2 --eos "<|endoftext|>"
