Principle:Huggingface Datasets Data Verification
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Verifying dataset integrity through checksums and split checks ensures that downloaded and generated data matches its expected specification.
Description
When datasets are downloaded from remote sources, multiple failure modes can compromise data integrity: partial downloads, corrupted transfers, mismatched file versions, or unintended modifications. Data Verification provides a systematic mechanism to detect these issues by comparing observed properties of the data against expected values recorded in the dataset's metadata.
The verification system operates at two levels:
- Download verification: Compares SHA-256 checksums and file sizes of downloaded files against expected values. This catches corrupted downloads, truncated files, and version mismatches.
- Split verification: Validates that the generated splits match expected split names and example counts. This catches issues in data generation scripts, such as missing splits or incorrect row counts.
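The download check described above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: the helper names `file_fingerprint` and `matches_expected` and the dictionary keys `checksum` and `num_bytes` are assumptions chosen for this example.

```python
import hashlib
import os


def file_fingerprint(path: str, chunk_size: int = 1 << 20) -> dict:
    """Return the SHA-256 hex digest and the size in bytes of a file.

    Hypothetical helper: reads the file in chunks so large downloads
    do not have to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return {"checksum": digest.hexdigest(), "num_bytes": os.path.getsize(path)}


def matches_expected(path: str, expected: dict) -> bool:
    """Return True only if both the checksum and the size match the expected record."""
    return file_fingerprint(path) == expected
```

A truncated file fails on both fields, while a file of the right size with altered bytes fails on the checksum only, which is why the two properties are compared together.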
The library provides configurable verification levels so users can balance thoroughness against speed. Full verification is most important for first-time dataset generation, while reduced verification is acceptable for cached datasets that have already been validated.
Usage
Use Data Verification when:
- You are downloading a dataset for the first time and want to ensure the download is complete and uncorrupted.
- You are generating dataset splits and want to validate the output against expected metadata.
- You are building a reproducible pipeline where data integrity is critical.
- You need to relax or skip verification for speed, for example during development or when working with datasets that lack pre-computed checksums.
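As a hedged illustration of these cases, the verification level can be chosen per call through the `verification_mode` argument of `load_dataset`. The wrapper `load_with_verification` and the `VALID_MODES` tuple below are hypothetical names introduced for this sketch; the three strings are assumed to be the lowercase forms of the verification modes this page describes. The sketch assumes the `datasets` library is installed when the function is actually called.

```python
VALID_MODES = ("all_checks", "basic_checks", "no_checks")


def load_with_verification(path: str, mode: str = "basic_checks"):
    """Load a dataset with an explicit verification level (hypothetical wrapper)."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    # Imported lazily so the mode check above works even where
    # the `datasets` library is not installed.
    from datasets import load_dataset

    return load_dataset(path, verification_mode=mode)
```

A first-time download in a reproducible pipeline might pass `mode="all_checks"`, while a quick development iteration might pass `mode="no_checks"`.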
Theoretical Basis
Verification is controlled by one of three modes, listed here from most to least thorough:
- ALL_CHECKS: Performs split verification (correct split names and example counts) and download verification (file checksums and sizes). This is the most thorough but slowest mode.
- BASIC_CHECKS: Performs split verification only, skipping the computationally expensive checksum validation of downloaded files. This is the default mode.
- NO_CHECKS: Skips all verification. Useful for development iteration or when working with trusted local data.
Pseudocode:
    if mode == ALL_CHECKS:
        verify_checksums(expected_checksums, recorded_checksums)
        verify_splits(expected_splits, recorded_splits)
    elif mode == BASIC_CHECKS:
        verify_splits(expected_splits, recorded_splits)
    elif mode == NO_CHECKS:
        pass  # skip all verification
Each verification step raises a specific exception type on failure (NonMatchingChecksumError, NonMatchingSplitsSizesError, etc.), enabling precise error handling.
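The dispatch and error handling described above can be made concrete with a small, self-contained sketch. Everything here is a local stand-in written for illustration: the enum, the two exception classes (named after the ones mentioned above), and the `run_verification` helper are assumptions, and the real library's classes may differ in hierarchy, messages, and module location. For brevity, a missing entry is reported through the same exception as a mismatched one.

```python
from enum import Enum


class VerificationMode(Enum):
    ALL_CHECKS = "all_checks"
    BASIC_CHECKS = "basic_checks"
    NO_CHECKS = "no_checks"


# Stand-in exceptions for this sketch only.
class NonMatchingChecksumError(Exception):
    pass


class NonMatchingSplitsSizesError(Exception):
    pass


def verify_checksums(expected: dict, recorded: dict) -> None:
    """Raise if any expected file record differs from (or is absent in) the recorded ones."""
    bad = sorted(url for url in expected if expected[url] != recorded.get(url))
    if bad:
        raise NonMatchingChecksumError(f"Checksum or size mismatch for: {bad}")


def verify_splits(expected: dict, recorded: dict) -> None:
    """Raise if any expected split's example count differs from the recorded count."""
    bad = sorted(name for name in expected if expected[name] != recorded.get(name))
    if bad:
        raise NonMatchingSplitsSizesError(f"Unexpected example counts for: {bad}")


def run_verification(mode, expected_checksums, recorded_checksums,
                     expected_splits, recorded_splits) -> None:
    """Apply the checks that the given mode calls for."""
    if mode is VerificationMode.NO_CHECKS:
        return  # skip all verification
    if mode is VerificationMode.ALL_CHECKS:
        verify_checksums(expected_checksums, recorded_checksums)
    # Both ALL_CHECKS and BASIC_CHECKS validate the generated splits.
    verify_splits(expected_splits, recorded_splits)


# Usage: a caller can react differently to each failure type.
try:
    run_verification(VerificationMode.BASIC_CHECKS, {}, {},
                     {"train": 100}, {"train": 90})
except NonMatchingChecksumError as err:
    print(f"Download check failed, consider re-downloading: {err}")
except NonMatchingSplitsSizesError as err:
    print(f"Split check failed, consider regenerating: {err}")
```

Keeping the two failure types separate lets a pipeline retry a download on a checksum error but stop and inspect the generation script on a split error.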