Implementation:Huggingface Datasets Info Utils
| Knowledge Sources | |
|---|---|
| Domains | Data_Verification, Integrity |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Utility functions for verifying dataset integrity through checksums and split size validation.
Description
This module provides functions that the dataset builder infrastructure uses internally to verify that downloaded files and generated splits match the expected metadata. The checks run during the download_and_prepare pipeline: checksum verification when VerificationMode is ALL_CHECKS, and split verification when it is BASIC_CHECKS or ALL_CHECKS.
Note: The VerificationMode enum defined in the same file is documented separately as Huggingface_Datasets_VerificationMode. This page covers the remaining utility functions.
The module provides four key functions:
- verify_checksums: Compares expected file checksums against recorded checksums, raising specific exceptions for missing files, unexpected files, or mismatched checksums.
- verify_splits: Compares expected split metadata against recorded splits, raising exceptions for missing splits, unexpected splits, or mismatched example counts.
- get_size_checksum_dict: Computes the file size and optionally the SHA-256 checksum for a given file path.
- is_small_dataset: Checks whether a dataset's size in bytes is below the IN_MEMORY_MAX_SIZE threshold, which determines whether the dataset should be loaded entirely into memory.
Usage
These functions are used internally by the dataset builder infrastructure and are not typically called directly by end users. They are invoked during dataset preparation to ensure data integrity and to make in-memory loading decisions.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/utils/info_utils.py
- Lines: 43-105 (functions documented here; lines 22-40 are VerificationMode, documented separately)
Signature
def verify_checksums(
    expected_checksums: Optional[dict],
    recorded_checksums: dict,
    verification_name=None,
):
    """Compare expected vs recorded file checksums."""

def verify_splits(
    expected_splits: Optional[dict],
    recorded_splits: dict,
):
    """Compare expected vs recorded split metadata."""

def get_size_checksum_dict(
    path: str,
    record_checksum: bool = True,
) -> dict:
    """Compute the file size and the sha256 checksum of a file."""

def is_small_dataset(dataset_size) -> bool:
    """Check if dataset_size is smaller than config.IN_MEMORY_MAX_SIZE."""
Import
from datasets.utils.info_utils import (
    get_size_checksum_dict,
    is_small_dataset,
    verify_checksums,
    verify_splits,
)
I/O Contract
verify_checksums
| Name | Type | Required | Description |
|---|---|---|---|
| expected_checksums | Optional[dict] | Yes | Mapping of URL/path to expected checksum dict. If None, verification is skipped with an info log. |
| recorded_checksums | dict | Yes | Mapping of URL/path to actual recorded checksum dict from the download. |
| verification_name | str | No | Optional name for the verification context, included in log/error messages. |
Raises:
- ExpectedMoreDownloadedFilesError -- if expected files are missing from recorded checksums.
- UnexpectedDownloadedFileError -- if recorded checksums contain files not in the expected set.
- NonMatchingChecksumError -- if any file's checksum does not match.
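The three failure cases above come down to two set differences and one per-key comparison. The following is a minimal standalone sketch of that logic, not the library's source: the exception classes are local stand-ins that only reuse the documented names, `compare_checksums` is a hypothetical helper, and the order of the checks and the boolean return value are assumptions made for illustration.

```python
from typing import Optional


# Local stand-ins that reuse the documented exception names.
# They are NOT imported from the datasets package.
class ExpectedMoreDownloadedFilesError(Exception):
    pass


class UnexpectedDownloadedFileError(Exception):
    pass


class NonMatchingChecksumError(Exception):
    pass


def compare_checksums(expected: Optional[dict], recorded: dict) -> bool:
    """Hypothetical helper: True if all checks passed, False if skipped."""
    if expected is None:
        # Nothing to verify against, so the check is skipped.
        return False
    missing = set(expected) - set(recorded)
    if missing:
        raise ExpectedMoreDownloadedFilesError(str(missing))
    unexpected = set(recorded) - set(expected)
    if unexpected:
        raise UnexpectedDownloadedFileError(str(unexpected))
    # Key sets are identical here, so every lookup below is safe.
    bad_urls = [url for url in expected if expected[url] != recorded[url]]
    if bad_urls:
        raise NonMatchingChecksumError(str(bad_urls))
    return True
```

Because the entries are compared as whole dicts in this sketch, a difference in either the size or the checksum value counts as a mismatch.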
verify_splits
| Name | Type | Required | Description |
|---|---|---|---|
| expected_splits | Optional[dict] | Yes | Mapping of split name to expected split info. If None, verification is skipped with an info log. |
| recorded_splits | dict | Yes | Mapping of split name to recorded split info from dataset generation. |
Raises:
- ExpectedMoreSplitsError -- if expected splits are missing from recorded splits.
- UnexpectedSplitsError -- if recorded splits contain names not in the expected set.
- NonMatchingSplitsSizesError -- if any split's num_examples does not match.
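The size check differs from the checksum check in that only the num_examples field of each split is compared. A rough sketch of how such mismatches could be collected is shown below. `SplitStub` and `find_size_mismatches` are hypothetical names introduced for illustration, and the sketch assumes split info objects expose `num_examples` as an attribute.

```python
from dataclasses import dataclass


@dataclass
class SplitStub:
    """Hypothetical stand-in for a split info object."""
    name: str
    num_examples: int


def find_size_mismatches(expected: dict, recorded: dict) -> list:
    """Return one entry per split present in both mappings whose counts differ."""
    return [
        {
            "split": name,
            "expected": expected[name].num_examples,
            "recorded": recorded[name].num_examples,
        }
        for name in expected
        if name in recorded
        and expected[name].num_examples != recorded[name].num_examples
    ]


expected = {"train": SplitStub("train", 1000), "test": SplitStub("test", 200)}
recorded = {"train": SplitStub("train", 1000), "test": SplitStub("test", 180)}
mismatches = find_size_mismatches(expected, recorded)
```

In this sketch a non-empty `mismatches` list is the condition under which a size error would be raised; missing or unexpected split names are a separate check and are ignored here.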
get_size_checksum_dict
| Name | Type | Required | Description |
|---|---|---|---|
| path | str | Yes | Path to the file to compute size and checksum for. |
| record_checksum | bool | No | Whether to compute the SHA-256 checksum. Defaults to True. If False, checksum is None. |
Returns: dict with keys "num_bytes" (int) and "checksum" (str or None).
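The contract above can be reproduced with the standard library alone. The sketch below is an illustrative re-implementation, not the library's code: `size_and_sha256` is a hypothetical name, and the 1 MiB chunk size is an assumption chosen so that large files are never read into memory at once.

```python
import hashlib
import os
import tempfile


def size_and_sha256(path: str, record_checksum: bool = True) -> dict:
    """Hypothetical helper returning the documented num_bytes/checksum dict."""
    checksum = None
    if record_checksum:
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            # Stream the file in 1 MiB chunks until read() returns b"".
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        checksum = digest.hexdigest()
    return {"num_bytes": os.path.getsize(path), "checksum": checksum}


# Demonstrate on a throwaway five-byte file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_path = tmp.name
with_checksum = size_and_sha256(demo_path)
without_checksum = size_and_sha256(demo_path, record_checksum=False)
os.remove(demo_path)
```

Skipping the checksum avoids reading the file entirely, since the size comes from file metadata; that is the practical reason to pass record_checksum=False for very large files.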
is_small_dataset
| Name | Type | Required | Description |
|---|---|---|---|
| dataset_size | int | Yes | The dataset size in bytes. |
Returns: bool -- True if dataset_size is less than config.IN_MEMORY_MAX_SIZE; False if either value is falsy.
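The falsy-value rule has a practical consequence: an unset or zero threshold makes the check fail for every size. The sketch below mirrors the documented return rule under stated assumptions. `fits_in_memory` is a hypothetical name, and the 250 MiB threshold is an invented value for illustration; the real threshold is read from config.IN_MEMORY_MAX_SIZE.

```python
# Hypothetical threshold for illustration only.
IN_MEMORY_MAX_SIZE = 250 * 1024 * 1024


def fits_in_memory(dataset_size, max_size=IN_MEMORY_MAX_SIZE) -> bool:
    """Mirror the documented rule: a strict comparison, False if either value is falsy."""
    if dataset_size and max_size:
        return dataset_size < max_size
    return False


small = fits_in_memory(1_000_000)                  # below the threshold
at_limit = fits_in_memory(IN_MEMORY_MAX_SIZE)      # equal is not "less than"
unknown_size = fits_in_memory(None)                # falsy size
disabled = fits_in_memory(1_000_000, max_size=0)   # falsy threshold
```

Only `small` is True in this sketch; the other three cases all evaluate to False.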
Usage Examples
Verifying Checksums
from datasets.utils.info_utils import verify_checksums
expected = {
    "https://example.com/data.csv": {"num_bytes": 1024, "checksum": "abc123..."},
}
recorded = {
    "https://example.com/data.csv": {"num_bytes": 1024, "checksum": "abc123..."},
}
# Passes silently if checksums match
verify_checksums(expected, recorded, verification_name="my_dataset")
Computing File Size and Checksum
from datasets.utils.info_utils import get_size_checksum_dict
result = get_size_checksum_dict("/path/to/file.parquet")
print(result) # {"num_bytes": 52428800, "checksum": "e3b0c44298fc1c14..."}
# Without checksum computation
result = get_size_checksum_dict("/path/to/file.parquet", record_checksum=False)
print(result) # {"num_bytes": 52428800, "checksum": None}
Checking Dataset Size
from datasets.utils.info_utils import is_small_dataset
# Returns True if the dataset fits in memory
if is_small_dataset(1_000_000):
    print("Small enough for in-memory loading")
Related Pages
Implements Principle
See Also
- Huggingface_Datasets_VerificationMode -- The VerificationMode enum defined in the same source file.