Implementation:Huggingface Datasets Info Utils

Knowledge Sources	Huggingface Datasets HF Datasets Docs
Domains	Data_Verification, Integrity
Last Updated	2026-02-14 18:00 GMT

Overview

Utility functions for verifying dataset integrity through checksums and split size validation.

Description

This module provides functions used internally by the dataset builder infrastructure to verify that downloaded files and generated splits match expected metadata. They are invoked during the download_and_prepare pipeline according to the VerificationMode: verify_splits runs under ALL_CHECKS or BASIC_CHECKS, while in the upstream builder verify_checksums runs only under ALL_CHECKS.

Note: The VerificationMode enum defined in the same file is documented separately as Huggingface_Datasets_VerificationMode. This page covers the remaining utility functions.

The module provides four key functions:

verify_checksums: Compares expected file checksums against recorded checksums, raising specific exceptions for missing files, unexpected files, or mismatched checksums.
verify_splits: Compares expected split metadata against recorded splits, raising exceptions for missing splits, unexpected splits, or mismatched example counts.
get_size_checksum_dict: Computes the file size and optionally the SHA-256 checksum for a given file path.
is_small_dataset: Checks whether a dataset's size in bytes is below the IN_MEMORY_MAX_SIZE threshold, which determines whether the dataset should be loaded entirely into memory.

Usage

These functions are used internally by the dataset builder infrastructure and are not typically called directly by end users. They are invoked during dataset preparation to ensure data integrity and to make in-memory loading decisions.

Code Reference

Source Location

Repository: datasets
File: src/datasets/utils/info_utils.py
Lines: 43-105 (functions documented here; lines 22-40 are VerificationMode, documented separately)

Signature

def verify_checksums(
    expected_checksums: Optional[dict],
    recorded_checksums: dict,
    verification_name=None,
):
    """Compare expected vs recorded file checksums."""

def verify_splits(
    expected_splits: Optional[dict],
    recorded_splits: dict,
):
    """Compare expected vs recorded split metadata."""

def get_size_checksum_dict(
    path: str,
    record_checksum: bool = True,
) -> dict:
    """Compute the file size and the sha256 checksum of a file."""

def is_small_dataset(dataset_size) -> bool:
    """Check if dataset_size is smaller than config.IN_MEMORY_MAX_SIZE."""

Import

from datasets.utils.info_utils import (
    get_size_checksum_dict,
    is_small_dataset,
    verify_checksums,
    verify_splits,
)

I/O Contract

`verify_checksums`

Name	Type	Required	Description
expected_checksums	`Optional[dict]`	Yes	Mapping of URL/path to expected checksum dict. If None, verification is skipped with an info log.
recorded_checksums	`dict`	Yes	Mapping of URL/path to actual recorded checksum dict from the download.
verification_name	`Optional[str]`	No	Name of the verification context, included in log and error messages. Defaults to None.

Raises:

ExpectedMoreDownloadedFilesError -- if expected files are missing from recorded checksums.
UnexpectedDownloadedFileError -- if recorded checksums contain files not in expected set.
NonMatchingChecksumError -- if any file's recorded entry (size or checksum) differs from the expected entry.
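
The three failure cases listed above can be illustrated with a minimal standalone sketch. It does not import `datasets`: the function name `compare_checksums` and the string labels it returns are hypothetical stand-ins for the exceptions named above. It assumes that each entry is compared as a whole dict, so a differing `num_bytes` counts as a mismatch just like a differing `checksum`.

```python
from typing import Optional


def compare_checksums(expected: Optional[dict], recorded: dict) -> str:
    """Return a label for the first failed check, or "ok" (hypothetical helper)."""
    if expected is None:
        # No reference metadata: nothing to verify against.
        return "skipped"
    if set(expected) - set(recorded):
        # Stands in for ExpectedMoreDownloadedFilesError.
        return "expected_more_files"
    if set(recorded) - set(expected):
        # Stands in for UnexpectedDownloadedFileError.
        return "unexpected_file"
    bad_urls = [url for url in expected if expected[url] != recorded[url]]
    if bad_urls:
        # Stands in for NonMatchingChecksumError.
        return "non_matching_checksum"
    return "ok"


good = {"num_bytes": 1024, "checksum": "abc"}
wrong_size = {"num_bytes": 2048, "checksum": "abc"}

print(compare_checksums({"a.csv": good}, {"a.csv": good}))        # ok
print(compare_checksums({"a.csv": good}, {"a.csv": wrong_size}))  # non_matching_checksum
print(compare_checksums({"a.csv": good}, {}))                     # expected_more_files
```

The checks run in a fixed order, so a missing file is reported before an unexpected one, and entry contents are compared only once both key sets agree.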

`verify_splits`

Name	Type	Required	Description
expected_splits	`Optional[dict]`	Yes	Mapping of split name to expected split info. If None, verification is skipped with an info log.
recorded_splits	`dict`	Yes	Mapping of split name to recorded split info from dataset generation.

Raises:

ExpectedMoreSplitsError -- if expected splits are missing from recorded splits.
UnexpectedSplitsError -- if recorded splits contain names not in expected set.
NonMatchingSplitsSizesError -- if any split's num_examples does not match.
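
As a rough illustration of the three split discrepancies, the sketch below collects all of them into one report rather than raising on the first, which is a deliberate departure from the behaviour described above. `SplitStub` and `split_report` are hypothetical names; the only assumption carried over from this page is that split info objects expose a `num_examples` attribute.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SplitStub:
    """Hypothetical stand-in for a split info object; only num_examples is used."""
    num_examples: int


def split_report(expected: Optional[dict], recorded: dict) -> Optional[dict]:
    """Collect missing, unexpected, and size-mismatched split names."""
    if expected is None:
        return None  # nothing to verify against
    shared = set(expected) & set(recorded)
    return {
        "missing": sorted(set(expected) - set(recorded)),
        "unexpected": sorted(set(recorded) - set(expected)),
        "mismatched": sorted(
            name
            for name in shared
            if expected[name].num_examples != recorded[name].num_examples
        ),
    }


expected = {"train": SplitStub(100), "test": SplitStub(20)}
recorded = {"train": SplitStub(90), "validation": SplitStub(10)}
report = split_report(expected, recorded)
print(report)
# {'missing': ['test'], 'unexpected': ['validation'], 'mismatched': ['train']}
```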

`get_size_checksum_dict`

Name	Type	Required	Description
path	`str`	Yes	Path to the file to compute size and checksum for.
record_checksum	`bool`	No	Whether to compute the SHA-256 checksum. Defaults to True. If False, checksum is None.

Returns: dict with keys "num_bytes" (int) and "checksum" (str or None).
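
The contract above can be reproduced with the standard library alone. The following is a sketch under assumptions, not the library's code: `size_and_sha256` and its `chunk_size` parameter are hypothetical, and the file is read in chunks so that large files need not fit in memory.

```python
import hashlib
import os
import tempfile


def size_and_sha256(path: str, record_checksum: bool = True, chunk_size: int = 1 << 20) -> dict:
    """Return {"num_bytes": ..., "checksum": ...} for a file (hypothetical helper)."""
    checksum = None
    if record_checksum:
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            # Read fixed-size chunks until read() returns an empty bytes object.
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        checksum = digest.hexdigest()
    return {"num_bytes": os.path.getsize(path), "checksum": checksum}


# Demonstrate on a throwaway file containing five bytes.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.bin")
    with open(demo_path, "wb") as f:
        f.write(b"hello")
    with_checksum = size_and_sha256(demo_path)
    without_checksum = size_and_sha256(demo_path, record_checksum=False)

print(with_checksum["num_bytes"], len(with_checksum["checksum"]))
print(without_checksum)
```

Skipping the checksum avoids reading the file contents at all, since the size comes from file system metadata.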

`is_small_dataset`

Name	Type	Required	Description
dataset_size	`int`	Yes	The dataset size in bytes.

Returns: bool -- True if dataset_size is less than config.IN_MEMORY_MAX_SIZE; False if either value is falsy.
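
The "falsy" rule is easy to overlook, so here is a small standalone sketch of the comparison. `fits_in_memory` is a hypothetical name, and the threshold is passed as a parameter instead of being read from `config.IN_MEMORY_MAX_SIZE`. The threshold is understood to default to 0 upstream, which would make the check return False until a limit is configured; treat that default as an assumption to verify against the installed version.

```python
def fits_in_memory(dataset_size, max_in_memory_size) -> bool:
    """Mirror the documented contract: both values must be truthy (hypothetical helper)."""
    if dataset_size and max_in_memory_size:
        return dataset_size < max_in_memory_size
    # An unknown or zero size, or a zero/unset threshold, never counts as small.
    return False


print(fits_in_memory(1_000_000, 2_000_000))  # True
print(fits_in_memory(1_000_000, 0))          # False: threshold disabled
print(fits_in_memory(None, 2_000_000))       # False: size unknown
print(fits_in_memory(2_000_000, 2_000_000))  # False: comparison is strict
```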

Usage Examples

Verifying Checksums

from datasets.utils.info_utils import verify_checksums

expected = {
    "https://example.com/data.csv": {"num_bytes": 1024, "checksum": "abc123..."},
}
recorded = {
    "https://example.com/data.csv": {"num_bytes": 1024, "checksum": "abc123..."},
}

# Returns without raising when every expected entry matches its recorded entry
verify_checksums(expected, recorded, verification_name="my_dataset")

Computing File Size and Checksum

from datasets.utils.info_utils import get_size_checksum_dict

result = get_size_checksum_dict("/path/to/file.parquet")
print(result)  # e.g. {"num_bytes": 52428800, "checksum": "<64-character hex digest>"}

# Without checksum computation
result = get_size_checksum_dict("/path/to/file.parquet", record_checksum=False)
print(result)  # {"num_bytes": 52428800, "checksum": None}

Checking Dataset Size

from datasets.utils.info_utils import is_small_dataset

# True only when config.IN_MEMORY_MAX_SIZE is set to a non-zero value
# and the given size is strictly below it; otherwise False
if is_small_dataset(1_000_000):
    print("Small enough for in-memory loading")

Related Pages

Implements Principle

Principle:Huggingface_Datasets_Dataset_Integrity_Verification
