Implementation:Huggingface Datasets Info Utils

From Leeroopedia
Revision as of 12:59, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Huggingface_Datasets_Info_Utils.md)
Knowledge Sources
Domains: Data_Verification, Integrity
Last Updated: 2026-02-14 18:00 GMT

Overview

Utility functions for verifying dataset integrity through checksums and split size validation.

Description

This module provides functions that the dataset builder infrastructure uses internally to verify that downloaded files and generated splits match their expected metadata. They are invoked during the download_and_prepare pipeline depending on the VerificationMode: split verification runs under both BASIC_CHECKS and ALL_CHECKS, while checksum verification of downloaded files runs only under ALL_CHECKS.

Note: The VerificationMode enum defined in the same file is documented separately as Huggingface_Datasets_VerificationMode. This page covers the remaining utility functions.

The module provides four key functions:

  • verify_checksums: Compares expected file checksums against recorded checksums, raising specific exceptions for missing files, unexpected files, or mismatched checksums.
  • verify_splits: Compares expected split metadata against recorded splits, raising exceptions for missing splits, unexpected splits, or mismatched example counts.
  • get_size_checksum_dict: Computes the file size and optionally the SHA-256 checksum for a given file path.
  • is_small_dataset: Checks whether a dataset's size in bytes is below the IN_MEMORY_MAX_SIZE threshold, which determines whether the dataset should be loaded entirely into memory.

Usage

These functions are used internally by the dataset builder infrastructure and are not typically called directly by end users. They are invoked during dataset preparation to ensure data integrity and to make in-memory loading decisions.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/utils/info_utils.py
  • Lines: 43-105 (functions documented here; lines 22-40 are VerificationMode, documented separately)

Signature

def verify_checksums(
    expected_checksums: Optional[dict],
    recorded_checksums: dict,
    verification_name=None,
):
    """Compare expected vs recorded file checksums."""

def verify_splits(
    expected_splits: Optional[dict],
    recorded_splits: dict,
):
    """Compare expected vs recorded split metadata."""

def get_size_checksum_dict(
    path: str,
    record_checksum: bool = True,
) -> dict:
    """Compute the file size and the sha256 checksum of a file."""

def is_small_dataset(dataset_size) -> bool:
    """Check if dataset_size is smaller than config.IN_MEMORY_MAX_SIZE."""

Import

from datasets.utils.info_utils import (
    get_size_checksum_dict,
    is_small_dataset,
    verify_checksums,
    verify_splits,
)

I/O Contract

verify_checksums

Name | Type | Required | Description
expected_checksums | Optional[dict] | Yes | Mapping of URL/path to expected checksum dict. If None, verification is skipped with an info log.
recorded_checksums | dict | Yes | Mapping of URL/path to actual recorded checksum dict from the download.
verification_name | str | No | Optional name for the verification context, included in log/error messages.

Raises:

  • ExpectedMoreDownloadedFilesError -- if expected files are missing from recorded checksums.
  • UnexpectedDownloadedFileError -- if recorded checksums contain files not in expected set.
  • NonMatchingChecksumError -- if any file's recorded size/checksum entry does not match the expected entry.
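
The three failure cases above can be illustrated with a stand-alone sketch. This is not the library's code: `check_checksums_sketch` is a hypothetical helper that returns a status string instead of raising the exceptions listed above, and it assumes the checks run in the order listed and that each entry is compared as a whole dict.

```python
from typing import Optional


def check_checksums_sketch(expected: Optional[dict], recorded: dict) -> str:
    """Hypothetical illustration of the three checks; returns a status string."""
    if expected is None:
        return "skipped"  # nothing to compare against
    # Case 1: files that were expected but never recorded.
    missing = set(expected) - set(recorded)
    if missing:
        return f"missing files: {sorted(missing)}"
    # Case 2: files that were recorded but not expected.
    unexpected = set(recorded) - set(expected)
    if unexpected:
        return f"unexpected files: {sorted(unexpected)}"
    # Case 3 (assumption): each entry is compared as a whole dict.
    bad = [url for url in expected if expected[url] != recorded[url]]
    if bad:
        return f"mismatched entries: {sorted(bad)}"
    return "ok"


entry = {"num_bytes": 3, "checksum": "aa"}
print(check_checksums_sketch({"a": entry}, {"a": dict(entry)}))  # ok
```

Returning a string keeps the sketch free of any dependency on the library's exception classes; the real function signals each case by raising instead.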

verify_splits

Name | Type | Required | Description
expected_splits | Optional[dict] | Yes | Mapping of split name to expected split info. If None, verification is skipped with an info log.
recorded_splits | dict | Yes | Mapping of split name to recorded split info from dataset generation.

Raises:

  • ExpectedMoreSplitsError -- if expected splits are missing from recorded splits.
  • UnexpectedSplitsError -- if recorded splits contain names not in expected set.
  • NonMatchingSplitsSizesError -- if any split's num_examples does not match.
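
To make the size check concrete, here is a minimal sketch of comparing splits by num_examples only. `SplitRecord` and `mismatched_split_sizes` are hypothetical stand-ins introduced for illustration (the real function receives the library's own split info objects), and the sketch assumes both mappings already share the same keys.

```python
from dataclasses import dataclass


@dataclass
class SplitRecord:  # hypothetical stand-in for a split info object
    name: str
    num_examples: int
    num_bytes: int = 0


def mismatched_split_sizes(expected: dict, recorded: dict) -> list:
    """Return names of splits whose num_examples differ (keys assumed identical)."""
    return [
        name
        for name in expected
        if expected[name].num_examples != recorded[name].num_examples
    ]


expected = {"train": SplitRecord("train", 100), "test": SplitRecord("test", 20)}
recorded = {
    "train": SplitRecord("train", 100, num_bytes=5),  # differs only in num_bytes
    "test": SplitRecord("test", 19),                  # differs in num_examples
}
print(mismatched_split_sizes(expected, recorded))  # ['test']
```

In this sketch, a difference in other fields such as num_bytes does not count as a mismatch; only num_examples is compared.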

get_size_checksum_dict

Name | Type | Required | Description
path | str | Yes | Path to the file to compute size and checksum for.
record_checksum | bool | No | Whether to compute the SHA-256 checksum. Defaults to True. If False, checksum is None.

Returns: dict with keys "num_bytes" (int) and "checksum" (str or None).
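
A result of this shape can be produced with the standard library alone. The sketch below is an illustration under assumptions, not the module's implementation: `size_and_sha256` is a hypothetical name, and the 1 MiB chunk size is an arbitrary choice made here to avoid reading large files into memory at once.

```python
import hashlib
import os


def size_and_sha256(path: str, record_checksum: bool = True) -> dict:
    """Hypothetical helper returning {"num_bytes": int, "checksum": str or None}."""
    checksum = None
    if record_checksum:
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            # Read in 1 MiB chunks so large files are never fully held in memory.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        checksum = digest.hexdigest()
    return {"num_bytes": os.path.getsize(path), "checksum": checksum}
```

Skipping the digest with record_checksum=False avoids reading the file at all, which matters for multi-gigabyte downloads where only the size is needed.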

is_small_dataset

Name | Type | Required | Description
dataset_size | int | Yes | The dataset size in bytes.

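Returns: bool -- True if both dataset_size and config.IN_MEMORY_MAX_SIZE are truthy and dataset_size is less than config.IN_MEMORY_MAX_SIZE; False otherwise, including when either value is falsy (for example None or 0).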
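
The return rule can be sketched as a small pure function. `fits_in_memory` and its `max_size` parameter are hypothetical: `max_size` stands in for config.IN_MEMORY_MAX_SIZE so the behavior can be shown without importing the library.

```python
def fits_in_memory(dataset_size, max_size) -> bool:
    """Hypothetical mirror of the rule: strict comparison, False on falsy inputs."""
    if dataset_size and max_size:
        return dataset_size < max_size
    return False


print(fits_in_memory(10, 100))    # True: both set, size below threshold
print(fits_in_memory(100, 100))   # False: comparison is strict
print(fits_in_memory(None, 100))  # False: unknown size
print(fits_in_memory(10, 0))      # False: threshold unset or zero
```

A consequence of this rule is that a threshold of 0 never qualifies any dataset as small, regardless of its size.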

Usage Examples

Verifying Checksums

from datasets.utils.info_utils import verify_checksums

expected = {
    "https://example.com/data.csv": {"num_bytes": 1024, "checksum": "abc123..."},
}
recorded = {
    "https://example.com/data.csv": {"num_bytes": 1024, "checksum": "abc123..."},
}

# Passes silently if checksums match
verify_checksums(expected, recorded, verification_name="my_dataset")

Computing File Size and Checksum

from datasets.utils.info_utils import get_size_checksum_dict

result = get_size_checksum_dict("/path/to/file.parquet")
print(result)  # e.g. {'num_bytes': 52428800, 'checksum': '<64-character SHA-256 hex digest>'}

# Without checksum computation
result = get_size_checksum_dict("/path/to/file.parquet", record_checksum=False)
print(result)  # e.g. {'num_bytes': 52428800, 'checksum': None}

Checking Dataset Size

from datasets.utils.info_utils import is_small_dataset

# Returns True only if config.IN_MEMORY_MAX_SIZE is set to a non-zero value
# and the given size in bytes is below it
if is_small_dataset(1_000_000):
    print("Small enough for in-memory loading")

Related Pages

Implements Principle

See Also
