Implementation:Huggingface Datasets VerificationMode

From Leeroopedia
Revision as of 13:00, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Huggingface_Datasets_VerificationMode.md)
Knowledge Sources
Domains: Data_Engineering, NLP
Last Updated: 2026-02-14 18:00 GMT

Overview

An enum, provided by the Hugging Face Datasets library, that selects which integrity checks (file checksums and split checks) run when a dataset is downloaded and generated.

Description

VerificationMode is a Python enum.Enum with three members that control the level of integrity checking performed when downloading and generating datasets. It is passed to functions like load_dataset via the verification_mode parameter. The three modes range from no verification (fastest) to full checksum and split validation (most thorough). The default mode is BASIC_CHECKS, which validates splits without computing file checksums.

Usage

Use VerificationMode when you want to explicitly control the level of data integrity checking. Pass it as the verification_mode parameter to load_dataset or dataset builder methods. Use ALL_CHECKS for production pipelines requiring strict integrity, BASIC_CHECKS for general use, and NO_CHECKS for rapid development iteration.
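The guidance above can be sketched as a small helper that maps a pipeline stage to a mode. This is only an illustration: the helper `pick_verification_mode`, the stage names, and the `PIPELINE_STAGE` environment variable are hypothetical and not part of the library. It returns the string values listed on this page, so it runs without `datasets` installed.

```python
import os

# Hypothetical mapping from a deployment stage to a verification_mode string.
# The stage names are assumptions; the values are the enum values documented here.
_MODE_BY_STAGE = {
    "production": "all_checks",
    "staging": "basic_checks",
    "development": "no_checks",
}


def pick_verification_mode(stage: str) -> str:
    """Return the verification_mode string for a stage, falling back to basic checks."""
    return _MODE_BY_STAGE.get(stage.lower(), "basic_checks")


# PIPELINE_STAGE is an assumed environment variable name, not a library setting.
mode = pick_verification_mode(os.environ.get("PIPELINE_STAGE", "development"))
print(mode)
```

The resulting string can then be passed as `verification_mode=mode` to `load_dataset`, since string values are accepted in place of enum members.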

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/utils/info_utils.py
  • Lines: 22-40

Signature

class VerificationMode(enum.Enum):
    """Enum that specifies which verification checks to run.

    The default mode is BASIC_CHECKS, which will perform only rudimentary checks
    to avoid slowdowns when generating/downloading a dataset for the first time.

    The verification modes:

    | ALL_CHECKS             | Split checks and validity (number of files, checksums) of downloaded files |
    | BASIC_CHECKS (default) | Same as ALL_CHECKS but without checking downloaded files                   |
    | NO_CHECKS              | None                                                                       |
    """

    ALL_CHECKS = "all_checks"
    BASIC_CHECKS = "basic_checks"
    NO_CHECKS = "no_checks"

Import

from datasets import VerificationMode

I/O Contract

Inputs

| Name  | Type | Required | Description |
| value | str  | Yes      | One of "all_checks", "basic_checks", or "no_checks". Typically accessed via the enum member (e.g. VerificationMode.ALL_CHECKS). |

Outputs

| Name   | Type             | Description |
| member | VerificationMode | An enum member representing the selected verification level. |
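The input/output contract can be illustrated with a minimal sketch of how a caller might coerce a string, an existing member, or a missing value into a member. The enum below is a local stand-in copied from the signature on this page so the example runs without `datasets` installed, and `normalize_mode` is a hypothetical helper, not a library function.

```python
import enum
from typing import Union


class VerificationMode(enum.Enum):
    """Local stand-in mirroring the three members documented on this page."""

    ALL_CHECKS = "all_checks"
    BASIC_CHECKS = "basic_checks"
    NO_CHECKS = "no_checks"


def normalize_mode(value: Union[str, VerificationMode, None]) -> VerificationMode:
    """Coerce a string, a member, or None into a VerificationMode member."""
    if value is None:
        # Assumption: fall back to the documented default mode.
        return VerificationMode.BASIC_CHECKS
    # Enum lookup by value; an existing member is returned unchanged,
    # and an unknown string raises ValueError.
    return VerificationMode(value)


print(normalize_mode("all_checks"))
print(normalize_mode(None))
```

Letting the enum constructor reject unknown strings means a typo such as "some_checks" fails early with a `ValueError` instead of silently disabling checks.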

Enum Members

| Member       | Value          | Description |
| ALL_CHECKS   | "all_checks"   | Validates both downloaded file checksums/sizes and split names/example counts. Most thorough but slowest. |
| BASIC_CHECKS | "basic_checks" | Validates split names and example counts only. Skips file checksum verification. This is the default. |
| NO_CHECKS    | "no_checks"    | Skips all verification. Fastest mode, suitable for development or trusted data. |
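The relationship between the members and the checks they enable can be sketched as follows. This is a simplified illustration under the descriptions in the table above, not the library's actual implementation: the enum is a local stand-in, and `checks_to_run` and the check labels "splits" and "checksums" are hypothetical names.

```python
import enum
from typing import List


class VerificationMode(enum.Enum):
    """Local stand-in mirroring the three members documented on this page."""

    ALL_CHECKS = "all_checks"
    BASIC_CHECKS = "basic_checks"
    NO_CHECKS = "no_checks"


def checks_to_run(mode: VerificationMode) -> List[str]:
    """Return illustrative labels for the checks a given mode would enable."""
    checks: List[str] = []
    # Split validation runs for both BASIC_CHECKS and ALL_CHECKS.
    if mode in (VerificationMode.BASIC_CHECKS, VerificationMode.ALL_CHECKS):
        checks.append("splits")
    # Downloaded-file validation runs only for ALL_CHECKS.
    if mode is VerificationMode.ALL_CHECKS:
        checks.append("checksums")
    return checks


for mode in VerificationMode:
    print(mode.name, checks_to_run(mode))
```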

Usage Examples

Basic Usage

from datasets import load_dataset, VerificationMode

# Load with full integrity checks
ds = load_dataset(
    "cornell-movie-review-data/rotten_tomatoes",
    verification_mode=VerificationMode.ALL_CHECKS,
)

Skip Verification for Speed

from datasets import load_dataset, VerificationMode

# Skip all checks during development
ds = load_dataset(
    "cornell-movie-review-data/rotten_tomatoes",
    verification_mode=VerificationMode.NO_CHECKS,
)

Using String Value

from datasets import load_dataset

# String values are also accepted
ds = load_dataset(
    "cornell-movie-review-data/rotten_tomatoes",
    verification_mode="no_checks",
)

Related Pages

Implements Principle
