Implementation:Huggingface Datasets TestCommand

From Leeroopedia

Overview

TestCommand is a CLI command class that validates a dataset by downloading, preparing, and loading it. It extends BaseDatasetsCLICommand and provides the test subcommand, which builds one configuration of a dataset (or all of them with --all_configs), verifies checksums and splits unless verification is disabled, and can optionally save the dataset info to the dataset card (README.md) or clear the cache after each configuration is tested. It is used to check that a dataset builds correctly before publishing.

Source File

Property Value
Repository huggingface/datasets
File src/datasets/commands/test.py
Lines 180
Domain CLI, Testing

Import

from datasets.commands.test import TestCommand

Class: TestCommand

Inherits from: BaseDatasetsCLICommand

Note: The class sets __test__ = False to prevent pytest from treating it as a test class.
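
As a minimal sketch of that convention: pytest collects classes whose names start with Test by default and skips any class whose __test__ attribute is False. The class below is a hypothetical stand-in, not the library's TestCommand.

```python
class TestLikeCommand:
    """Hypothetical stand-in for a non-test class whose name starts with 'Test'."""

    # Tell pytest not to collect this class as a test container.
    __test__ = False

    def __init__(self, dataset: str):
        self._dataset = dataset

    def run(self) -> str:
        return f"would test {self._dataset}"


command = TestLikeCommand("my_dataset")
print(command.run())
```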

Constructor

def __init__(
    self,
    dataset: str,
    name: str,
    cache_dir: str,
    data_dir: str,
    all_configs: bool,
    save_infos: bool,
    ignore_verifications: bool,
    force_redownload: bool,
    clear_cache: bool,
    num_proc: int,
):

Parameter Type Description
dataset str Name or path of the dataset to test
name str Name of the dataset configuration to test (set with --name)
cache_dir str Directory where datasets are cached
data_dir str Manual directory to load data files from
all_configs bool Whether to test all dataset configurations
save_infos bool Whether to save dataset infos to the dataset card (README.md)
ignore_verifications bool Whether to skip checksums and splits verification
force_redownload bool Whether to force redownloading the dataset
clear_cache bool Whether to clear downloaded files and cache after each config test
num_proc int Number of processes for parallel processing

Validation rules:

  • If clear_cache is True but cache_dir is not specified, the command exits with an error message.
  • If save_infos is True, ignore_verifications is automatically set to True.
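
The two rules above can be sketched as a small pure function. This is an illustration under stated assumptions, not the constructor itself: the function name and the error text are hypothetical, and the sketch returns the error instead of exiting the process as the command does.

```python
from typing import Optional, Tuple


def validate_test_options(
    cache_dir: Optional[str],
    save_infos: bool,
    ignore_verifications: bool,
    clear_cache: bool,
) -> Tuple[bool, Optional[str]]:
    """Return (effective ignore_verifications, error message or None)."""
    # Rule 1: clearing the cache only makes sense for an explicit cache directory.
    if clear_cache and not cache_dir:
        # Hypothetical message; the real command prints its own text and exits.
        return ignore_verifications, "--clear_cache requires --cache_dir"
    # Rule 2: saving infos implies skipping checksum and split verification.
    if save_infos:
        ignore_verifications = True
    return ignore_verifications, None


print(validate_test_options(None, False, False, True))
print(validate_test_options("/tmp/test_cache", True, False, False))
```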

Methods

register_subcommand(parser)

Static method. Registers the test subcommand with the argument parser. It defines one positional argument (dataset) and several optional flags, including --save_infos as an alias for --save_info:

@staticmethod
def register_subcommand(parser: ArgumentParser):
    test_parser = parser.add_parser("test", help="Test dataset loading.")
    test_parser.add_argument("--name", type=str, default=None, help="Dataset processing name")
    test_parser.add_argument("--cache_dir", type=str, default=None, help="Cache directory where the datasets are stored.")
    test_parser.add_argument("--data_dir", type=str, default=None, help="Can be used to specify a manual directory to get the files from.")
    test_parser.add_argument("--all_configs", action="store_true", help="Test all dataset configurations")
    test_parser.add_argument("--save_info", action="store_true", help="Save the dataset infos in the dataset card (README.md)")
    test_parser.add_argument("--ignore_verifications", action="store_true", help="Run the test without checksums and splits checks.")
    test_parser.add_argument("--force_redownload", action="store_true", help="Force dataset redownload")
    test_parser.add_argument("--clear_cache", action="store_true", help="Remove downloaded files and cached datasets after each config test")
    test_parser.add_argument("--num_proc", type=int, default=None, help="Number of processes")
    test_parser.add_argument("--save_infos", action="store_true", help="alias to save_info")
    test_parser.add_argument("dataset", type=str, help="Name of the dataset to download")
    test_parser.set_defaults(func=_test_command_factory)
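
To show how set_defaults(func=...) ties a subcommand to a factory, here is a self-contained argparse sketch with a reduced set of flags. EchoCommand, _echo_command_factory, and the demo-cli program name are hypothetical, and combining the two save flags with `or` is an assumption about how the alias is resolved.

```python
from argparse import ArgumentParser, Namespace
from typing import Optional


class EchoCommand:
    """Hypothetical command object; it only reports the options it received."""

    def __init__(self, dataset: str, name: Optional[str], save_infos: bool):
        self._dataset = dataset
        self._name = name
        self._save_infos = save_infos

    def run(self) -> str:
        return f"dataset={self._dataset} name={self._name} save_infos={self._save_infos}"


def _echo_command_factory(args: Namespace) -> EchoCommand:
    # Assumption: either spelling of the flag enables saving infos.
    return EchoCommand(args.dataset, args.name, args.save_info or args.save_infos)


parser = ArgumentParser("demo-cli")
# add_subparsers() returns the object on which add_parser() is called.
commands_parser = parser.add_subparsers(help="demo-cli command helpers")
test_parser = commands_parser.add_parser("test", help="Test dataset loading.")
test_parser.add_argument("--name", type=str, default=None)
test_parser.add_argument("--save_info", action="store_true")
test_parser.add_argument("--save_infos", action="store_true", help="alias to save_info")
test_parser.add_argument("dataset", type=str)
test_parser.set_defaults(func=_echo_command_factory)

# Parsing selects the subcommand; args.func is the factory registered above.
args = parser.parse_args(["test", "my_dataset", "--name", "en", "--save_info"])
command = args.func(args)
print(command.run())
```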

run()

Executes the dataset testing workflow. The method follows these steps:

  1. Validates that --name and --all_configs are not used together; if both are given, prints an error and exits with status 1.
  2. Loads the dataset module via dataset_module_factory().
  3. Resolves the builder class via get_dataset_builder_class().
  4. Iterates over dataset configurations using an internal get_builders() generator.
  5. For each builder configuration:
    • Calls download_and_prepare() with the appropriate download mode and verification mode.
    • Calls as_dataset() to verify the dataset loads correctly.
    • Optionally saves dataset info to a dataset card (README.md) if save_infos is enabled.
    • Optionally clears the cache directory and download folder if clear_cache is enabled.
  6. Prints "Test successful." on completion.

def run(self):
    logging.getLogger("filelock").setLevel(ERROR)
    if self._name is not None and self._all_configs:
        print("Both parameters `config` and `all_configs` can't be used at once.")
        exit(1)
    path, config_name = self._dataset, self._name
    module = dataset_module_factory(path)
    builder_cls = get_dataset_builder_class(module)
    n_builders = len(builder_cls.BUILDER_CONFIGS) if self._all_configs and builder_cls.BUILDER_CONFIGS else 1

    def get_builders() -> Generator[DatasetBuilder, None, None]:
        if self._all_configs and builder_cls.BUILDER_CONFIGS:
            for i, config in enumerate(builder_cls.BUILDER_CONFIGS):
                if "config_name" in module.builder_kwargs:
                    yield builder_cls(
                        cache_dir=self._cache_dir,
                        data_dir=self._data_dir,
                        **module.builder_kwargs,
                    )
                else:
                    yield builder_cls(
                        config_name=config.name,
                        cache_dir=self._cache_dir,
                        data_dir=self._data_dir,
                        **module.builder_kwargs,
                    )
        else:
            if "config_name" in module.builder_kwargs:
                yield builder_cls(cache_dir=self._cache_dir, data_dir=self._data_dir, **module.builder_kwargs)
            else:
                yield builder_cls(
                    config_name=config_name,
                    cache_dir=self._cache_dir,
                    data_dir=self._data_dir,
                    **module.builder_kwargs,
                )

    for j, builder in enumerate(get_builders()):
        print(f"Testing builder '{builder.config.name}' ({j + 1}/{n_builders})")
        builder.download_and_prepare(
            download_mode=DownloadMode.REUSE_CACHE_IF_EXISTS
            if not self._force_redownload
            else DownloadMode.FORCE_REDOWNLOAD,
            verification_mode=VerificationMode.NO_CHECKS
            if self._ignore_verifications
            else VerificationMode.ALL_CHECKS,
            num_proc=self._num_proc,
        )
        builder.as_dataset()
        if self._save_infos:
            save_infos_dir = os.path.basename(path) if not os.path.isdir(path) else path
            os.makedirs(save_infos_dir, exist_ok=True)
            DatasetInfosDict(**{builder.config.name: builder.info}).write_to_directory(save_infos_dir)
        if self._clear_cache:
            if os.path.isdir(builder._cache_dir):
                rmtree(builder._cache_dir)
            download_dir = os.path.join(self._cache_dir, datasets.config.DOWNLOADED_DATASETS_DIR)
            if os.path.isdir(download_dir):
                rmtree(download_dir)

    print("Test successful.")
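
The two conditional expressions passed to download_and_prepare() can be read in isolation. The sketch below uses local stand-in enums rather than the library's DownloadMode and VerificationMode, so it runs without datasets installed; the enum and function names are hypothetical and the member values are placeholders.

```python
from enum import Enum, auto
from typing import Tuple


class DownloadModeSketch(Enum):
    """Stand-in for DownloadMode; only the members used by run() are listed."""

    REUSE_CACHE_IF_EXISTS = auto()
    FORCE_REDOWNLOAD = auto()


class VerificationModeSketch(Enum):
    """Stand-in for VerificationMode; only the members used by run() are listed."""

    ALL_CHECKS = auto()
    NO_CHECKS = auto()


def select_modes(
    force_redownload: bool, ignore_verifications: bool
) -> Tuple[DownloadModeSketch, VerificationModeSketch]:
    # Same branching as in run(): reuse the cache unless a redownload is forced.
    download_mode = (
        DownloadModeSketch.REUSE_CACHE_IF_EXISTS
        if not force_redownload
        else DownloadModeSketch.FORCE_REDOWNLOAD
    )
    # Run all checks unless verification is explicitly ignored.
    verification_mode = (
        VerificationModeSketch.NO_CHECKS
        if ignore_verifications
        else VerificationModeSketch.ALL_CHECKS
    )
    return download_mode, verification_mode


print(select_modes(force_redownload=False, ignore_verifications=False))
print(select_modes(force_redownload=True, ignore_verifications=True))
```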

I/O

Direction Description
Input CLI arguments: dataset name/path, optional config name, cache/data directories, boolean flags for testing behavior
Output Downloads and prepares dataset(s), prints progress and success/failure messages; optionally writes dataset info cards and manages cache
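
The cache management mentioned in the Output row corresponds to the clear_cache step at the end of run(). A minimal sketch of that step against a temporary directory follows; DOWNLOADS_DIRNAME is an assumed stand-in for datasets.config.DOWNLOADED_DATASETS_DIR, and the directory layout and function name are hypothetical.

```python
import os
import tempfile
from shutil import rmtree

# Assumed stand-in for datasets.config.DOWNLOADED_DATASETS_DIR.
DOWNLOADS_DIRNAME = "downloads"


def clear_config_cache(builder_cache_dir: str, cache_dir: str) -> None:
    """Remove one builder's cache directory and the shared downloads folder."""
    if os.path.isdir(builder_cache_dir):
        rmtree(builder_cache_dir)
    download_dir = os.path.join(cache_dir, DOWNLOADS_DIRNAME)
    if os.path.isdir(download_dir):
        rmtree(download_dir)


# Build a throwaway cache layout (hypothetical) to exercise the function.
cache_dir = tempfile.mkdtemp()
builder_cache_dir = os.path.join(cache_dir, "my_dataset", "default")
os.makedirs(builder_cache_dir)
os.makedirs(os.path.join(cache_dir, DOWNLOADS_DIRNAME))

clear_config_cache(builder_cache_dir, cache_dir)
remaining = sorted(os.listdir(cache_dir))
print(remaining)

# Clean up the temporary directory created for this sketch.
rmtree(cache_dir)
```

Only the builder's own directory and the downloads folder are removed, so parent directories inside the cache directory are left in place.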

Dependencies

Module Purpose
datasets.builder.DatasetBuilder Base class for dataset builders
datasets.commands.BaseDatasetsCLICommand Abstract base class for CLI commands
datasets.download.download_manager.DownloadMode Controls download caching behavior
datasets.info.DatasetInfosDict Dataset info serialization
datasets.load.dataset_module_factory Loads the dataset module
datasets.load.get_dataset_builder_class Resolves the builder class from a module
datasets.utils.info_utils.VerificationMode Controls checksum/split verification
shutil.rmtree Cache directory removal

Usage

# Test a specific dataset
datasets-cli test my_dataset

# Test with a specific configuration
datasets-cli test my_dataset --name en

# Test all configurations with forced redownload
datasets-cli test my_dataset --all_configs --force_redownload

# Test and save dataset info
datasets-cli test my_dataset --save_info --cache_dir /tmp/test_cache

# Test with cache cleanup
datasets-cli test my_dataset --clear_cache --cache_dir /tmp/test_cache

# Test with parallel processing
datasets-cli test my_dataset --num_proc 4
