Implementation:Huggingface Datasets TestCommand

Overview

TestCommand is a CLI command class that validates a dataset by downloading, preparing, and loading it. It extends BaseDatasetsCLICommand and implements the test subcommand, which builds one configuration (or every configuration when --all_configs is passed), verifies checksums and splits unless verification is disabled, and can optionally write the dataset infos to the dataset card (README.md) or clear the cache after each configuration. It is used to check that a dataset builds correctly before publishing.

Source File

Property	Value
Repository	huggingface/datasets
File	src/datasets/commands/test.py
Lines	180
Domain	CLI, Testing

Import

from datasets.commands.test import TestCommand

Class: TestCommand

Inherits from: BaseDatasetsCLICommand

Note: The class sets __test__ = False to prevent pytest from treating it as a test class.
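
The effect of this attribute can be illustrated with a small sketch. The class name TestLikeCommand and the helper looks_collectable are hypothetical and not part of the datasets library; the helper only approximates pytest's default class-collection rule (a name starting with "Test" and a truthy __test__ attribute) and is not pytest's actual code.

```python
class TestLikeCommand:
    """Hypothetical CLI command whose name happens to start with 'Test'."""

    # pytest skips classes whose __test__ attribute is falsy during collection.
    __test__ = False

    def run(self) -> str:
        return "ran"


def looks_collectable(cls: type) -> bool:
    """Rough approximation of pytest's default class-collection rule."""
    return cls.__name__.startswith("Test") and bool(getattr(cls, "__test__", True))
```

Without the attribute, a class named like this would be picked up as a candidate test class purely because of its name.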

Constructor

def __init__(
    self,
    dataset: str,
    name: str,
    cache_dir: str,
    data_dir: str,
    all_configs: bool,
    save_infos: bool,
    ignore_verifications: bool,
    force_redownload: bool,
    clear_cache: bool,
    num_proc: int,
):

Parameter	Type	Description
`dataset`	`str`	Name or path of the dataset to test
`name`	`str`	Name of the dataset configuration to test (None when --name is not passed)
`cache_dir`	`str`	Directory where datasets are cached
`data_dir`	`str`	Manual directory to load data files from
`all_configs`	`bool`	Whether to test all dataset configurations
`save_infos`	`bool`	Whether to save dataset infos to the dataset card (README.md)
`ignore_verifications`	`bool`	Whether to skip checksums and splits verification
`force_redownload`	`bool`	Whether to force redownloading the dataset
`clear_cache`	`bool`	Whether to clear downloaded files and cache after each config test
`num_proc`	`int`	Number of processes for parallel processing

Validation rules:

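If clear_cache is True but cache_dir is not specified, the constructor prints an error message and exits, because the cache locations to delete would otherwise be ambiguous.
If save_infos is True, ignore_verifications is forced to True, so checksum and split checks are skipped for that run.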
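
The two rules can be sketched as a small pure function. The helper resolve_options and its error string are hypothetical, introduced only for illustration; the real constructor prints a message and exits rather than returning a value.

```python
from typing import Optional, Tuple


def resolve_options(
    cache_dir: Optional[str],
    save_infos: bool,
    ignore_verifications: bool,
    clear_cache: bool,
) -> Tuple[bool, Optional[str]]:
    """Return (effective_ignore_verifications, error_message).

    Hypothetical helper mirroring the two validation rules; not library code.
    """
    if clear_cache and not cache_dir:
        # Clearing the cache without an explicit cache directory is rejected.
        return ignore_verifications, "--clear_cache requires --cache_dir"
    if save_infos:
        # Saving infos implies skipping checksum and split verification.
        ignore_verifications = True
    return ignore_verifications, None
```

Returning the error instead of exiting keeps the sketch easy to exercise in isolation.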

Methods

register_subcommand(parser)

Static method. Registers the test subcommand with the argument parser. It defines one positional argument (dataset) and several optional flags; --save_infos is registered as an alias of --save_info:

@staticmethod
def register_subcommand(parser: ArgumentParser):
    test_parser = parser.add_parser("test", help="Test dataset loading.")
    test_parser.add_argument("--name", type=str, default=None, help="Dataset processing name")
    test_parser.add_argument("--cache_dir", type=str, default=None, help="Cache directory where the datasets are stored.")
    test_parser.add_argument("--data_dir", type=str, default=None, help="Can be used to specify a manual directory to get the files from.")
    test_parser.add_argument("--all_configs", action="store_true", help="Test all dataset configurations")
    test_parser.add_argument("--save_info", action="store_true", help="Save the dataset infos in the dataset card (README.md)")
    test_parser.add_argument("--ignore_verifications", action="store_true", help="Run the test without checksums and splits checks.")
    test_parser.add_argument("--force_redownload", action="store_true", help="Force dataset redownload")
    test_parser.add_argument("--clear_cache", action="store_true", help="Remove downloaded files and cached datasets after each config test")
    test_parser.add_argument("--num_proc", type=int, default=None, help="Number of processes")
    test_parser.add_argument("--save_infos", action="store_true", help="alias to save_info")
    test_parser.add_argument("dataset", type=str, help="Name of the dataset to download")
    test_parser.set_defaults(func=_test_command_factory)
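
How such a registration behaves at parse time can be tried with the standard library alone. The following is a minimal sketch, not the library's entry point: the program name is made up, only a subset of the flags is reproduced, and the way the two save-info spellings are merged is an assumption about the factory function, which this page does not show. Note that add_parser is a method of the object returned by add_subparsers(), so that object is what would be passed as parser.

```python
from argparse import ArgumentParser

# Hypothetical top-level parser standing in for the real CLI entry point.
parser = ArgumentParser("datasets-cli-sketch")
subparsers = parser.add_subparsers(dest="command")

# Subset of the flags registered by the test subcommand.
test_parser = subparsers.add_parser("test", help="Test dataset loading.")
test_parser.add_argument("--name", type=str, default=None)
test_parser.add_argument("--all_configs", action="store_true")
test_parser.add_argument("--save_info", action="store_true")
test_parser.add_argument("--save_infos", action="store_true")
test_parser.add_argument("--num_proc", type=int, default=None)
test_parser.add_argument("dataset", type=str)

args = parser.parse_args(["test", "my_dataset", "--save_infos", "--num_proc", "4"])

# Assumption: either spelling enables the option.
save_infos = args.save_info or args.save_infos
```

Unset store_true flags parse to False and unset typed options to None, which is what the constructor receives when a flag is omitted.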

run()

Executes the dataset testing workflow. The method follows these steps:

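Validates that --name and --all_configs are not used together; if both are set, it prints an error (the message refers to --name as `config`) and exits with status 1.
Loads the dataset module via dataset_module_factory().
Resolves the builder class via get_dataset_builder_class().
Iterates over dataset configurations using an internal get_builders() generator, which yields one builder per entry in BUILDER_CONFIGS when --all_configs is set and that list is non-empty, and a single builder otherwise.
For each builder configuration:
- Calls download_and_prepare() with FORCE_REDOWNLOAD if force_redownload is set (REUSE_CACHE_IF_EXISTS otherwise) and NO_CHECKS if ignore_verifications is set (ALL_CHECKS otherwise).
- Calls as_dataset() to verify the dataset loads correctly.
- Optionally saves dataset info to a dataset card (README.md) if save_infos is enabled.
- Optionally removes the builder's cache directory and the downloads folder under cache_dir if clear_cache is enabled.
Prints "Test successful." on completion.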
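
The mode selection in the download_and_prepare() step can be sketched in isolation. The two enums below are stand-ins defined locally so the example runs without the datasets package; their member names follow the ones used by run(), but the string values are placeholders, and select_modes is a hypothetical helper.

```python
from enum import Enum
from typing import Tuple


class DownloadMode(Enum):
    # Stand-in for the library enum; only the two members used by run().
    REUSE_CACHE_IF_EXISTS = "reuse"
    FORCE_REDOWNLOAD = "force"


class VerificationMode(Enum):
    # Stand-in for the library enum; only the two members used by run().
    ALL_CHECKS = "all"
    NO_CHECKS = "none"


def select_modes(
    force_redownload: bool, ignore_verifications: bool
) -> Tuple[DownloadMode, VerificationMode]:
    """Map the two boolean flags to the modes passed to download_and_prepare()."""
    download_mode = (
        DownloadMode.FORCE_REDOWNLOAD if force_redownload else DownloadMode.REUSE_CACHE_IF_EXISTS
    )
    verification_mode = (
        VerificationMode.NO_CHECKS if ignore_verifications else VerificationMode.ALL_CHECKS
    )
    return download_mode, verification_mode
```

With both flags left at their defaults, the cache is reused and all checks run.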

def run(self):
    logging.getLogger("filelock").setLevel(ERROR)
    if self._name is not None and self._all_configs:
        print("Both parameters `config` and `all_configs` can't be used at once.")
        exit(1)
    path, config_name = self._dataset, self._name
    module = dataset_module_factory(path)
    builder_cls = get_dataset_builder_class(module)
    n_builders = len(builder_cls.BUILDER_CONFIGS) if self._all_configs and builder_cls.BUILDER_CONFIGS else 1

    def get_builders() -> Generator[DatasetBuilder, None, None]:
        if self._all_configs and builder_cls.BUILDER_CONFIGS:
            for i, config in enumerate(builder_cls.BUILDER_CONFIGS):
                if "config_name" in module.builder_kwargs:
                    yield builder_cls(
                        cache_dir=self._cache_dir,
                        data_dir=self._data_dir,
                        **module.builder_kwargs,
                    )
                else:
                    yield builder_cls(
                        config_name=config.name,
                        cache_dir=self._cache_dir,
                        data_dir=self._data_dir,
                        **module.builder_kwargs,
                    )
        else:
            if "config_name" in module.builder_kwargs:
                yield builder_cls(cache_dir=self._cache_dir, data_dir=self._data_dir, **module.builder_kwargs)
            else:
                yield builder_cls(
                    config_name=config_name,
                    cache_dir=self._cache_dir,
                    data_dir=self._data_dir,
                    **module.builder_kwargs,
                )

    for j, builder in enumerate(get_builders()):
        print(f"Testing builder '{builder.config.name}' ({j + 1}/{n_builders})")
        builder.download_and_prepare(
            download_mode=DownloadMode.REUSE_CACHE_IF_EXISTS
            if not self._force_redownload
            else DownloadMode.FORCE_REDOWNLOAD,
            verification_mode=VerificationMode.NO_CHECKS
            if self._ignore_verifications
            else VerificationMode.ALL_CHECKS,
            num_proc=self._num_proc,
        )
        builder.as_dataset()
        if self._save_infos:
            save_infos_dir = os.path.basename(path) if not os.path.isdir(path) else path
            os.makedirs(save_infos_dir, exist_ok=True)
            DatasetInfosDict(**{builder.config.name: builder.info}).write_to_directory(save_infos_dir)
        if self._clear_cache:
            if os.path.isdir(builder._cache_dir):
                rmtree(builder._cache_dir)
            download_dir = os.path.join(self._cache_dir, datasets.config.DOWNLOADED_DATASETS_DIR)
            if os.path.isdir(download_dir):
                rmtree(download_dir)

    print("Test successful.")
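
The cache cleanup at the end of the loop can be tried on a throwaway directory. This is a sketch under stated assumptions: clear_after_test is a hypothetical helper, the "downloads" subdirectory name is an assumed value for datasets.config.DOWNLOADED_DATASETS_DIR, and the builder directory layout is invented for the demonstration.

```python
import os
import tempfile
from shutil import rmtree

# Assumed value of datasets.config.DOWNLOADED_DATASETS_DIR.
DOWNLOADS_SUBDIR = "downloads"


def clear_after_test(cache_dir: str, builder_cache_dir: str) -> None:
    """Remove the builder's cache directory and the downloads folder, if present."""
    if os.path.isdir(builder_cache_dir):
        rmtree(builder_cache_dir)
    download_dir = os.path.join(cache_dir, DOWNLOADS_SUBDIR)
    if os.path.isdir(download_dir):
        rmtree(download_dir)


with tempfile.TemporaryDirectory() as cache_dir:
    # Invented layout: <cache_dir>/my_dataset/default plus <cache_dir>/downloads.
    builder_dir = os.path.join(cache_dir, "my_dataset", "default")
    os.makedirs(builder_dir)
    os.makedirs(os.path.join(cache_dir, DOWNLOADS_SUBDIR))

    clear_after_test(cache_dir, builder_dir)
    remaining = sorted(os.listdir(cache_dir))
```

Only the builder's own directory and the downloads folder are deleted; parent directories of the builder cache are left in place, and a second call is harmless because both removals are guarded by os.path.isdir.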

I/O

Direction	Description
Input	CLI arguments: dataset name/path, optional config name, cache/data directories, boolean flags for testing behavior
Output	Downloads and prepares dataset(s), prints progress and success/failure messages; optionally writes dataset info cards and manages cache

Dependencies

Module	Purpose
`datasets.builder.DatasetBuilder`	Base class for dataset builders
`datasets.commands.BaseDatasetsCLICommand`	Abstract base class for CLI commands
`datasets.download.download_manager.DownloadMode`	Controls download caching behavior
`datasets.info.DatasetInfosDict`	Dataset info serialization
`datasets.load.dataset_module_factory`	Loads the dataset module
`datasets.load.get_dataset_builder_class`	Resolves the builder class from a module
`datasets.utils.info_utils.VerificationMode`	Controls checksum/split verification
`shutil.rmtree`	Cache directory removal

Usage

# Test a specific dataset
datasets-cli test my_dataset

# Test with a specific configuration
datasets-cli test my_dataset --name en

# Test all configurations with forced redownload
datasets-cli test my_dataset --all_configs --force_redownload

# Test and save dataset info
datasets-cli test my_dataset --save_info --cache_dir /tmp/test_cache

# Test with cache cleanup
datasets-cli test my_dataset --clear_cache --cache_dir /tmp/test_cache

# Test with parallel processing
datasets-cli test my_dataset --num_proc 4
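
When the command is driven from a script, the argument list can be assembled programmatically. The helper build_test_command below is hypothetical and covers only some of the flags; it also repeats the two argument checks described earlier so that invalid combinations fail before a process is started.

```python
from typing import List, Optional


def build_test_command(
    dataset: str,
    name: Optional[str] = None,
    all_configs: bool = False,
    cache_dir: Optional[str] = None,
    clear_cache: bool = False,
    num_proc: Optional[int] = None,
) -> List[str]:
    """Hypothetical helper: assemble a `datasets-cli test` argument list."""
    if name is not None and all_configs:
        raise ValueError("--name and --all_configs cannot be combined")
    if clear_cache and not cache_dir:
        raise ValueError("--clear_cache requires --cache_dir")
    cmd = ["datasets-cli", "test", dataset]
    if name is not None:
        cmd += ["--name", name]
    if all_configs:
        cmd.append("--all_configs")
    if cache_dir:
        cmd += ["--cache_dir", cache_dir]
    if clear_cache:
        cmd.append("--clear_cache")
    if num_proc is not None:
        cmd += ["--num_proc", str(num_proc)]
    return cmd
```

The resulting list could be passed to subprocess.run(cmd, check=True) in an environment where datasets-cli is installed.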
