Implementation:Huggingface Datasets TestCommand
Overview
TestCommand is a CLI command class that validates a dataset by running its download, preparation, and loading steps. It extends BaseDatasetsCLICommand and provides a test subcommand that loads a dataset (optionally every configuration), downloads and prepares it, verifies checksums and splits, and optionally saves the dataset infos to the dataset card (README.md) or clears the cache after each configuration is tested. Its purpose is to confirm that a dataset builds correctly before it is published.
Source File
| Property | Value |
|---|---|
| Repository | huggingface/datasets |
| File | src/datasets/commands/test.py |
| Lines | 180 |
| Domain | CLI, Testing |
Import
from datasets.commands.test import TestCommand
Class: TestCommand
Inherits from: BaseDatasetsCLICommand
Note: The class sets __test__ = False to prevent pytest from treating it as a test class.
Constructor
def __init__(
    self,
    dataset: str,
    name: str,
    cache_dir: str,
    data_dir: str,
    all_configs: bool,
    save_infos: bool,
    ignore_verifications: bool,
    force_redownload: bool,
    clear_cache: bool,
    num_proc: int,
):
| Parameter | Type | Description |
|---|---|---|
| `dataset` | `str` | Name or path of the dataset to test |
| `name` | `str` | Dataset processing configuration name |
| `cache_dir` | `str` | Directory where datasets are cached |
| `data_dir` | `str` | Manual directory to load data files from |
| `all_configs` | `bool` | Whether to test all dataset configurations |
| `save_infos` | `bool` | Whether to save dataset infos to the dataset card (README.md) |
| `ignore_verifications` | `bool` | Whether to skip checksums and splits verification |
| `force_redownload` | `bool` | Whether to force redownloading the dataset |
| `clear_cache` | `bool` | Whether to clear downloaded files and cache after each config test |
| `num_proc` | `int` | Number of processes for parallel processing |
Validation rules:
- If `clear_cache` is `True` but `cache_dir` is not specified, the command exits with an error message.
- If `save_infos` is `True`, `ignore_verifications` is automatically set to `True`.
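The two validation rules can be sketched as a standalone function. This is a minimal illustration rather than the constructor itself: `check_test_args` is a hypothetical helper, and it raises `ValueError` with an illustrative message where the real command prints an error and exits.

```python
from typing import Optional


def check_test_args(
    clear_cache: bool,
    cache_dir: Optional[str],
    save_infos: bool,
    ignore_verifications: bool,
) -> bool:
    """Hypothetical helper mirroring the two validation rules.

    Returns the effective value of ignore_verifications.
    """
    # Rule 1: clearing the cache requires an explicit cache directory.
    if clear_cache and not cache_dir:
        raise ValueError("clear_cache requires cache_dir to be set")
    # Rule 2: saving infos forces verification to be skipped.
    if save_infos:
        return True
    return ignore_verifications


# save_infos overrides an explicit ignore_verifications=False.
effective = check_test_args(
    clear_cache=False, cache_dir=None, save_infos=True, ignore_verifications=False
)
print(effective)
```

Returning the effective flag instead of mutating state keeps the sketch easy to test; the class stores the result on the instance.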
Methods
register_subcommand(parser)
Static method. Registers the test subcommand with the argument parser. Defines one positional argument (dataset) and multiple optional flags:
@staticmethod
def register_subcommand(parser: ArgumentParser):
    test_parser = parser.add_parser("test", help="Test dataset loading.")
    test_parser.add_argument("--name", type=str, default=None, help="Dataset processing name")
    test_parser.add_argument("--cache_dir", type=str, default=None, help="Cache directory where the datasets are stored.")
    test_parser.add_argument("--data_dir", type=str, default=None, help="Can be used to specify a manual directory to get the files from.")
    test_parser.add_argument("--all_configs", action="store_true", help="Test all dataset configurations")
    test_parser.add_argument("--save_info", action="store_true", help="Save the dataset infos in the dataset card (README.md)")
    test_parser.add_argument("--ignore_verifications", action="store_true", help="Run the test without checksums and splits checks.")
    test_parser.add_argument("--force_redownload", action="store_true", help="Force dataset redownload")
    test_parser.add_argument("--clear_cache", action="store_true", help="Remove downloaded files and cached datasets after each config test")
    test_parser.add_argument("--num_proc", type=int, default=None, help="Number of processes")
    test_parser.add_argument("--save_infos", action="store_true", help="alias to save_info")
    test_parser.add_argument("dataset", type=str, help="Name of the dataset to download")
    test_parser.set_defaults(func=_test_command_factory)
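To see how these arguments behave once parsed, the following sketch rebuilds a reduced version of the subcommand with the standard `argparse` module. The `demo-cli` parser and the reduced flag set are assumptions for illustration. Because `--save_info` and `--save_infos` are registered as two separate flags, they end up in two separate attributes; combining them with `or`, as done here, is one plausible way a factory could treat them as aliases, not a statement about what `_test_command_factory` does.

```python
from argparse import ArgumentParser

# Top-level parser with subcommands, standing in for the real CLI parser.
cli = ArgumentParser("demo-cli")
commands = cli.add_subparsers(dest="command")

# A reduced version of the "test" subcommand: one positional and a few flags.
test_parser = commands.add_parser("test", help="Test dataset loading.")
test_parser.add_argument("--name", type=str, default=None)
test_parser.add_argument("--all_configs", action="store_true")
test_parser.add_argument("--save_info", action="store_true")
test_parser.add_argument("--save_infos", action="store_true")
test_parser.add_argument("--num_proc", type=int, default=None)
test_parser.add_argument("dataset", type=str)

args = cli.parse_args(["test", "my_dataset", "--save_infos", "--num_proc", "4"])

# The two spellings are stored separately, so the caller has to merge them.
save_infos = args.save_info or args.save_infos
print(args.dataset, args.name, args.num_proc, save_infos)
```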
run()
Executes the dataset testing workflow. The method follows these steps:
- Validates that `--name` and `--all_configs` are not used together.
- Loads the dataset module via `dataset_module_factory()`.
- Resolves the builder class via `get_dataset_builder_class()`.
- Iterates over dataset configurations using an internal `get_builders()` generator.
- For each builder configuration:
  - Calls `download_and_prepare()` with the appropriate download mode and verification mode.
  - Calls `as_dataset()` to verify the dataset loads correctly.
  - Optionally saves dataset info to a dataset card (README.md) if `save_infos` is enabled.
  - Optionally clears the cache directory and download folder if `clear_cache` is enabled.
- Prints "Test successful." on completion.
def run(self):
    logging.getLogger("filelock").setLevel(ERROR)
    if self._name is not None and self._all_configs:
        print("Both parameters `config` and `all_configs` can't be used at once.")
        exit(1)
    path, config_name = self._dataset, self._name
    module = dataset_module_factory(path)
    builder_cls = get_dataset_builder_class(module)
    n_builders = len(builder_cls.BUILDER_CONFIGS) if self._all_configs and builder_cls.BUILDER_CONFIGS else 1

    def get_builders() -> Generator[DatasetBuilder, None, None]:
        if self._all_configs and builder_cls.BUILDER_CONFIGS:
            for i, config in enumerate(builder_cls.BUILDER_CONFIGS):
                if "config_name" in module.builder_kwargs:
                    yield builder_cls(
                        cache_dir=self._cache_dir,
                        data_dir=self._data_dir,
                        **module.builder_kwargs,
                    )
                else:
                    yield builder_cls(
                        config_name=config.name,
                        cache_dir=self._cache_dir,
                        data_dir=self._data_dir,
                        **module.builder_kwargs,
                    )
        else:
            if "config_name" in module.builder_kwargs:
                yield builder_cls(cache_dir=self._cache_dir, data_dir=self._data_dir, **module.builder_kwargs)
            else:
                yield builder_cls(
                    config_name=config_name,
                    cache_dir=self._cache_dir,
                    data_dir=self._data_dir,
                    **module.builder_kwargs,
                )

    for j, builder in enumerate(get_builders()):
        print(f"Testing builder '{builder.config.name}' ({j + 1}/{n_builders})")
        builder.download_and_prepare(
            download_mode=DownloadMode.REUSE_CACHE_IF_EXISTS
            if not self._force_redownload
            else DownloadMode.FORCE_REDOWNLOAD,
            verification_mode=VerificationMode.NO_CHECKS
            if self._ignore_verifications
            else VerificationMode.ALL_CHECKS,
            num_proc=self._num_proc,
        )
        builder.as_dataset()
        if self._save_infos:
            save_infos_dir = os.path.basename(path) if not os.path.isdir(path) else path
            os.makedirs(save_infos_dir, exist_ok=True)
            DatasetInfosDict(**{builder.config.name: builder.info}).write_to_directory(save_infos_dir)
        if self._clear_cache:
            if os.path.isdir(builder._cache_dir):
                rmtree(builder._cache_dir)
            download_dir = os.path.join(self._cache_dir, datasets.config.DOWNLOADED_DATASETS_DIR)
            if os.path.isdir(download_dir):
                rmtree(download_dir)

    print("Test successful.")
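The two conditional expressions passed to `download_and_prepare()` are easy to misread, so the sketch below isolates them. The enums are local stand-ins that reuse only the member names of the library's `DownloadMode` and `VerificationMode`, so the snippet runs without the library installed. Their values are placeholders, and `select_modes` is a hypothetical helper.

```python
from enum import Enum, auto
from typing import Tuple


class DownloadMode(Enum):
    """Local stand-in: member names only, values are placeholders."""
    REUSE_CACHE_IF_EXISTS = auto()
    FORCE_REDOWNLOAD = auto()


class VerificationMode(Enum):
    """Local stand-in: member names only, values are placeholders."""
    ALL_CHECKS = auto()
    NO_CHECKS = auto()


def select_modes(
    force_redownload: bool, ignore_verifications: bool
) -> Tuple[DownloadMode, VerificationMode]:
    """Hypothetical helper mapping the two CLI flags to the two modes."""
    # Reuse the cache unless a redownload was explicitly requested.
    download_mode = (
        DownloadMode.FORCE_REDOWNLOAD
        if force_redownload
        else DownloadMode.REUSE_CACHE_IF_EXISTS
    )
    # Run all checks unless verification was explicitly disabled.
    verification_mode = (
        VerificationMode.NO_CHECKS
        if ignore_verifications
        else VerificationMode.ALL_CHECKS
    )
    return download_mode, verification_mode


# Default flags: cached files are reused and every check runs.
print(select_modes(force_redownload=False, ignore_verifications=False))
```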
I/O
| Direction | Description |
|---|---|
| Input | CLI arguments: dataset name/path, optional config name, cache/data directories, boolean flags for testing behavior |
| Output | Downloads and prepares dataset(s), prints progress and success/failure messages; optionally writes dataset info cards and manages cache |
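Where the dataset infos are written depends on the `save_infos_dir` expression in `run()`. The sketch below wraps that expression in a hypothetical helper, `resolve_save_infos_dir`, to show both branches. The identifier `some_user/my_dataset` is made up and is assumed not to exist as a local directory.

```python
import os
import tempfile


def resolve_save_infos_dir(path: str) -> str:
    """Hypothetical helper wrapping the expression used in run()."""
    return os.path.basename(path) if not os.path.isdir(path) else path


# Not a local directory: only the last path component is kept, so the
# infos would be written to a directory of that name relative to the
# current working directory.
print(resolve_save_infos_dir("some_user/my_dataset"))

# An existing local directory is used unchanged.
with tempfile.TemporaryDirectory() as local_dir:
    print(resolve_save_infos_dir(local_dir) == local_dir)
```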
Dependencies
| Module | Purpose |
|---|---|
| `datasets.builder.DatasetBuilder` | Base class for dataset builders |
| `datasets.commands.BaseDatasetsCLICommand` | Abstract base class for CLI commands |
| `datasets.download.download_manager.DownloadMode` | Controls download caching behavior |
| `datasets.info.DatasetInfosDict` | Dataset info serialization |
| `datasets.load.dataset_module_factory` | Loads the dataset module |
| `datasets.load.get_dataset_builder_class` | Resolves the builder class from a module |
| `datasets.utils.info_utils.VerificationMode` | Controls checksum/split verification |
| `shutil.rmtree` | Cache directory removal |
Usage
# Test a specific dataset
datasets-cli test my_dataset
# Test with a specific configuration
datasets-cli test my_dataset --name en
# Test all configurations with forced redownload
datasets-cli test my_dataset --all_configs --force_redownload
# Test and save dataset info
datasets-cli test my_dataset --save_info --cache_dir /tmp/test_cache
# Test with cache cleanup
datasets-cli test my_dataset --clear_cache --cache_dir /tmp/test_cache
# Test with parallel processing
datasets-cli test my_dataset --num_proc 4
Related Pages
- Huggingface_Datasets_Datasets_CLI
- Huggingface_Datasets_EnvironmentCommand
- Huggingface_Datasets_DeleteFromHubCommand