Implementation: Hugging Face Datasets load_dataset for Verification
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
A concrete tool, provided by the Hugging Face Datasets library, for verifying a dataset uploaded to the Hub by loading it back.
Description
load_dataset loads a dataset from the Hugging Face Hub (or a local directory). For verification purposes, it is called after push_to_hub to load the just-published dataset and confirm correctness. The function resolves the dataset builder from the repository structure, downloads and caches the data files, processes them into Arrow tables, and returns a DatasetDict (when no split is specified) or a single Dataset (when a split is specified). When streaming=True, it instead returns an IterableDataset (or IterableDatasetDict) that reads data on demand without downloading it up front. The function supports configuration selection, custom data files, revision pinning, and download mode control.
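As a minimal sketch of the download mode control mentioned above, forcing a re-download makes the verification read the freshly published files rather than a stale local cache (the repository id below is a placeholder):
from datasets import DownloadMode, load_dataset

# Force a fresh download so verification does not read a stale cache.
# "my-username/my-dataset" is a placeholder repository id.
ds = load_dataset(
    "my-username/my-dataset",
    download_mode=DownloadMode.FORCE_REDOWNLOAD,
)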
Usage
Use load_dataset after publishing to the Hub to verify that the dataset loads correctly and has the expected splits, features, and row counts. This round-trip verification confirms end-to-end data integrity, as sketched below.
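A sketch of how this round trip can be made programmatic; the repository id, expected row counts, and split names are placeholders:
from datasets import load_dataset

EXPECTED_ROWS = {"train": 1000, "test": 200}  # placeholder expectations

ds = load_dataset("my-username/my-dataset")  # placeholder repository id

# Verify that the expected splits and row counts survived the round trip
assert set(ds.keys()) == set(EXPECTED_ROWS), f"unexpected splits: {list(ds)}"
for split, n_rows in EXPECTED_ROWS.items():
    assert ds[split].num_rows == n_rows, f"{split}: {ds[split].num_rows} rows"

# Verify that all splits share one schema
assert ds["train"].features == ds["test"].features
print("round-trip verification passed")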
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/load.py
- Lines: 1278-1519
Signature
def load_dataset(
path: str,
name: Optional[str] = None,
data_dir: Optional[str] = None,
data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None,
split: Optional[Union[str, Split, list[str], list[Split]]] = None,
cache_dir: Optional[str] = None,
features: Optional[Features] = None,
download_config: Optional[DownloadConfig] = None,
download_mode: Optional[Union[DownloadMode, str]] = None,
verification_mode: Optional[Union[VerificationMode, str]] = None,
keep_in_memory: Optional[bool] = None,
save_infos: bool = False,
revision: Optional[Union[str, Version]] = None,
token: Optional[Union[bool, str]] = None,
streaming: bool = False,
num_proc: Optional[int] = None,
storage_options: Optional[dict] = None,
**config_kwargs,
) -> Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]:
Import
from datasets import load_dataset
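Relatedly, load_dataset can run its own integrity checks while generating the dataset via the verification_mode parameter; a sketch, with a placeholder repository id:
from datasets import VerificationMode, load_dataset

# Ask the loader to run all available integrity checks
# (e.g., generated split sizes) during dataset preparation.
ds = load_dataset(
    "my-username/my-dataset",  # placeholder repository id
    verification_mode=VerificationMode.ALL_CHECKS,
)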
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | str | Yes | Path or name of the dataset (e.g., "username/dataset_name" for the Hub, or a local directory path). |
| name | str | No | Configuration (subset) name. |
| split | str or Split | No | Which split to load. None returns all splits as a DatasetDict. |
| revision | str or Version | No | Version/branch/commit of the dataset to load. |
| token | str or bool | No | Authentication token for private datasets. |
| streaming | bool | No | Whether to stream data without downloading. Defaults to False. |
| **config_kwargs | — | No | Additional keyword arguments passed to the dataset builder. |
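A sketch combining several of the inputs above; the repository id, configuration name, and revision value are placeholders:
from datasets import load_dataset

# Pin the exact revision created by push_to_hub so the check targets
# the published commit; token=True reuses the locally saved token.
ds_train = load_dataset(
    "my-username/my-dataset",  # placeholder repository id
    name="en",                 # placeholder configuration name
    split="train",
    revision="main",           # or a specific commit sha
    token=True,
)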
Outputs
| Name | Type | Description |
|---|---|---|
| return | DatasetDict, Dataset, IterableDatasetDict, or IterableDataset | The loaded dataset; the concrete type depends on the split and streaming parameters. |
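A sketch of how the return type varies with split and streaming (placeholder repository id):
from datasets import Dataset, DatasetDict, IterableDataset, load_dataset

ds = load_dataset("my-username/my-dataset")                # no split -> DatasetDict
assert isinstance(ds, DatasetDict)

train = load_dataset("my-username/my-dataset", split="train")
assert isinstance(train, Dataset)                          # single split -> Dataset

stream = load_dataset("my-username/my-dataset", split="train", streaming=True)
assert isinstance(stream, IterableDataset)                 # lazy, no download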
Usage Examples
Verification After Upload
from datasets import load_dataset
# After push_to_hub, verify the upload
ds = load_dataset("my-username/my-dataset")
# Check splits
print(ds)
# DatasetDict({
# train: Dataset({ features: ['text', 'label'], num_rows: 1000 })
# test: Dataset({ features: ['text', 'label'], num_rows: 200 })
# })
# Check features
print(ds["train"].features)
# Check specific examples
print(ds["train"][0])
# Verify a specific configuration
ds_en = load_dataset("my-username/my-dataset", name="en")
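Streaming Spot-Check
For large datasets, a few records can be sampled without a full download; a sketch, assuming a placeholder repository id and placeholder column names:
from datasets import load_dataset

stream = load_dataset("my-username/my-dataset", split="train", streaming=True)

# Inspect the first 10 examples without downloading the whole dataset
for i, example in enumerate(stream):
    assert "text" in example and "label" in example  # placeholder columns
    if i >= 9:
        break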