Implementation: Hugging Face Datasets load_dataset for Verification
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
A concrete tool, provided by the Hugging Face Datasets library, for verifying a dataset uploaded to the Hub by loading it back.
Description
load_dataset loads a dataset from the Hugging Face Hub (or a local directory). For verification purposes, it is called after push_to_hub to load the just-published dataset and confirm correctness. The function resolves the dataset builder from the repository structure, downloads and caches the data files, processes them into Arrow tables, and returns a DatasetDict (when no split is specified) or a single Dataset (when a split is specified). When streaming=True, it instead returns an IterableDataset (or IterableDatasetDict) that reads data on demand without downloading it up front. The function supports configuration selection, custom data files, revision pinning, and download mode control.
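As a minimal sketch of the download mode control mentioned above, forcing a re-download makes the verification read the freshly published files rather than a stale local cache (the repository id below is a placeholder):
from datasets import DownloadMode, load_dataset

# Force a fresh download so verification does not read a stale cache.
# "my-username/my-dataset" is a placeholder repository id.
ds = load_dataset(
    "my-username/my-dataset",
    download_mode=DownloadMode.FORCE_REDOWNLOAD,
)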
Usage
Use load_dataset after publishing to the Hub to verify that the dataset loads correctly and has the expected splits, features, and row counts. This round-trip verification confirms end-to-end data integrity, as sketched below.
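A sketch of how this round trip can be made programmatic; the repository id, expected row counts, and split names are placeholders:
from datasets import load_dataset

EXPECTED_ROWS = {"train": 1000, "test": 200}  # placeholder expectations

ds = load_dataset("my-username/my-dataset")  # placeholder repository id

# Verify that the expected splits and row counts survived the round trip
assert set(ds.keys()) == set(EXPECTED_ROWS), f"unexpected splits: {list(ds)}"
for split, n_rows in EXPECTED_ROWS.items():
    assert ds[split].num_rows == n_rows, f"{split}: {ds[split].num_rows} rows"

# Verify that all splits share one schema
assert ds["train"].features == ds["test"].features
print("round-trip verification passed")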
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/load.py
- Lines: 1278-1519
Signature
def load_dataset(
path: str,
name: Optional[str] = None,
data_dir: Optional[str] = None,
data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None,
split: Optional[Union[str, Split, list[str], list[Split]]] = None,
cache_dir: Optional[str] = None,
features: Optional[Features] = None,
download_config: Optional[DownloadConfig] = None,
download_mode: Optional[Union[DownloadMode, str]] = None,
verification_mode: Optional[Union[VerificationMode, str]] = None,
keep_in_memory: Optional[bool] = None,
save_infos: bool = False,
revision: Optional[Union[str, Version]] = None,
token: Optional[Union[bool, str]] = None,
streaming: bool = False,
num_proc: Optional[int] = None,
storage_options: Optional[dict] = None,
**config_kwargs,
) -> Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]:
Import
from datasets import load_dataset
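Relatedly, load_dataset can run its own integrity checks while generating the dataset via the verification_mode parameter; a sketch, with a placeholder repository id:
from datasets import VerificationMode, load_dataset

# Ask the loader to run all available integrity checks
# (e.g., generated split sizes) during dataset preparation.
ds = load_dataset(
    "my-username/my-dataset",  # placeholder repository id
    verification_mode=VerificationMode.ALL_CHECKS,
)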
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | str | Yes | Path or name of the dataset (e.g., "username/dataset_name" for the Hub, or a local directory path). |
| name | str | No | Configuration (subset) name. |
| split | str or Split | No | Which split to load. None returns all splits as a DatasetDict. |
| revision | str or Version | No | Version/branch/commit of the dataset to load. |
| token | str or bool | No | Authentication token for private datasets. |
| streaming | bool | No | Whether to stream data without downloading. Defaults to False. |
| **config_kwargs | — | No | Additional keyword arguments passed to the dataset builder. |
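A sketch combining several of the inputs above; the repository id, configuration name, and revision value are placeholders:
from datasets import load_dataset

# Pin the exact revision created by push_to_hub so the check targets
# the published commit; token=True reuses the locally saved token.
ds_train = load_dataset(
    "my-username/my-dataset",  # placeholder repository id
    name="en",                 # placeholder configuration name
    split="train",
    revision="main",           # or a specific commit sha
    token=True,
)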
Outputs
| Name | Type | Description |
|---|---|---|
| return | DatasetDict, Dataset, IterableDatasetDict, or IterableDataset | The loaded dataset; the concrete type depends on the split and streaming parameters. |
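A sketch of how the return type varies with split and streaming (placeholder repository id):
from datasets import Dataset, DatasetDict, IterableDataset, load_dataset

ds = load_dataset("my-username/my-dataset")                # no split -> DatasetDict
assert isinstance(ds, DatasetDict)

train = load_dataset("my-username/my-dataset", split="train")
assert isinstance(train, Dataset)                          # single split -> Dataset

stream = load_dataset("my-username/my-dataset", split="train", streaming=True)
assert isinstance(stream, IterableDataset)                 # lazy, no download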
Usage Examples
Verification After Upload
from datasets import load_dataset
# After push_to_hub, verify the upload
ds = load_dataset("my-username/my-dataset")
# Check splits
print(ds)
# DatasetDict({
# train: Dataset({ features: ['text', 'label'], num_rows: 1000 })
# test: Dataset({ features: ['text', 'label'], num_rows: 200 })
# })
# Check features
print(ds["train"].features)
# Check specific examples
print(ds["train"][0])
# Verify a specific configuration
ds_en = load_dataset("my-username/my-dataset", name="en")
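Streaming Spot-Check
For large datasets, a few records can be sampled without a full download; a sketch, assuming a placeholder repository id and placeholder column names:
from datasets import load_dataset

stream = load_dataset("my-username/my-dataset", split="train", streaming=True)

# Inspect the first 10 examples without downloading the whole dataset
for i, example in enumerate(stream):
    assert "text" in example and "label" in example  # placeholder columns
    if i >= 9:
        break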