
Implementation:Huggingface Datasets Load Dataset For Verification

From Leeroopedia
Domains: Data_Engineering, NLP
Last Updated: 2026-02-14 18:00 GMT

Overview

A concrete tool, provided by the Hugging Face Datasets library, for verifying a dataset uploaded to the Hub by loading it back.

Description

load_dataset loads a dataset from the Hugging Face Hub (or a local directory). For verification purposes, it is called after push_to_hub to load the just-published dataset and confirm correctness. The function resolves the dataset builder from the repository structure, downloads and caches data files, processes them into Arrow tables, and returns a DatasetDict (when no split is specified) or a single Dataset (when a split is specified). When streaming=True, it returns an iterable dataset without downloading. The function supports configuration selection, custom data files, revision pinning, and download mode control.
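
Because streaming avoids a full download, it makes a cheap first verification pass. The following is a minimal sketch, assuming a placeholder repository name:

from datasets import load_dataset

# streaming=True returns an iterable dataset and reads records lazily
# instead of downloading everything. "my-username/my-dataset" is a
# placeholder repository name.
stream = load_dataset("my-username/my-dataset", split="train", streaming=True)

# Peek at the first record to confirm the data is readable.
first = next(iter(stream))
print(first)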

Usage

Use load_dataset after publishing to the Hub to verify that the dataset loads correctly and has the expected splits, features, and row counts. This round-trip verification confirms end-to-end data integrity. A sketch of such a check appears below.
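
A minimal sketch of a round-trip check; the repository name and the expected split sizes are illustrative assumptions, not values from the source:

from datasets import load_dataset

# Expected split names and row counts (placeholders for illustration).
EXPECTED = {"train": 1000, "test": 200}

# Load the just-pushed dataset back from the Hub.
ds = load_dataset("my-username/my-dataset")

# Verify the split names match.
assert set(ds.keys()) == set(EXPECTED), f"unexpected splits: {list(ds.keys())}"

# Verify each split has the expected number of rows.
for split, expected_rows in EXPECTED.items():
    assert ds[split].num_rows == expected_rows, (
        f"{split}: expected {expected_rows} rows, got {ds[split].num_rows}"
    )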

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/load.py
  • Lines: 1278-1519

Signature

def load_dataset(
    path: str,
    name: Optional[str] = None,
    data_dir: Optional[str] = None,
    data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None,
    split: Optional[Union[str, Split, list[str], list[Split]]] = None,
    cache_dir: Optional[str] = None,
    features: Optional[Features] = None,
    download_config: Optional[DownloadConfig] = None,
    download_mode: Optional[Union[DownloadMode, str]] = None,
    verification_mode: Optional[Union[VerificationMode, str]] = None,
    keep_in_memory: Optional[bool] = None,
    save_infos: bool = False,
    revision: Optional[Union[str, Version]] = None,
    token: Optional[Union[bool, str]] = None,
    streaming: bool = False,
    num_proc: Optional[int] = None,
    storage_options: Optional[dict] = None,
    **config_kwargs,
) -> Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]:

Import

from datasets import load_dataset

I/O Contract

Inputs

Name | Type | Required | Description
path | str | Yes | Path or name of the dataset (e.g., "username/dataset_name" on the Hub, or a local directory path).
name | str | No | Configuration (subset) name.
split | str or Split | No | Which split to load. None returns all splits as a DatasetDict.
revision | str or Version | No | Version, branch, or commit of the dataset to load.
token | str or bool | No | Authentication token for private datasets.
streaming | bool | No | Whether to stream data without downloading. Defaults to False.
**config_kwargs | — | No | Additional keyword arguments passed to the dataset builder.
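
As a sketch of how these parameters might combine for a strict verification pass (the repository name and revision are placeholders; the parameter names come from the signature above):

from datasets import load_dataset

# Pin an exact revision and bypass any stale cache so the verification
# reflects what is actually on the Hub. Values here are placeholders.
ds = load_dataset(
    "my-username/my-dataset",
    revision="main",                    # branch, tag, or commit SHA
    download_mode="force_redownload",   # ignore the local cache
    verification_mode="all_checks",     # verify checksums and split sizes
    token=True,                         # use the stored token for private repos
)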

Outputs

Name | Type | Description
return | DatasetDict, Dataset, IterableDatasetDict, or IterableDataset | The loaded dataset; the type depends on the split and streaming parameters.
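
To illustrate the type contract, a verification script can branch on the return type directly; a small sketch with a placeholder repository name:

from datasets import Dataset, DatasetDict, load_dataset

# No split requested: all splits come back in a DatasetDict.
ds_all = load_dataset("my-username/my-dataset")
assert isinstance(ds_all, DatasetDict)

# A specific split requested: a single Dataset is returned.
ds_train = load_dataset("my-username/my-dataset", split="train")
assert isinstance(ds_train, Dataset)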

Usage Examples

Verification After Upload

from datasets import load_dataset

# After push_to_hub, verify the upload
ds = load_dataset("my-username/my-dataset")

# Check splits
print(ds)
# DatasetDict({
#     train: Dataset({ features: ['text', 'label'], num_rows: 1000 })
#     test: Dataset({ features: ['text', 'label'], num_rows: 200 })
# })

# Check features
print(ds["train"].features)

# Check specific examples
print(ds["train"][0])

# Verify a specific configuration
ds_en = load_dataset("my-username/my-dataset", name="en")

Related Pages

Implements Principle
