Implementation:Huggingface Transformers Load Dataset

From Leeroopedia
Revision as of 13:06, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Huggingface_Transformers_Load_Dataset.md)
Knowledge Sources
Domains: NLP, Training, Data Engineering
Last Updated: 2026-02-13 00:00 GMT

Overview

Concrete tool for loading datasets from the HuggingFace Hub, local files, or remote URLs. It is provided by the HuggingFace datasets library and used extensively within the HuggingFace Transformers training workflow.

Description

datasets.load_dataset() is the primary entry point for acquiring data in the HuggingFace ecosystem. It supports thousands of datasets hosted on the HuggingFace Hub as well as local files in CSV, JSON, Parquet, Arrow, and text formats. By default, the function returns a DatasetDict when no split is specified, or a single Dataset when a specific split is requested. With streaming=True it returns the corresponding IterableDatasetDict or IterableDataset instead. In the default (non-streaming) mode, it downloads the data and caches it on disk as Apache Arrow files for efficient, memory-mapped access.

This is a wrapper doc because the function is defined in the external datasets library (not in the transformers repository itself), but it is the standard way to load data for Trainer-based training workflows and is used throughout the Transformers test suite and examples.

Usage

Use load_dataset() as the first step of any Trainer-based training pipeline to load training and evaluation data. It should be called before tokenization and before Trainer initialization.

Code Reference

Source Location

  • Repository: datasets (external)
  • File: datasets/load.py (external library)
  • Test usage: tests/trainer/test_trainer.py (lines 120-140 for imports and setup)

Signature

def load_dataset(
    path: str,
    name: Optional[str] = None,
    data_dir: Optional[str] = None,
    data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None,
    split: Optional[Union[str, Split]] = None,
    cache_dir: Optional[str] = None,
    features: Optional[Features] = None,
    download_config: Optional[DownloadConfig] = None,
    download_mode: Optional[Union[DownloadMode, str]] = None,
    verification_mode: Optional[Union[VerificationMode, str]] = None,
    keep_in_memory: Optional[bool] = None,
    save_infos: bool = False,
    revision: Optional[Union[str, Version]] = None,
    token: Optional[Union[bool, str]] = None,
    streaming: bool = False,
    num_proc: Optional[int] = None,
    storage_options: Optional[dict] = None,
    trust_remote_code: Optional[bool] = None,
    **config_kwargs,
) -> Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]:

Import

from datasets import load_dataset

I/O Contract

Inputs

Name | Type | Required | Description
path | str | Yes | Path or name of the dataset: a Hub identifier such as "glue", a local path, or a file-format builder name such as "csv"
name | str | No | Name of the dataset configuration (e.g., "mrpc" for GLUE)
split | str or Split | No | Which split to load (e.g., "train", "validation", "test"). If None, all splits are returned in a DatasetDict
data_files | str, sequence, or mapping | No | Path(s) to the data files when loading custom data; a mapping assigns files to split names
data_dir | str | No | Directory containing the data files for the dataset configuration
cache_dir | str | No | Directory in which downloaded and prepared datasets are cached
streaming | bool | No | If True, returns an IterableDataset (or an IterableDatasetDict when no split is given) that reads examples lazily instead of preparing the full dataset on disk first
num_proc | int | No | Number of processes used when downloading and preparing the dataset locally
token | str or bool | No | HuggingFace Hub authentication token for private or gated datasets; True uses the locally stored token
trust_remote_code | bool | No | Whether to allow executing dataset loading scripts from the Hub (relevant only in library versions that support script-based datasets)

Outputs

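Name | Type | Description
dataset | DatasetDict, Dataset, IterableDatasetDict, or IterableDataset | A DatasetDict mapping split names to Dataset objects (when split is None), or a single Dataset (when split is specified). Each Dataset is backed by an Apache Arrow table. With streaming=True, the IterableDatasetDict or IterableDataset counterpart is returned instead.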

Usage Examples

Basic Usage

from datasets import load_dataset

# Load a dataset from the HuggingFace Hub
dataset = load_dataset("imdb")
train_dataset = dataset["train"]
eval_dataset = dataset["test"]

Loading a Specific Split

from datasets import load_dataset

# Load only the training split
train_dataset = load_dataset("glue", "mrpc", split="train")

Loading from Local Files

from datasets import load_dataset

# Load from local CSV files
dataset = load_dataset("csv", data_files={
    "train": "data/train.csv",
    "validation": "data/val.csv",
})

Streaming Large Datasets

from datasets import load_dataset

# Stream a large dataset without downloading it fully
dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
for example in dataset:
    print(example["text"])
    break

Related Pages

Implements Principle
