Implementation:Huggingface Transformers Load Dataset

Knowledge Sources	Transformers, Transformers Docs, Datasets Docs
Domains	NLP, Training, Data Engineering
Last Updated	2026-02-13 00:00 GMT

Overview

A concrete tool for loading datasets from the HuggingFace Hub, local files, or remote URLs. It is provided by the HuggingFace datasets library and used extensively within HuggingFace Transformers training workflows.

Description

datasets.load_dataset() is the primary entry point for acquiring data in the HuggingFace ecosystem. It supports thousands of datasets hosted on the HuggingFace Hub as well as local files in CSV, JSON, Parquet, Arrow, and text formats. The function returns a DatasetDict (when no split is specified) or a single Dataset object (when a specific split is requested); with streaming=True it returns the iterable counterparts, IterableDatasetDict or IterableDataset. In the default (non-streaming) mode, it downloads the data and caches it on disk as Apache Arrow files for efficient, memory-mapped access.

This is a wrapper doc because the function is defined in the external datasets library (not in the transformers repository itself), but it is the standard way to load data for Trainer-based training workflows and is used throughout the Transformers test suite and examples.

Usage

Use load_dataset() as the first step of any Trainer-based training pipeline to load training and evaluation data. It should be called before tokenization and before Trainer initialization.

Code Reference

Source Location

Repository: datasets (external)
File: datasets/load.py (external library)
Test usage: tests/trainer/test_trainer.py (lines 120-140 for imports and setup)

Signature

def load_dataset(
    path: str,
    name: Optional[str] = None,
    data_dir: Optional[str] = None,
    data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None,
    split: Optional[Union[str, Split]] = None,
    cache_dir: Optional[str] = None,
    features: Optional[Features] = None,
    download_config: Optional[DownloadConfig] = None,
    download_mode: Optional[Union[DownloadMode, str]] = None,
    verification_mode: Optional[Union[VerificationMode, str]] = None,
    keep_in_memory: Optional[bool] = None,
    save_infos: bool = False,
    revision: Optional[Union[str, Version]] = None,
    token: Optional[Union[bool, str]] = None,
    streaming: bool = False,
    num_proc: Optional[int] = None,
    storage_options: Optional[dict] = None,
    trust_remote_code: Optional[bool] = None,
    **config_kwargs,
) -> Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]:

Import

from datasets import load_dataset

I/O Contract

Inputs

Name	Type	Required	Description
path	str	Yes	Path or name of the dataset (Hub identifier like "glue", or local path, or file format like "csv")
name	str	No	Name of the dataset configuration (e.g., "mrpc" for GLUE)
split	str	No	Which split to load (e.g., "train", "validation", "test"). If None, returns a DatasetDict with all splits
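data_files	str, list, or dict	No	Path(s) to the data file(s) when loading custom data. A dict maps split names to paths; a single path or a list of paths is loaded into the "train" split
data_dir	str	No	Directory containing the data files; passed to the dataset configuration as its data_dir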
cache_dir	str	No	Directory to cache downloaded datasets
streaming	bool	No	If True, returns an IterableDataset (or an IterableDatasetDict when no split is specified) for memory-efficient processing of large datasets
num_proc	int	No	Number of processes for parallel data loading
token	str or bool	No	HuggingFace Hub authentication token for private datasets
trust_remote_code	bool	No	Whether to allow executing dataset loading scripts from the Hub

Outputs

Name	Type	Description
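dataset	DatasetDict, Dataset, IterableDatasetDict, or IterableDataset	A DatasetDict mapping split names to Dataset objects (when split is None), or a single Dataset (when split is specified). Each Dataset is backed by an Apache Arrow table. With streaming=True, the iterable counterparts (IterableDatasetDict or IterableDataset) are returned instead.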

Usage Examples

Basic Usage

from datasets import load_dataset

# Load a dataset from the HuggingFace Hub
dataset = load_dataset("imdb")
train_dataset = dataset["train"]
eval_dataset = dataset["test"]

Loading a Specific Split

from datasets import load_dataset

# Load only the training split
train_dataset = load_dataset("glue", "mrpc", split="train")

Loading from Local Files

from datasets import load_dataset

# Load from local CSV files
dataset = load_dataset("csv", data_files={
    "train": "data/train.csv",
    "validation": "data/val.csv",
})

Streaming Large Datasets

from datasets import load_dataset

# Stream a large dataset without downloading it fully
dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
for example in dataset:
    print(example["text"])
    break

Related Pages

Implements Principle

Principle:Huggingface_Transformers_Data_Loading
