Implementation:Huggingface Transformers Load Dataset
| Knowledge Sources | |
|---|---|
| Domains | NLP, Training, Data Engineering |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Concrete tool for loading datasets from the HuggingFace Hub, local files, or remote URLs. It is provided by the HuggingFace datasets library and used extensively within the HuggingFace Transformers training workflow.
Description
datasets.load_dataset() is the primary entry point for acquiring data in the HuggingFace ecosystem. It supports thousands of datasets hosted on the HuggingFace Hub as well as local files in CSV, JSON, Parquet, Arrow, and text formats. By default the function returns a DatasetDict (when no split is specified) or a single Dataset (when a specific split is requested); with streaming=True it returns the lazy counterparts, IterableDatasetDict or IterableDataset. In the default, non-streaming mode it downloads the data and caches it as Apache Arrow files for efficient, memory-mapped access.
This is a wrapper doc because the function is defined in the external datasets library (not in the transformers repository itself), but it is the standard way to load data for Trainer-based training workflows and is used throughout the Transformers test suite and examples.
Usage
Use load_dataset() as the first step of any Trainer-based training pipeline to load training and evaluation data. It should be called before tokenization and before Trainer initialization.
Code Reference
Source Location
- Repository: datasets (external)
- File: datasets/load.py (external library)
- Test usage: tests/trainer/test_trainer.py (lines 120-140 for imports and setup)
Signature
def load_dataset(
    path: str,
    name: Optional[str] = None,
    data_dir: Optional[str] = None,
    data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None,
    split: Optional[Union[str, Split]] = None,
    cache_dir: Optional[str] = None,
    features: Optional[Features] = None,
    download_config: Optional[DownloadConfig] = None,
    download_mode: Optional[Union[DownloadMode, str]] = None,
    verification_mode: Optional[Union[VerificationMode, str]] = None,
    keep_in_memory: Optional[bool] = None,
    save_infos: bool = False,
    revision: Optional[Union[str, Version]] = None,
    token: Optional[Union[bool, str]] = None,
    streaming: bool = False,
    num_proc: Optional[int] = None,
    storage_options: Optional[dict] = None,
    trust_remote_code: Optional[bool] = None,
    **config_kwargs,
) -> Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]:
Import
from datasets import load_dataset
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | str | Yes | Path or name of the dataset: a Hub identifier such as "glue", a local path, or the name of a generic builder such as "csv" |
| name | str | No | Name of the dataset configuration (e.g., "mrpc" for GLUE) |
| split | str or Split | No | Which split to load (e.g., "train", "validation", "test"). If None, all splits are returned in a DatasetDict |
| data_files | str, sequence of str, or dict | No | Path(s) to the source data files. A dict maps split names to one or more paths |
| data_dir | str | No | Directory containing the data files for the dataset configuration |
| cache_dir | str | No | Directory to cache downloaded datasets |
| streaming | bool | No | If True, data is read lazily and an IterableDataset (or IterableDatasetDict when no split is given) is returned, which suits datasets too large to download in full |
| num_proc | int | No | Number of processes to use when downloading and preparing the dataset locally |
| token | str or bool | No | HuggingFace Hub authentication token for private datasets |
| trust_remote_code | bool | No | Whether to allow executing dataset loading scripts from the Hub |
Outputs
| Name | Type | Description |
|---|---|---|
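| dataset | DatasetDict, Dataset, IterableDatasetDict, or IterableDataset | A DatasetDict mapping split names to Dataset objects (when split is None), or a single Dataset (when split is specified); each Dataset is backed by an Apache Arrow table. With streaming=True, the iterable counterparts (IterableDatasetDict or IterableDataset) are returned instead. |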
Usage Examples
Basic Usage
from datasets import load_dataset
# Load a dataset from the HuggingFace Hub
dataset = load_dataset("imdb")
train_dataset = dataset["train"]
eval_dataset = dataset["test"]
Loading a Specific Split
from datasets import load_dataset
# Load only the training split
train_dataset = load_dataset("glue", "mrpc", split="train")
Loading from Local Files
from datasets import load_dataset
# Load from local CSV files
dataset = load_dataset("csv", data_files={
    "train": "data/train.csv",
    "validation": "data/val.csv",
})
Streaming Large Datasets
from datasets import load_dataset
# Stream a large dataset without downloading it fully
dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
for example in dataset:
    print(example["text"])
    break