Implementation:Huggingface Datasets PyTorch DataLoader
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
A concrete pattern for wrapping a HuggingFace Dataset in a PyTorch DataLoader (from the torch.utils.data module) for batched, parallelized training.
Description
HuggingFace Dataset objects implement the __getitem__ and __len__ protocols required by PyTorch's map-style dataset interface. When combined with Dataset.with_format("torch"), a torch.utils.data.DataLoader can iterate over the dataset and produce batched torch tensors. This is a wrapper documentation page: HuggingFace Datasets does not provide its own DataLoader class but instead ensures compatibility with PyTorch's standard DataLoader. For map-style datasets, the standard DataLoader options apply: batch_size, shuffle, num_workers, pin_memory, drop_last, a custom collate_fn (e.g., DataCollatorWithPadding), and custom sampler objects. For streaming scenarios, IterableDataset objects can also be wrapped in a DataLoader, with num_workers providing shard-parallel loading. Because PyTorch does not accept shuffle=True or a sampler for iterable-style datasets, shuffling in that case is done on the dataset itself (e.g., IterableDataset.shuffle).
Usage
Use this integration pattern whenever you are training or evaluating PyTorch models and need batched, optionally parallelized iteration over a HuggingFace Dataset. This is the standard approach used by the HuggingFace Trainer and in custom PyTorch training loops.
Code Reference
Source Location
- Repository: pytorch (external)
- File: torch/utils/data/dataloader.py
- Lines: N/A (external library)
Signature
# PyTorch's DataLoader wrapping a HuggingFace Dataset
torch.utils.data.DataLoader(
    dataset,  # HuggingFace Dataset with format set to "torch"
    batch_size=1,
    shuffle=False,
    sampler=None,
    num_workers=0,
    collate_fn=None,
    pin_memory=False,
    drop_last=False,
    ...
)
Import
import torch
from datasets import Dataset
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset | Dataset | Yes | A HuggingFace Dataset, typically with .with_format("torch") applied. |
| batch_size | int | No | Number of samples per batch. Defaults to 1. |
| shuffle | bool | No | Whether to shuffle indices at every epoch. Defaults to False. |
| num_workers | int | No | Number of subprocesses for data loading. 0 means main process. Defaults to 0. |
| collate_fn | Optional[Callable] | No | Custom function to collate samples into a batch (e.g., DataCollatorWithPadding). |
| pin_memory | bool | No | If True, copy tensors to CUDA pinned memory before returning. Defaults to False. |
| drop_last | bool | No | Drop the last incomplete batch. Defaults to False. |
Outputs
| Name | Type | Description |
|---|---|---|
| batches | Iterator[dict[str, torch.Tensor]] | An iterator yielding dicts of batched torch tensors (or whatever the collate_fn returns). |
Usage Examples
Basic Usage
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

# Load the raw split and the tokenizer
raw = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Tokenize to a fixed length so the default collate function can stack samples.
# padding=True would only pad within each map() batch, leaving rows of unequal length.
# max_length=128 is an illustrative choice.
ds = raw.map(
    lambda x: tokenizer(x["text"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)

# Set format to PyTorch tensors
ds = ds.with_format("torch")

# Create a DataLoader
dataloader = torch.utils.data.DataLoader(
    ds,
    batch_size=16,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
)

# Training loop
for batch in dataloader:
    input_ids = batch["input_ids"]  # torch.Tensor of shape [16, 128] (the last batch may be smaller)
    labels = batch["label"]  # torch.Tensor of shape [16]
    # ... forward pass, loss, backward ...

# With a custom collate function for dynamic padding: tokenize without padding
# and let the collator pad each batch to its longest sequence
ds_unpadded = raw.map(lambda x: tokenizer(x["text"], truncation=True), batched=True)
ds_unpadded = ds_unpadded.with_format("torch", columns=["input_ids", "attention_mask", "label"])
collator = DataCollatorWithPadding(tokenizer, return_tensors="pt")
dataloader = torch.utils.data.DataLoader(ds_unpadded, batch_size=16, collate_fn=collator)