Implementation:Huggingface Datasets PyTorch DataLoader

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, NLP
Last Updated 2026-02-14 18:00 GMT

Overview

A concrete pattern for wrapping a HuggingFace Dataset in a PyTorch DataLoader (from PyTorch's torch.utils.data module) to get batched, optionally parallelized iteration during training and evaluation.

Description

HuggingFace Dataset objects implement the __getitem__ and __len__ protocols required by PyTorch's map-style dataset interface. After Dataset.with_format("torch"), which returns a new Dataset whose rows come back as torch tensors, a torch.utils.data.DataLoader can iterate over the dataset and produce batched tensors. This page documents a wrapper pattern: HuggingFace Datasets does not provide its own DataLoader class and instead stays compatible with PyTorch's standard DataLoader. For map-style datasets the usual DataLoader options apply: batch_size, shuffle, num_workers, pin_memory, drop_last, a custom collate_fn (e.g., DataCollatorWithPadding), and custom sampler objects.

For streaming scenarios, IterableDataset objects can also be wrapped in a DataLoader, with num_workers used for shard-parallel loading. PyTorch rejects shuffle=True and custom samplers for iterable-style datasets, so shuffling in that case is done on the dataset itself (IterableDataset.shuffle) rather than through the DataLoader.

Usage

Use this integration pattern whenever you are training or evaluating PyTorch models and need batched, optionally parallelized iteration over a HuggingFace Dataset. This is the standard approach used by the HuggingFace Trainer and in custom PyTorch training loops.

Code Reference

Source Location

  • Repository: pytorch (external)
  • File: torch/utils/data/dataloader.py
  • Lines: N/A (external library)

Signature

# PyTorch's DataLoader wrapping a HuggingFace Dataset
torch.utils.data.DataLoader(
    dataset,           # HuggingFace Dataset with format set to "torch"
    batch_size=1,
    shuffle=False,
    sampler=None,
    num_workers=0,
    collate_fn=None,
    pin_memory=False,
    drop_last=False,
    ...
)
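How a few of these arguments interact is easiest to see on toy data. The sketch below is illustrative only: a plain Python list of tensors stands in for a formatted HuggingFace Dataset (an assumption made to keep the example small), and the sizes are arbitrary.

```python
import torch
from torch.utils.data import DataLoader, SequentialSampler

# Stand-in for a formatted dataset: any object with __getitem__ and __len__ works.
data = [torch.tensor([i]) for i in range(10)]

# drop_last decides whether the final, smaller batch is kept.
keep_tail = DataLoader(data, batch_size=4, drop_last=False)
drop_tail = DataLoader(data, batch_size=4, drop_last=True)
print(len(keep_tail))  # 3, i.e. ceil(10 / 4)
print(len(drop_tail))  # 2, i.e. floor(10 / 4)

# shuffle=True reshuffles indices each epoch; a seeded generator makes the order reproducible.
gen = torch.Generator().manual_seed(0)
shuffled = DataLoader(data, batch_size=4, shuffle=True, generator=gen)
seen = torch.cat(list(shuffled)).flatten().tolist()

# sampler and shuffle=True cannot be combined: the sampler already defines the order.
conflict = None
try:
    DataLoader(data, sampler=SequentialSampler(data), shuffle=True)
except ValueError as err:
    conflict = str(err)
print(conflict)
```

If a custom order is needed, pass a sampler and leave shuffle at its default.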

Import

import torch
from datasets import Dataset

I/O Contract

Inputs

Name | Type | Required | Description
dataset | Dataset | Yes | A HuggingFace Dataset, typically with .with_format("torch") applied.
batch_size | int | No | Number of samples per batch. Defaults to 1.
shuffle | bool | No | Whether to reshuffle indices at every epoch. Defaults to False.
sampler | Optional[Sampler] | No | Custom strategy for drawing indices. Cannot be combined with shuffle=True. Defaults to None.
num_workers | int | No | Number of subprocesses for data loading. 0 means data is loaded in the main process. Defaults to 0.
collate_fn | Optional[Callable] | No | Custom function that merges a list of samples into a batch (e.g., DataCollatorWithPadding).
pin_memory | bool | No | If True, copy tensors into pinned (page-locked) host memory before returning them, which speeds up transfer to a CUDA device. Defaults to False.
drop_last | bool | No | Drop the last batch if it is smaller than batch_size. Defaults to False.
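The collate_fn contract is simple: it receives a list of samples and returns one batch. The sketch below is a hand-written stand-in for what a padding collator does, not the DataCollatorWithPadding implementation. The function name pad_collate, the pad id 0, and the toy token ids are all hypothetical.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Hypothetical pre-tokenized rows with different sequence lengths.
rows = [
    {"input_ids": torch.tensor([101, 7, 102]), "label": torch.tensor(1)},
    {"input_ids": torch.tensor([101, 7, 8, 9, 102]), "label": torch.tensor(0)},
    {"input_ids": torch.tensor([101, 102]), "label": torch.tensor(1)},
]


def pad_collate(samples, pad_id=0):
    """Pad input_ids to the longest sequence in this batch only."""
    ids = [s["input_ids"] for s in samples]
    input_ids = pad_sequence(ids, batch_first=True, padding_value=pad_id)
    # 1 marks real tokens, 0 marks padding positions.
    attention_mask = pad_sequence(
        [torch.ones_like(x) for x in ids], batch_first=True, padding_value=0
    )
    labels = torch.stack([s["label"] for s in samples])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


loader = DataLoader(rows, batch_size=2, collate_fn=pad_collate)
shapes = [tuple(batch["input_ids"].shape) for batch in loader]
print(shapes)  # [(2, 5), (1, 2)]
```

Each batch is padded only to its own longest row, which is why the two batches above have different widths.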

Outputs

Name | Type | Description
batches | Iterator[dict[str, torch.Tensor]] | An iterator yielding dicts of batched torch tensors (or whatever the collate_fn returns).

Usage Examples

Basic Usage

import torch
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

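# Load and tokenize
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# Pad every row to the same length (the tokenizer's model maximum). padding=True would only pad
# to the longest row within each map() batch, so rows drawn from different map() batches could
# have different lengths and the default collate function would fail to stack them.
ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, padding="max_length"), batched=True)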

# Set format to PyTorch tensors
ds = ds.with_format("torch")

# Create a DataLoader
dataloader = torch.utils.data.DataLoader(
    ds,
    batch_size=16,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
)

# Training loop
for batch in dataloader:
    input_ids = batch["input_ids"]       # torch.Tensor of shape [16, seq_len]
    labels = batch["label"]              # torch.Tensor of shape [16]
    # ... forward pass, loss, backward ...

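# With a custom collate function for dynamic padding: tokenize WITHOUT padding
# and let the collator pad each batch to its own longest sequence.
raw = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
ds_unpadded = raw.map(lambda x: tokenizer(x["text"], truncation=True), batched=True)
ds_unpadded = ds_unpadded.with_format("torch", columns=["input_ids", "attention_mask", "label"])
collator = DataCollatorWithPadding(tokenizer, return_tensors="pt")
# Note: DataCollatorWithPadding renames the "label" key to "labels" in each batch.
dataloader = torch.utils.data.DataLoader(ds_unpadded, batch_size=16, collate_fn=collator)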

Related Pages

Implements Principle

Requires Environment
