Implementation: Hugging Face Diffusers LoRA Dataset Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Diffusion_Models, Data_Preprocessing, Training_Pipelines |
| Last Updated | 2026-02-13 21:00 GMT |
Overview
Concrete tool for loading, preprocessing, and batching image-caption datasets for LoRA fine-tuning of text-to-image diffusion models, as implemented in the Diffusers training examples.
Description
This pattern combines Hugging Face datasets for data loading, torchvision.transforms for image preprocessing, and the model's tokenizer for caption processing into a complete data pipeline. The pipeline supports loading from the Hugging Face Hub by dataset name or from local directories using the imagefolder loader. Images are transformed through a configurable chain of resize, crop, flip, and normalization operations. Captions are tokenized using the CLIP tokenizer with padding and truncation to a fixed maximum length.
The dataset's with_transform method applies the preprocessing lazily (on the fly) rather than materializing all transformed examples in memory. A custom collate function stacks individual examples into batches suitable for the training loop. The accelerator.main_process_first() context manager lets the main process run dataset preparation steps (such as shuffling and subsetting) first while the other processes wait, so they can then reuse the cached results instead of repeating the work.
Usage
Use this dataset pipeline when:
- Fine-tuning Stable Diffusion with LoRA on custom image-caption data
- Loading datasets from the Hugging Face Hub for training
- Preparing local image directories with the imagefolder format
- You need configurable image transforms and column name mapping
Code Reference
Source Location
- Repository: diffusers
- File: examples/text_to_image/train_text_to_image_lora.py (lines 604-714)
Signature
# Dataset loading
dataset = load_dataset(
    args.dataset_name,
    args.dataset_config_name,
    cache_dir=args.cache_dir,
    data_dir=args.train_data_dir,
)

# Image transforms
train_transforms = transforms.Compose([
    transforms.Resize(args.resolution, interpolation=interpolation),
    transforms.CenterCrop(args.resolution),  # or RandomCrop
    transforms.RandomHorizontalFlip(),  # optional
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),
])

# Caption tokenization
def tokenize_captions(examples, is_train=True):
    ...
    inputs = tokenizer(
        captions, max_length=tokenizer.model_max_length,
        padding="max_length", truncation=True, return_tensors="pt",
    )
    return inputs.input_ids
Import
from datasets import load_dataset
from torchvision import transforms
from transformers import CLIPTokenizer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset_name | str | Yes (or train_data_dir) | Name of the dataset on the Hugging Face Hub (e.g., "lambdalabs/naruto-blip-captions"). |
| dataset_config_name | str | No | Configuration name for datasets with multiple configs. |
| train_data_dir | str | No | Path to local training data directory. Used as data_dir with Hub datasets or as the root for imagefolder loading. |
| image_column | str | No | Name of the column containing images. Auto-detected if not specified. |
| caption_column | str | No | Name of the column containing text captions. Auto-detected if not specified. |
| resolution | int | No | Target image resolution (height and width). Default: 512. |
| center_crop | bool | No | Use center crop instead of random crop. Default: False. |
| random_flip | bool | No | Apply random horizontal flip augmentation. Default: False. |
| max_train_samples | int | No | Limit the number of training samples for debugging. |
Outputs
| Name | Type | Description |
|---|---|---|
| train_dataloader | torch.utils.data.DataLoader | DataLoader yielding batches with "pixel_values" (shape [B, 3, H, W], range [-1, 1]) and "input_ids" (shape [B, max_length], int64 token IDs). |
Usage Examples
Basic Usage
from datasets import load_dataset
from torchvision import transforms
from transformers import CLIPTokenizer
import torch
# Load tokenizer
tokenizer = CLIPTokenizer.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    subfolder="tokenizer",
)

# Load dataset from Hub
dataset = load_dataset("lambdalabs/naruto-blip-captions")

# Define image transforms
resolution = 512
train_transforms = transforms.Compose([
    transforms.Resize(resolution, interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.CenterCrop(resolution),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),
])

# Define preprocessing function
def preprocess_train(examples):
    images = [image.convert("RGB") for image in examples["image"]]
    examples["pixel_values"] = [train_transforms(image) for image in images]
    captions = examples["text"]
    inputs = tokenizer(
        captions, max_length=tokenizer.model_max_length,
        padding="max_length", truncation=True, return_tensors="pt",
    )
    examples["input_ids"] = inputs.input_ids
    return examples

# Apply transforms lazily
train_dataset = dataset["train"].with_transform(preprocess_train)

# Create collate function and DataLoader
def collate_fn(examples):
    pixel_values = torch.stack([e["pixel_values"] for e in examples])
    pixel_values = pixel_values.to(memory_format=torch.contiguous_format).float()
    input_ids = torch.stack([e["input_ids"] for e in examples])
    return {"pixel_values": pixel_values, "input_ids": input_ids}

train_dataloader = torch.utils.data.DataLoader(
    train_dataset, shuffle=True, collate_fn=collate_fn, batch_size=4,
)
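To sanity-check the I/O contract from the Outputs table without downloading a model or dataset, the collate function can be exercised directly on stand-in tensors (random data; the shapes assume the default resolution of 512 and CLIP's max length of 77):

```python
import torch

def collate_fn(examples):
    # Same collate logic as above: stack per-example tensors into a batch.
    pixel_values = torch.stack([e["pixel_values"] for e in examples])
    pixel_values = pixel_values.to(memory_format=torch.contiguous_format).float()
    input_ids = torch.stack([e["input_ids"] for e in examples])
    return {"pixel_values": pixel_values, "input_ids": input_ids}

# Fake preprocessed examples standing in for transformed images and token IDs.
examples = [
    {"pixel_values": torch.randn(3, 512, 512),
     "input_ids": torch.zeros(77, dtype=torch.int64)}
    for _ in range(4)
]
batch = collate_fn(examples)
```

The resulting batch matches what the training loop expects: float pixel_values of shape [B, 3, H, W] and int64 input_ids of shape [B, max_length].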