
Implementation:Huggingface Diffusers LoRA Dataset Pipeline

From Leeroopedia
Knowledge Sources
Domains Diffusion_Models, Data_Preprocessing, Training_Pipelines
Last Updated 2026-02-13 21:00 GMT

Overview

A concrete pipeline pattern for loading, preprocessing, and batching image-caption datasets for LoRA fine-tuning of text-to-image diffusion models, as implemented in the Diffusers training examples.

Description

This pattern combines Hugging Face datasets for data loading, torchvision.transforms for image preprocessing, and the model's tokenizer for caption processing into a complete data pipeline. The pipeline supports loading from the Hugging Face Hub by dataset name or from local directories using the imagefolder loader. Images are transformed through a configurable chain of resize, crop, flip, and normalization operations. Captions are tokenized using the CLIP tokenizer with padding and truncation to a fixed maximum length.

The dataset's with_transform method applies the preprocessing lazily (on the fly) rather than materializing every transformed example in memory. A custom collate function stacks individual examples into batches suitable for the training loop. The accelerator.main_process_first() context manager lets the main process run dataset preparation steps (such as shuffling and subsetting) before the other processes proceed, so each worker reuses the cached result instead of recomputing it.

Usage

Use this dataset pipeline when:

  • Fine-tuning Stable Diffusion with LoRA on custom image-caption data
  • Loading datasets from the Hugging Face Hub for training
  • Preparing local image directories with the imagefolder format
  • You need configurable image transforms and column name mapping

Code Reference

Source Location

  • Repository: diffusers
  • File: examples/text_to_image/train_text_to_image_lora.py
  • Lines: 604-714

Signature

# Dataset loading
dataset = load_dataset(
    args.dataset_name,
    args.dataset_config_name,
    cache_dir=args.cache_dir,
    data_dir=args.train_data_dir,
)

# Image transforms
train_transforms = transforms.Compose([
    transforms.Resize(args.resolution, interpolation=interpolation),
    transforms.CenterCrop(args.resolution),  # or RandomCrop
    transforms.RandomHorizontalFlip(),       # optional
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),
])

# Caption tokenization
def tokenize_captions(examples, is_train=True):
    ...
    inputs = tokenizer(
        captions, max_length=tokenizer.model_max_length,
        padding="max_length", truncation=True, return_tensors="pt"
    )
    return inputs.input_ids

Import

from datasets import load_dataset
from torchvision import transforms
from transformers import CLIPTokenizer

I/O Contract

Inputs

  • dataset_name (str, required unless train_data_dir is given): Name of the dataset on the Hugging Face Hub (e.g., "lambdalabs/naruto-blip-captions").
  • dataset_config_name (str, optional): Configuration name for datasets with multiple configs.
  • train_data_dir (str, optional): Path to the local training data directory. Used as data_dir with Hub datasets or as the root for imagefolder loading.
  • image_column (str, optional): Name of the column containing images. Auto-detected if not specified.
  • caption_column (str, optional): Name of the column containing text captions. Auto-detected if not specified.
  • resolution (int, optional): Target image resolution (height and width). Default: 512.
  • center_crop (bool, optional): Use center crop instead of random crop. Default: False.
  • random_flip (bool, optional): Apply random horizontal flip augmentation. Default: False.
  • max_train_samples (int, optional): Limit the number of training samples for debugging.
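The column auto-detection can be sketched as a positional fallback with validation. This is a simplified stand-in: the actual script also consults a DATASET_NAME_MAPPING table of known Hub datasets before falling back to the first columns.

```python
def resolve_columns(column_names, image_column=None, caption_column=None):
    """Pick image/caption columns, falling back to positional defaults."""
    # Fall back to the first two columns when nothing is specified
    image_column = image_column or column_names[0]
    caption_column = caption_column or column_names[1]
    for name in (image_column, caption_column):
        if name not in column_names:
            raise ValueError(f"column '{name}' not found in {column_names}")
    return image_column, caption_column

print(resolve_columns(["image", "text"]))                             # ('image', 'text')
print(resolve_columns(["img", "caption"], caption_column="caption"))  # ('img', 'caption')
```

A user-supplied column name that does not exist fails fast here, mirroring the script's early validation before any expensive preprocessing starts.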

Outputs

  • train_dataloader (torch.utils.data.DataLoader): Yields batches with "pixel_values" (shape [B, 3, H, W], range [-1, 1]) and "input_ids" (shape [B, max_length], int64 token IDs).

Usage Examples

Basic Usage

from datasets import load_dataset
from torchvision import transforms
from transformers import CLIPTokenizer
import torch

# Load tokenizer
tokenizer = CLIPTokenizer.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    subfolder="tokenizer",
)

# Load dataset from Hub
dataset = load_dataset("lambdalabs/naruto-blip-captions")

# Define image transforms
resolution = 512
train_transforms = transforms.Compose([
    transforms.Resize(resolution, interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.CenterCrop(resolution),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),
])

# Define preprocessing function
def preprocess_train(examples):
    images = [image.convert("RGB") for image in examples["image"]]
    examples["pixel_values"] = [train_transforms(image) for image in images]
    captions = examples["text"]
    inputs = tokenizer(
        captions, max_length=tokenizer.model_max_length,
        padding="max_length", truncation=True, return_tensors="pt",
    )
    examples["input_ids"] = inputs.input_ids
    return examples

# Apply transforms lazily
train_dataset = dataset["train"].with_transform(preprocess_train)

# Create collate function and DataLoader
def collate_fn(examples):
    pixel_values = torch.stack([e["pixel_values"] for e in examples])
    pixel_values = pixel_values.to(memory_format=torch.contiguous_format).float()
    input_ids = torch.stack([e["input_ids"] for e in examples])
    return {"pixel_values": pixel_values, "input_ids": input_ids}

train_dataloader = torch.utils.data.DataLoader(
    train_dataset, shuffle=True, collate_fn=collate_fn, batch_size=4,
)
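Downstream, each batch feeds the denoising objective: pixel_values are encoded to latents, noise is added at a sampled timestep, and the UNet is trained to predict that noise. A torch-only sketch of the noising step, using synthetic tensors in place of a real batch and a simplified linear-beta schedule standing in for diffusers' DDPMScheduler.add_noise:

```python
import torch

batch = {
    "pixel_values": torch.randn(4, 3, 512, 512),        # stand-in for a real batch
    "input_ids": torch.zeros(4, 77, dtype=torch.int64),
}

# Simplified linear-beta DDPM schedule (stand-in for DDPMScheduler)
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

latents = batch["pixel_values"]  # a real pipeline would use vae.encode(...) here
noise = torch.randn_like(latents)
timesteps = torch.randint(0, 1000, (latents.shape[0],))

# Forward diffusion: sqrt(a_t) * x_0 + sqrt(1 - a_t) * eps, per-example timestep
a = alphas_cumprod[timesteps].view(-1, 1, 1, 1)
noisy_latents = a.sqrt() * latents + (1.0 - a).sqrt() * noise
print(noisy_latents.shape)  # torch.Size([4, 3, 512, 512])
```

The batch shapes produced by the collate function above are exactly what this step consumes, which is why the DataLoader contract pins down both tensor shapes and the pixel value range.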
