Implementation: Hugging Face Diffusers LoRA Dataset Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Diffusion_Models, Data_Preprocessing, Training_Pipelines |
| Last Updated | 2026-02-13 21:00 GMT |
Overview
Concrete tool for loading, preprocessing, and batching image-caption datasets for LoRA fine-tuning of text-to-image diffusion models, as implemented in the Diffusers training examples.
Description
This pattern combines Hugging Face datasets for data loading, torchvision.transforms for image preprocessing, and the model's tokenizer for caption processing into a complete data pipeline. The pipeline supports loading from the Hugging Face Hub by dataset name or from local directories using the imagefolder loader. Images are transformed through a configurable chain of resize, crop, flip, and normalization operations. Captions are tokenized using the CLIP tokenizer with padding and truncation to a fixed maximum length.
The dataset's with_transform method applies the preprocessing lazily (on the fly) rather than materializing all transformed examples in memory. A custom collate function stacks individual examples into batches suitable for the training loop. The accelerator.main_process_first() context manager lets the main process run dataset preparation steps (such as shuffling and subsetting) first while the other processes wait, so they can then reuse the cached results instead of repeating the work.
Usage
Use this dataset pipeline when:
- Fine-tuning Stable Diffusion with LoRA on custom image-caption data
- Loading datasets from the Hugging Face Hub for training
- Preparing local image directories with the imagefolder format
- You need configurable image transforms and column name mapping
Code Reference
Source Location
- Repository: diffusers
- File: examples/text_to_image/train_text_to_image_lora.py (lines 604-714)
Signature
# Dataset loading
dataset = load_dataset(
    args.dataset_name,
    args.dataset_config_name,
    cache_dir=args.cache_dir,
    data_dir=args.train_data_dir,
)

# Image transforms
train_transforms = transforms.Compose([
    transforms.Resize(args.resolution, interpolation=interpolation),
    transforms.CenterCrop(args.resolution),  # or RandomCrop
    transforms.RandomHorizontalFlip(),  # optional
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),
])

# Caption tokenization
def tokenize_captions(examples, is_train=True):
    ...
    inputs = tokenizer(
        captions, max_length=tokenizer.model_max_length,
        padding="max_length", truncation=True, return_tensors="pt",
    )
    return inputs.input_ids
Import
from datasets import load_dataset
from torchvision import transforms
from transformers import CLIPTokenizer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset_name | str | Yes (or train_data_dir) | Name of the dataset on the Hugging Face Hub (e.g., "lambdalabs/naruto-blip-captions"). |
| dataset_config_name | str | No | Configuration name for datasets with multiple configs. |
| train_data_dir | str | No | Path to local training data directory. Used as data_dir with Hub datasets or as the root for imagefolder loading. |
| image_column | str | No | Name of the column containing images. Auto-detected if not specified. |
| caption_column | str | No | Name of the column containing text captions. Auto-detected if not specified. |
| resolution | int | No | Target image resolution (height and width). Default: 512. |
| center_crop | bool | No | Use center crop instead of random crop. Default: False. |
| random_flip | bool | No | Apply random horizontal flip augmentation. Default: False. |
| max_train_samples | int | No | Limit the number of training samples for debugging. |
Outputs
| Name | Type | Description |
|---|---|---|
| train_dataloader | torch.utils.data.DataLoader | DataLoader yielding batches with "pixel_values" (shape [B, 3, H, W], range [-1, 1]) and "input_ids" (shape [B, max_length], int64 token IDs). |
Usage Examples
Basic Usage
from datasets import load_dataset
from torchvision import transforms
from transformers import CLIPTokenizer
import torch
# Load tokenizer
tokenizer = CLIPTokenizer.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    subfolder="tokenizer",
)

# Load dataset from Hub
dataset = load_dataset("lambdalabs/naruto-blip-captions")

# Define image transforms
resolution = 512
train_transforms = transforms.Compose([
    transforms.Resize(resolution, interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.CenterCrop(resolution),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),
])

# Define preprocessing function
def preprocess_train(examples):
    images = [image.convert("RGB") for image in examples["image"]]
    examples["pixel_values"] = [train_transforms(image) for image in images]
    captions = examples["text"]
    inputs = tokenizer(
        captions, max_length=tokenizer.model_max_length,
        padding="max_length", truncation=True, return_tensors="pt",
    )
    examples["input_ids"] = inputs.input_ids
    return examples

# Apply transforms lazily
train_dataset = dataset["train"].with_transform(preprocess_train)

# Create collate function and DataLoader
def collate_fn(examples):
    pixel_values = torch.stack([e["pixel_values"] for e in examples])
    pixel_values = pixel_values.to(memory_format=torch.contiguous_format).float()
    input_ids = torch.stack([e["input_ids"] for e in examples])
    return {"pixel_values": pixel_values, "input_ids": input_ids}

train_dataloader = torch.utils.data.DataLoader(
    train_dataset, shuffle=True, collate_fn=collate_fn, batch_size=4,
)
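To sanity-check the I/O contract from the Outputs table without downloading a model or dataset, the collate function can be exercised directly on stand-in tensors (random data; the shapes assume the default resolution of 512 and CLIP's max length of 77):

```python
import torch

def collate_fn(examples):
    # Same collate logic as above: stack per-example tensors into a batch.
    pixel_values = torch.stack([e["pixel_values"] for e in examples])
    pixel_values = pixel_values.to(memory_format=torch.contiguous_format).float()
    input_ids = torch.stack([e["input_ids"] for e in examples])
    return {"pixel_values": pixel_values, "input_ids": input_ids}

# Fake preprocessed examples standing in for transformed images and token IDs.
examples = [
    {"pixel_values": torch.randn(3, 512, 512),
     "input_ids": torch.zeros(77, dtype=torch.int64)}
    for _ in range(4)
]
batch = collate_fn(examples)
```

The resulting batch matches what the training loop expects: float pixel_values of shape [B, 3, H, W] and int64 input_ids of shape [B, max_length].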