Workflow:Huggingface Diffusers DreamBooth Personalization

Knowledge Sources	Diffusers DreamBooth Training DreamBooth
Domains	Diffusion_Models, Fine_Tuning, Personalization
Last Updated	2026-02-13 21:00 GMT

Overview

End-to-end process for personalizing a diffusion model to generate images of a specific subject from just a few reference images, using LoRA adapters for parameter efficiency.

Description

This workflow implements DreamBooth, a technique for teaching a pretrained diffusion model to associate a unique identifier token with a specific subject (person, object, pet, style). Given as few as 3-5 images of the subject, the model learns to generate that subject in novel contexts and poses. The workflow uses LoRA adapters for memory-efficient training and includes prior preservation regularization to prevent the model from forgetting its general knowledge of the subject's class. The UNet is always adapted; the text encoder can optionally be adapted as well, with its own LoRA configuration. The result is a small adapter file that, when loaded, enables the base model to generate the personalized subject on demand.

Usage

Execute this workflow when you have a small set of images (typically 3-10) depicting a specific subject and want the model to generate new images of that subject in different contexts. Common use cases include creating personalized avatars, generating product images in various settings, or teaching the model to reproduce a specific art style from a few examples. This differs from general LoRA fine-tuning in that it targets a single concept with minimal data rather than adapting to a broad dataset.

Execution Steps

Step 1: Instance Data Collection

Gather a small set of high-quality reference images of the target subject. These images should show the subject from different angles and lighting conditions while maintaining consistent identity. Choose a unique identifier token (e.g., "sks") that will be bound to the subject during training.

Key considerations:

3-10 images typically suffice; more is not always better
Images should be diverse in pose and background but consistent in the subject
Choose an identifier token that is rare in the model's vocabulary to avoid conflicts
Construct an instance prompt following the pattern: "a photo of [identifier] [class]"
The class word (e.g., "dog", "person") anchors the model's prior knowledge
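The prompt pattern above can be sketched as a small helper. This is a minimal illustration, not part of any library: the function name `build_prompts` is hypothetical, and the fixed "a photo of ..." wording simply follows the pattern given in this step.

```python
def build_prompts(identifier, class_word):
    """Return (instance_prompt, class_prompt) for a DreamBooth run.

    The instance prompt binds the rare identifier to the class word;
    the class prompt omits the identifier and is used for regularization.
    """
    identifier = identifier.strip()
    class_word = class_word.strip()
    if not identifier or not class_word:
        raise ValueError("identifier and class word must be non-empty")
    instance_prompt = f"a photo of {identifier} {class_word}"
    class_prompt = f"a photo of a {class_word}"
    return instance_prompt, class_prompt


instance_prompt, class_prompt = build_prompts("sks", "dog")
print(instance_prompt)
print(class_prompt)
```

Keeping both prompts in one place makes it harder for the instance prompt and the class prompt to drift apart between the generation and training steps.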

Step 2: Prior Preservation Generation

Generate regularization images using the base model with the class prompt (without the identifier). These class images serve as a regularization signal during training, preventing the model from collapsing the entire class concept into the specific subject. The base model generates them once, before training begins.

Key considerations:

Typically generate 100-200 class images
Use the class prompt only (e.g., "a photo of a dog") without the identifier
Images are generated at the same resolution as training images
This step can be skipped, but skipping it significantly degrades output diversity
Generated images are cached and reused across training runs
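Because the class images are cached, only the shortfall needs to be generated on later runs. The sketch below shows one way to compute that shortfall; the function name `count_missing_class_images`, the directory layout, and the accepted file suffixes are assumptions made for illustration.

```python
from pathlib import Path
import tempfile

# File suffixes treated as images (an assumption for this sketch).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def count_missing_class_images(class_dir, num_class_images):
    """Return how many class images still need to be generated.

    Images already present in class_dir are reused, so only the
    difference to num_class_images has to be sampled from the base model.
    """
    class_dir = Path(class_dir)
    class_dir.mkdir(parents=True, exist_ok=True)
    existing = sum(
        1 for path in class_dir.iterdir() if path.suffix.lower() in IMAGE_SUFFIXES
    )
    return max(num_class_images - existing, 0)


# Demo: three cached images, 200 requested.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "class_images"
    demo_dir.mkdir()
    for i in range(3):
        (demo_dir / f"{i}.jpg").touch()
    missing = count_missing_class_images(demo_dir, num_class_images=200)
    print(missing)
```

Only the returned number of images would then be sampled from the base model with the class prompt and written into the same directory.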

Step 3: Model Loading and Freezing

Load all pretrained model components: tokenizer, text encoder, VAE, UNet, and noise scheduler. The loader dynamically selects the correct text encoder class based on the model architecture. Freeze all model parameters to prepare for LoRA injection.

Key considerations:

Supports multiple model families (Stable Diffusion, IF, etc.) through dynamic class selection
Models without a VAE (e.g., IF) skip VAE loading gracefully
All components are cast to the appropriate dtype for memory efficiency
The text encoder can optionally be included in training for stronger concept binding
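The dynamic class selection described above can be sketched as a lookup from the architecture name in the text encoder's configuration to an importable class. The mapping below is a deliberately small, illustrative subset, and the helper names (`resolve_text_encoder`, `import_text_encoder_class`, `freeze`) are hypothetical. The only library facts assumed are that `CLIPTextModel` and `T5EncoderModel` are exported by `transformers` and that `torch.nn.Module` provides `requires_grad_`.

```python
import importlib

# Architecture name (as listed under "architectures" in the text encoder's
# config) -> (module name, class name). Illustrative subset only.
TEXT_ENCODER_CLASSES = {
    "CLIPTextModel": ("transformers", "CLIPTextModel"),
    "T5EncoderModel": ("transformers", "T5EncoderModel"),
}


def resolve_text_encoder(architecture):
    """Return (module_name, class_name) for a supported architecture."""
    try:
        return TEXT_ENCODER_CLASSES[architecture]
    except KeyError:
        raise ValueError(f"{architecture} is not supported") from None


def import_text_encoder_class(architecture):
    """Import the class lazily, so this module loads without transformers."""
    module_name, class_name = resolve_text_encoder(architecture)
    return getattr(importlib.import_module(module_name), class_name)


def freeze(model):
    """Disable gradients for every parameter of a torch.nn.Module."""
    model.requires_grad_(False)
    return model


print(resolve_text_encoder("T5EncoderModel"))
```

Freezing every component first means that the LoRA parameters injected in the next step are the only ones that receive gradients.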

Step 4: Dual LoRA Configuration

Configure and inject LoRA adapters into the UNet and, optionally, the text encoder. The UNet adapter targets the attention and cross-attention projection layers, including the added key/value projections. The text encoder adapter targets its internal attention layers for stronger concept association.

Key considerations:

UNet targets include standard attention projections plus add_k_proj and add_v_proj
Text encoder targets include q_proj, k_proj, v_proj, and out_proj
Separate rank and alpha values can be set for UNet and text encoder adapters
Training the text encoder improves concept fidelity but increases memory usage
Cast all trainable parameters to float32 for mixed-precision stability
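To make the two configurations concrete, the sketch below records them as plain data and works out the LoRA scaling factor and the parameter overhead per adapted layer. `LoraSpec` is a hypothetical stand-in whose field names mirror the `r`, `lora_alpha`, and `target_modules` arguments of PEFT's `LoraConfig`. The rank and alpha values are illustrative only, and the names `to_q`, `to_k`, `to_v`, and `to_out.0` are assumed to be the "standard attention projections" mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSpec:
    """Plain-data stand-in for one LoRA configuration."""
    r: int
    lora_alpha: int
    target_modules: tuple

    @property
    def scaling(self):
        # Standard LoRA scales the low-rank update by alpha / r.
        return self.lora_alpha / self.r

    def extra_params(self, in_features, out_features):
        # One adapted linear layer adds A (r x in) and B (out x r).
        return self.r * (in_features + out_features)


# Illustrative values; rank and alpha are independent per component.
unet_spec = LoraSpec(
    r=8,
    lora_alpha=16,
    target_modules=("to_q", "to_k", "to_v", "to_out.0", "add_k_proj", "add_v_proj"),
)
text_encoder_spec = LoraSpec(
    r=4,
    lora_alpha=4,
    target_modules=("q_proj", "k_proj", "v_proj", "out_proj"),
)

print(unet_spec.scaling)                  # 16 / 8
print(unet_spec.extra_params(320, 320))   # 8 * (320 + 320)
```

For a 320-by-320 projection, rank 8 adds 5,120 trainable parameters against 102,400 frozen weights, which is why the exported adapter stays small.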

Step 5: Dataset Construction

Build the DreamBooth dataset combining instance images (subject photos) with class images (regularization). The dataset handles pairing each image with its corresponding prompt, applying image transformations, and managing the instance/class split within each batch.

Key considerations:

Instance and class images are interleaved in training batches
Text embeddings can be pre-computed to save memory when not training the text encoder
Image augmentations include resize, crop, and optional horizontal flip
The dataset returns paired pixel values and text encodings for both instance and class samples
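The pairing and batching logic can be sketched without any image or tensor library. In the illustration below, file names stand in for pixel values and prompt strings stand in for text encodings; the class name `PairedDreamBoothIndex` and the `collate` helper are hypothetical. The sketch assumes that the smaller of the two image sets is cycled with a modulo index, and it places instance samples before class samples in each batch, as this workflow describes.

```python
class PairedDreamBoothIndex:
    """Pairs each instance sample with a class sample by index."""

    def __init__(self, instance_items, class_items, instance_prompt, class_prompt):
        if not instance_items:
            raise ValueError("at least one instance image is required")
        self.instance_items = list(instance_items)
        self.class_items = list(class_items)
        self.instance_prompt = instance_prompt
        self.class_prompt = class_prompt

    def __len__(self):
        # One epoch covers the larger of the two sets.
        return max(len(self.instance_items), len(self.class_items))

    def __getitem__(self, index):
        example = {
            "instance_image": self.instance_items[index % len(self.instance_items)],
            "instance_prompt": self.instance_prompt,
        }
        if self.class_items:
            example["class_image"] = self.class_items[index % len(self.class_items)]
            example["class_prompt"] = self.class_prompt
        return example


def collate(examples):
    """Build one batch: all instance samples first, then all class samples."""
    images = [e["instance_image"] for e in examples]
    prompts = [e["instance_prompt"] for e in examples]
    if examples and "class_image" in examples[0]:
        images += [e["class_image"] for e in examples]
        prompts += [e["class_prompt"] for e in examples]
    return {"images": images, "prompts": prompts}


dataset = PairedDreamBoothIndex(
    ["dog_0.jpg", "dog_1.jpg", "dog_2.jpg"],
    [f"class_{i}.jpg" for i in range(5)],
    "a photo of sks dog",
    "a photo of a dog",
)
batch = collate([dataset[0], dataset[1]])
print(len(dataset), batch["images"])
```

Putting the instance samples first is what allows the training loop to split a batch into two halves by position.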

Step 6: Training Loop with Prior Preservation

Execute the training loop with the combined instance and regularization loss. For each batch: encode images to latents, sample noise and timesteps, predict noise with the adapted model, then compute separate losses for instance and class samples. The total loss is a weighted combination that balances concept learning with prior preservation.

Key considerations:

Instance loss teaches the model the new concept
Prior preservation loss prevents catastrophic forgetting of the class concept
The prior loss weight (default 1.0) balances concept learning against preservation
The batch is split: first half contains instance samples, second half contains class samples
Training typically requires 400-1200 steps depending on the number of instance images
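The combined loss can be illustrated with plain Python numbers standing in for tensors. This is a sketch under simplifying assumptions: predictions and targets are flat lists of floats, the loss is mean squared error, and the function names `mse` and `dreambooth_loss` are hypothetical. A real training loop would split tensors along the batch dimension instead.

```python
def mse(pred, target):
    """Mean squared error between two equally long, non-empty sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def dreambooth_loss(pred, target, prior_loss_weight=1.0):
    """Instance loss plus weighted prior preservation loss.

    The first half of the batch holds instance samples and the second
    half holds class samples, so the two losses are computed separately.
    """
    if len(pred) % 2 != 0:
        raise ValueError("batch must contain equal instance and class halves")
    half = len(pred) // 2
    instance_loss = mse(pred[:half], target[:half])
    prior_loss = mse(pred[half:], target[half:])
    return instance_loss + prior_loss_weight * prior_loss


pred = [1.0, 2.0, 0.0, 0.0]
target = [0.0, 2.0, 0.0, 2.0]
print(dreambooth_loss(pred, target))                         # 0.5 + 1.0 * 2.0
print(dreambooth_loss(pred, target, prior_loss_weight=0.5))  # 0.5 + 0.5 * 2.0
```

Lowering the prior loss weight favors fidelity to the subject, while raising it favors keeping the class concept intact.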

Step 7: Validation and Export

Generate validation images using the trained adapter to verify concept learning. Save both UNet and text encoder LoRA weights in the diffusers-compatible format. Optionally push the trained adapter to the Hugging Face Hub.

Key considerations:

Validation images can be conditioned on both prompts and reference images
Both UNet and text encoder adapter weights are saved together
The combined adapter file is typically a few MB
The adapter can be loaded into any compatible base model pipeline
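Loading the exported adapter for validation might look like the sketch below. It assumes the `diffusers` and `torch` packages are installed and that the base pipeline supports `load_lora_weights`. The model identifier and adapter path passed in are placeholders chosen by the caller, and both helper names are hypothetical. The heavy imports are deferred so that the prompt helper runs on its own.

```python
def build_validation_prompts(identifier, class_word, contexts):
    """Place the personalized subject into several new contexts."""
    return [f"a photo of {identifier} {class_word} {context}" for context in contexts]


def load_personalized_pipeline(base_model_id, adapter_path, device="cuda"):
    """Load a base pipeline and attach the trained LoRA adapter.

    base_model_id and adapter_path are placeholders supplied by the caller.
    """
    # Imported lazily so the prompt helper above works without these packages.
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16)
    pipe.load_lora_weights(adapter_path)
    return pipe.to(device)


prompts = build_validation_prompts("sks", "dog", ["in a bucket", "on the beach"])
print(prompts)
```

Each validation prompt would then be passed to the returned pipeline, and the resulting images compared against the reference photos to check that the identifier reproduces the subject.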

Execution Diagram

GitHub URL

Workflow Repository