

Workflow:Huggingface Peft DreamBooth LoRA Diffusion

From Leeroopedia
Revision as of 11:01, 16 February 2026 by Admin (talk | contribs) (Auto-imported from workflows/Huggingface_Peft_DreamBooth_LoRA_Diffusion.md)


Knowledge Sources
Domains: Diffusion_Models, Fine_Tuning, Image_Generation, DreamBooth
Last Updated: 2026-02-07 06:00 GMT

Overview

End-to-end process for personalizing a Stable Diffusion model to generate images of a specific subject using DreamBooth with PEFT adapters (LoRA, LoHa, or LoKr) applied to the UNet and optionally the text encoder.

Description

This workflow demonstrates how to use PEFT adapters for DreamBooth-style personalization of Stable Diffusion models. DreamBooth teaches a diffusion model to associate a unique identifier (e.g., "sks dog") with a specific subject from a small set of reference images (typically 3-10). By applying LoRA, LoHa, or LoKr adapters instead of full fine-tuning, the process requires significantly less GPU memory and produces small adapter checkpoints rather than full model copies. The workflow supports prior preservation loss to maintain the model's general generation capabilities while learning the new subject. Training is orchestrated via Accelerate for optional multi-GPU support.
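The memory and checkpoint savings follow from simple arithmetic. The sketch below counts the trainable parameters a LoRA pair adds to one linear layer, assuming the standard low-rank form (a rank x d_in matrix and a d_out x rank matrix). The layer size and rank are hypothetical values chosen for illustration, and LoHa and LoKr use different decompositions, so their counts differ.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def dense_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the frozen weight matrix of a bias-free linear layer."""
    return d_in * d_out


# Hypothetical layer size and rank, chosen only for illustration.
d_in, d_out, rank = 768, 768, 8
added = lora_param_count(d_in, d_out, rank)
frozen = dense_param_count(d_in, d_out)
print(f"LoRA adds {added} trainable parameters next to {frozen} frozen ones "
      f"({100 * added / frozen:.2f}%)")
```

Only the added parameters need gradients, optimizer state, and storage in the adapter checkpoint, which is why the saved files stay small.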

Usage

Execute this workflow when you have a small set of images (3-10) of a specific subject (person, pet, object, style) and want to teach a Stable Diffusion model to generate new images of that subject in various contexts and poses. This is useful for personalized avatar generation, product visualization, artistic style transfer, or any scenario requiring subject-specific image generation with limited training data.

Execution Steps

Step 1: Load Stable Diffusion Pipeline Components

Load the individual components of a pre-trained Stable Diffusion pipeline: the tokenizer, noise scheduler (DDPMScheduler), variational autoencoder (VAE), UNet denoising network, and CLIP text encoder. Freeze the VAE parameters entirely since it only serves as an image encoder/decoder and does not need training.

Key considerations:

  • Each pipeline component is loaded separately for independent control
  • The VAE is frozen and set to eval mode throughout training
  • The noise scheduler defines the diffusion process (forward and reverse)
  • Use the same pretrained model path for all components to ensure compatibility
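A minimal sketch of this loading step, assuming a checkpoint in the diffusers layout with tokenizer, scheduler, vae, unet, and text_encoder subfolders. The function name load_sd_components is a hypothetical helper, not part of the workflow's code.

```python
def load_sd_components(model_path: str):
    """Load each Stable Diffusion component from the same checkpoint."""
    # Imported lazily so the heavy dependencies are only needed when called.
    from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
    from transformers import CLIPTextModel, CLIPTokenizer

    # Assumption: the checkpoint follows the diffusers subfolder layout.
    tokenizer = CLIPTokenizer.from_pretrained(model_path, subfolder="tokenizer")
    noise_scheduler = DDPMScheduler.from_pretrained(model_path, subfolder="scheduler")
    vae = AutoencoderKL.from_pretrained(model_path, subfolder="vae")
    unet = UNet2DConditionModel.from_pretrained(model_path, subfolder="unet")
    text_encoder = CLIPTextModel.from_pretrained(model_path, subfolder="text_encoder")

    # The VAE is only used to encode and decode images, so it stays frozen.
    vae.requires_grad_(False)
    vae.eval()
    return tokenizer, noise_scheduler, vae, unet, text_encoder
```

Passing the same model_path to every from_pretrained call is what keeps the components compatible with each other.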

Step 2: Apply PEFT Adapters to UNet and Text Encoder

Create the adapter configuration (LoraConfig, LoHaConfig, or LoKrConfig) with target modules appropriate for the UNet architecture (typically the attention projection layers, such as the query and value projections). Apply the adapter to the UNet using get_peft_model. Optionally apply a separate adapter to the text encoder to allow fine-tuning of text conditioning as well.

Key considerations:

  • UNet target modules typically include to_q, to_k, to_v, and to_out.0; these names occur in both the self-attention and cross-attention blocks
  • Text encoder targets are typically q_proj, v_proj for CLIP attention layers
  • LoHa and LoKr offer alternative decomposition approaches to LoRA
  • The adapter type and hyperparameters can differ between UNet and text encoder
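The sketch below illustrates how a list of target module names selects layers, assuming PEFT's suffix matching for list-valued target_modules (a module is selected when its full name equals a target or ends with a dot followed by the target). The helper names, the rank and alpha values, and the example module names are illustrative assumptions.

```python
def matches_target(module_name: str, target_modules: list[str]) -> bool:
    """Suffix match: the name equals a target or ends with '.<target>'."""
    return any(
        module_name == target or module_name.endswith("." + target)
        for target in target_modules
    )


def wrap_with_lora(model, target_modules, rank=8, alpha=32):
    """Wrap a model with a LoRA adapter on the given target modules."""
    # Imported lazily so the matching helper above runs without peft installed.
    from peft import LoraConfig, get_peft_model

    config = LoraConfig(
        r=rank,                # hypothetical rank
        lora_alpha=alpha,      # hypothetical scaling factor
        target_modules=target_modules,
        lora_dropout=0.0,
        bias="none",
    )
    return get_peft_model(model, config)


# Illustrative module names in the style of a diffusers UNet.
names = [
    "mid_block.attentions.0.transformer_blocks.0.attn1.to_q",
    "mid_block.attentions.0.transformer_blocks.0.attn2.to_v",
    "mid_block.attentions.0.transformer_blocks.0.attn2.to_out.0",
    "mid_block.resnets.0.conv1",
]
unet_targets = ["to_q", "to_k", "to_v", "to_out.0"]
matched = [n for n in names if matches_target(n, unet_targets)]
print(matched)
```

Calling wrap_with_lora once for the UNet and once for the text encoder, each with its own target list and hyperparameters, gives the two independent adapters described above.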

Step 3: Generate Prior Preservation Images

If prior preservation is enabled, generate a set of class images (e.g., generic "dog" images) using the pre-trained pipeline before training begins. These images serve as a regularization dataset during training, preventing the model from forgetting how to generate the broader class while learning the specific subject. The class image directory should hold at least as many images as the configured class image count before training starts.

Key considerations:

  • Prior preservation counteracts language drift and reduces overfitting to the few training images
  • Class images are generated from the class prompt (e.g., "a photo of a dog")
  • Typically 100-200 class images are sufficient for effective regularization
  • This step can be skipped if prior preservation is disabled
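One way to sketch this step is to top up the class image directory in batches until it reaches the configured count. The function names, file naming scheme, and batch size below are hypothetical. The pipeline argument is assumed to behave like a diffusers text-to-image pipeline, that is, calling it with a list of prompts returns an object whose images attribute holds savable images.

```python
import os


def missing_class_images(class_dir: str, num_class_images: int) -> int:
    """Return how many class images still need to be generated."""
    os.makedirs(class_dir, exist_ok=True)
    existing = len(os.listdir(class_dir))
    return max(0, num_class_images - existing)


def generate_class_images(pipeline, class_prompt, class_dir,
                          num_class_images, batch_size=4):
    """Generate only the missing class images and return the saved paths."""
    to_generate = missing_class_images(class_dir, num_class_images)
    start = num_class_images - to_generate  # index of the first new image
    saved = []
    for offset in range(0, to_generate, batch_size):
        n = min(batch_size, to_generate - offset)
        # Assumption: the pipeline accepts a list of prompts and exposes .images.
        images = pipeline([class_prompt] * n).images
        for i, image in enumerate(images):
            path = os.path.join(class_dir, f"class_{start + offset + i:04d}.jpg")
            image.save(path)
            saved.append(path)
    return saved
```

Because the function counts existing files first, rerunning it on a directory that already holds enough images generates nothing.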

Step 4: Prepare Training Data

Create a DreamBooth dataset that loads the instance images (subject-specific training images) and optionally the class images for prior preservation. Apply image preprocessing transforms including resizing, center cropping, normalization, and random horizontal flipping for augmentation. Build a DataLoader with a custom collation function that batches pixel values, input IDs, and prior preservation flags.

Key considerations:

  • Instance images are the 3-10 images of the specific subject
  • All images are resized to the model's expected resolution (typically 512x512)
  • The instance prompt includes a unique identifier (e.g., "a photo of sks dog")
  • The custom collate function handles variable-length tokenized prompts
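The batching logic can be sketched without any tensor library. The version below uses plain lists and hypothetical dictionary keys; a real collate function would additionally stack the entries into tensors. The point it illustrates is the ordering: all instance examples come first and all class examples second, so the batch can later be split into two equal halves.

```python
def collate_examples(examples, with_prior_preservation=False):
    """Merge per-example dicts into one batch, instance items first."""
    input_ids = [ex["instance_prompt_ids"] for ex in examples]
    pixel_values = [ex["instance_images"] for ex in examples]
    if with_prior_preservation:
        # Class examples are appended after all instance examples so the
        # training loop can later split the batch into two equal halves.
        input_ids += [ex["class_prompt_ids"] for ex in examples]
        pixel_values += [ex["class_images"] for ex in examples]
    return {"input_ids": input_ids, "pixel_values": pixel_values}


# Placeholder strings and short ID lists stand in for image tensors and token IDs.
examples = [
    {"instance_prompt_ids": [1, 2], "instance_images": "img_a",
     "class_prompt_ids": [3, 4], "class_images": "cls_a"},
    {"instance_prompt_ids": [1, 2], "instance_images": "img_b",
     "class_prompt_ids": [3, 4], "class_images": "cls_b"},
]
batch = collate_examples(examples, with_prior_preservation=True)
print(batch["pixel_values"])
```

With prior preservation enabled, the effective batch is twice the DataLoader batch size, which is worth keeping in mind when budgeting GPU memory.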

Step 5: Run Diffusion Training Loop

Execute the training loop via Accelerate. For each training step:

  • Encode images to latent space using the frozen VAE
  • Sample random noise and timesteps
  • Add noise to the latents according to the diffusion schedule
  • Encode text prompts to conditioning vectors using the text encoder
  • Predict the noise using the UNet conditioned on the text embeddings and timestep
  • Compute the MSE loss between predicted and actual noise

If prior preservation is enabled, add the prior preservation loss (MSE on class images) weighted by a configurable factor.

Key considerations:

  • The VAE encoder maps images from pixel space to latent space
  • Noise is added according to the DDPMScheduler's noise schedule
  • The UNet predicts the noise component (epsilon prediction)
  • Prior preservation loss weight (lambda) is typically 1.0
  • Gradient accumulation and mixed precision (fp16/bf16) are supported
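The arithmetic of one training step can be sketched with plain Python lists standing in for tensors. The forward-noising rule is the standard DDPM form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. The function names are hypothetical, and the loss assumes the first half of the batch holds instance samples and the second half holds class samples.

```python
import math


def add_noise(x0, eps, alpha_bar_t):
    """Forward diffusion: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps, elementwise."""
    signal, sigma = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [signal * x + sigma * e for x, e in zip(x0, eps)]


def mse(pred, target):
    """Mean squared error over a batch given as a list of flat samples."""
    sq = [(p - t) ** 2 for ps, ts in zip(pred, target) for p, t in zip(ps, ts)]
    return sum(sq) / len(sq)


def dreambooth_loss(pred, noise, with_prior_preservation=False,
                    prior_loss_weight=1.0):
    """Epsilon-prediction loss, optionally with a weighted prior term."""
    if not with_prior_preservation:
        return mse(pred, noise)
    # Assumption: instance samples fill the first half, class samples the second.
    half = len(pred) // 2
    instance_loss = mse(pred[:half], noise[:half])
    prior_loss = mse(pred[half:], noise[half:])
    return instance_loss + prior_loss_weight * prior_loss


# Toy batch: one instance sample and one class sample, two values each.
noise = [[1.0, 0.0], [0.0, 2.0]]
pred = [[0.0, 0.0], [0.0, 0.0]]
loss = dreambooth_loss(pred, noise, with_prior_preservation=True,
                       prior_loss_weight=0.5)
print(loss)  # 0.5 + 0.5 * 2.0 = 1.5
```

In an actual training loop the same structure applies to tensors: the scheduler adds the noise, the UNet produces pred, and the two halves are separated before the MSE terms are combined.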

Step 6: Save Adapter Weights

After training completes, save the PEFT adapter weights for the UNet (and text encoder if trained). The saved adapters are small files that can be loaded onto the base Stable Diffusion model for inference. Optionally push the adapter to the Hugging Face Hub for sharing. At inference time, load the base pipeline and apply the saved adapters to generate personalized images.

Key considerations:

  • UNet and text encoder adapters are saved separately
  • Adapter checkpoints are typically a few MB (vs. several GB for the full model)
  • The adapter can be loaded via PeftModel.from_pretrained at inference time
  • Multiple adapters for different subjects can be swapped without reloading the base model
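A minimal sketch of saving and reloading, assuming both models were wrapped with get_peft_model so that save_pretrained writes only the adapter weights. The helper names and the unet and text_encoder subdirectory layout are hypothetical choices.

```python
import os


def save_adapters(output_dir, unet, text_encoder=None):
    """Save UNet (and optionally text encoder) adapters to separate folders."""
    unet_dir = os.path.join(output_dir, "unet")
    unet.save_pretrained(unet_dir)
    saved = {"unet": unet_dir}
    if text_encoder is not None:
        text_encoder_dir = os.path.join(output_dir, "text_encoder")
        text_encoder.save_pretrained(text_encoder_dir)
        saved["text_encoder"] = text_encoder_dir
    return saved


def load_unet_adapter(base_unet, adapter_dir):
    """Attach a saved adapter to a freshly loaded base UNet."""
    # Imported lazily so save_adapters can be used without peft installed.
    from peft import PeftModel

    return PeftModel.from_pretrained(base_unet, adapter_dir)
```

To switch subjects without reloading the base model, a further adapter can likely be added to the returned PeftModel with load_adapter(adapter_dir, adapter_name=...) and then selected with set_adapter(...).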
