Workflow:Huggingface Diffusers DreamBooth Personalization
| Knowledge Sources | |
|---|---|
| Domains | Diffusion_Models, Fine_Tuning, Personalization |
| Last Updated | 2026-02-13 21:00 GMT |
Overview
End-to-end process for personalizing a diffusion model to generate images of a specific subject from just a few reference images, using LoRA adapters for parameter efficiency.
Description
This workflow implements DreamBooth, a technique for teaching a pretrained diffusion model to associate a unique identifier token with a specific subject (person, object, pet, style). Given as few as 3-5 images of the subject, the model learns to generate that subject in novel contexts and poses. The workflow uses LoRA adapters for memory-efficient training and includes prior preservation regularization to prevent the model from forgetting its general knowledge. The UNet is always adapted, and the text encoder can optionally be adapted as well, each with its own LoRA configuration. The result is a small adapter file that, when loaded, enables the base model to generate the personalized subject on demand.
Usage
Execute this workflow when you have a small set of images (typically 3-10) depicting a specific subject and want the model to generate new images of that subject in different contexts. Common use cases include creating personalized avatars, generating product images in various settings, or teaching the model to reproduce a specific art style from a few examples. This differs from general LoRA fine-tuning in that it targets a single concept with minimal data rather than adapting to a broad dataset.
Execution Steps
Step 1: Instance Data Collection
Gather a small set of high-quality reference images of the target subject. These images should show the subject from different angles and lighting conditions while maintaining consistent identity. Choose a unique identifier token (e.g., "sks") that will be bound to the subject during training.
Key considerations:
- 3-10 images typically suffice; more is not always better
- Images should be diverse in pose and background but consistent in the subject
- Choose an identifier token that is rare in the model's vocabulary to avoid conflicts
- Construct an instance prompt following the pattern: "a photo of [identifier] [class]"
- The class word (e.g., "dog", "person") anchors the model's prior knowledge
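The prompt pattern above can be sketched as a small helper. The function name `build_prompts` and its parameters are illustrative and not part of any library; "sks" and "dog" are only example values, and the fixed article "a" in the class prompt is a simplification.

```python
def build_prompts(identifier: str, class_word: str) -> tuple[str, str]:
    """Return (instance_prompt, class_prompt) for a DreamBooth run.

    Hypothetical helper for illustration only.
    """
    identifier = identifier.strip()
    class_word = class_word.strip()
    if not identifier or not class_word:
        raise ValueError("identifier and class word must be non-empty")
    # The instance prompt binds the rare identifier to the class word.
    instance_prompt = f"a photo of {identifier} {class_word}"
    # The class prompt omits the identifier; it is used for prior preservation.
    class_prompt = f"a photo of a {class_word}"
    return instance_prompt, class_prompt


instance_prompt, class_prompt = build_prompts("sks", "dog")
print(instance_prompt)  # a photo of sks dog
print(class_prompt)     # a photo of a dog
```

Keeping both prompts derived from the same class word makes it harder for the instance and class prompts to drift apart between runs.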
Step 2: Prior Preservation Generation
Generate regularization images using the base model with the class prompt (without the identifier). These class images serve as a regularization signal during training, preventing the model from collapsing the entire class concept into the specific subject. The model generates these images once before training begins.
Key considerations:
- Typically generate 100-200 class images
- Use the class prompt only (e.g., "a photo of a dog") without the identifier
- Images are generated at the same resolution as training images
- This step can be skipped, but skipping it significantly degrades output diversity
- Generated images are cached and reused across training runs
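The caching behaviour described above amounts to generating only the class images that are still missing. The following is a minimal sketch under the assumption that every file in the class directory counts as a cached image; `count_missing_class_images` and the directory name are hypothetical, and the actual image generation with the base model is left out.

```python
import tempfile
from pathlib import Path


def count_missing_class_images(class_dir: Path, num_class_images: int) -> int:
    """Return how many class images still need to be generated.

    Hypothetical helper: any file already in class_dir counts as cached.
    """
    class_dir.mkdir(parents=True, exist_ok=True)
    cached = sum(1 for path in class_dir.iterdir() if path.is_file())
    # Never return a negative count if the cache already exceeds the target.
    return max(num_class_images - cached, 0)


with tempfile.TemporaryDirectory() as tmp:
    class_dir = Path(tmp) / "class_images"
    missing_before = count_missing_class_images(class_dir, 200)
    # Simulate one previously generated image.
    (class_dir / "0.jpg").write_bytes(b"placeholder")
    missing_after = count_missing_class_images(class_dir, 200)

print(missing_before, missing_after)  # 200 199
```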
Step 3: Model Loading and Freezing
Load all pretrained model components: tokenizer, text encoder, VAE, UNet, and noise scheduler. The loader dynamically selects the correct text encoder class based on the model architecture. Freeze all model parameters to prepare for LoRA injection.
Key considerations:
- Supports multiple model families (Stable Diffusion, IF, etc.) through dynamic class selection
- Models without a VAE (e.g., IF) skip VAE loading gracefully
- Frozen components are cast to the training dtype (for example, fp16 or bf16 under mixed precision) for memory efficiency
- The text encoder can optionally be included in training for stronger concept binding
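Dynamic class selection can be sketched as a lookup on the `architectures` field of the text encoder's `config.json`. The mapping table and the function `select_text_encoder_class` below are assumptions for illustration; the sketch only returns the module and class names rather than importing them, so it runs without `transformers` installed.

```python
import json

# Architecture names as they appear in a text encoder config, mapped to
# the (module, class) expected to load them. Illustrative table only;
# extend it for other model families.
TEXT_ENCODER_CLASSES = {
    "CLIPTextModel": ("transformers", "CLIPTextModel"),    # e.g. Stable Diffusion
    "T5EncoderModel": ("transformers", "T5EncoderModel"),  # e.g. IF
}


def select_text_encoder_class(config_json: str) -> tuple[str, str]:
    """Pick the text encoder class from the raw contents of config.json."""
    config = json.loads(config_json)
    architectures = config.get("architectures") or []
    if not architectures:
        raise ValueError("config has no 'architectures' entry")
    name = architectures[0]
    if name not in TEXT_ENCODER_CLASSES:
        raise ValueError(f"{name} is not supported")
    return TEXT_ENCODER_CLASSES[name]


selected = select_text_encoder_class('{"architectures": ["CLIPTextModel"]}')
print(selected)  # ('transformers', 'CLIPTextModel')
```

In a real loader the returned pair could be resolved with `importlib.import_module` and `getattr`, after which the loaded module would be frozen before LoRA injection.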
Step 4: Dual LoRA Configuration
Configure and inject LoRA adapters into the UNet and, optionally, the text encoder. The UNet adapter targets attention and cross-attention projection layers, including the added key/value projections. The text encoder adapter targets its internal attention layers for stronger concept association.
Key considerations:
- UNet targets include standard attention projections plus add_k_proj and add_v_proj
- Text encoder targets include q_proj, k_proj, v_proj, and out_proj
- Separate rank and alpha values can be set for UNet and text encoder adapters
- Training the text encoder improves concept fidelity but increases memory usage
- Cast all trainable parameters to float32 for mixed-precision stability
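The two adapter configurations can be sketched as plain keyword dictionaries. The keys `r`, `lora_alpha`, and `target_modules` mirror arguments of `peft.LoraConfig`, but the helper `lora_kwargs`, the rank and alpha values, and the UNet projection names `to_q`, `to_k`, `to_v`, and `to_out.0` are assumptions made for this example; only `add_k_proj`, `add_v_proj`, and the text encoder targets come from the list above.

```python
def lora_kwargs(rank: int, alpha: int, target_modules: list[str]) -> dict:
    """Build keyword arguments for one LoRA adapter (hypothetical helper)."""
    if rank <= 0:
        raise ValueError("rank must be positive")
    return {
        "r": rank,                     # low-rank dimension
        "lora_alpha": alpha,           # update is scaled by lora_alpha / r
        "target_modules": list(target_modules),
    }


# Assumed names for the standard UNet attention projections, plus the
# added key/value projections named in the text.
UNET_TARGETS = ["to_q", "to_k", "to_v", "to_out.0", "add_k_proj", "add_v_proj"]
TEXT_ENCODER_TARGETS = ["q_proj", "k_proj", "v_proj", "out_proj"]

# Separate rank/alpha per adapter; the numbers are illustrative only.
unet_lora = lora_kwargs(rank=8, alpha=8, target_modules=UNET_TARGETS)
text_encoder_lora = lora_kwargs(rank=4, alpha=4, target_modules=TEXT_ENCODER_TARGETS)

print(unet_lora["r"], text_encoder_lora["r"])  # 8 4
```

If `peft` is installed, each dictionary could be unpacked into `LoraConfig(**unet_lora)` before the adapter is attached to its model.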
Step 5: Dataset Construction
Build the DreamBooth dataset combining instance images (subject photos) with class images (regularization). The dataset handles pairing each image with its corresponding prompt, applying image transformations, and managing the instance/class split within each batch.
Key considerations:
- Instance and class images are combined in each training batch, with instance samples placed before class samples
- Text embeddings can be pre-computed to save memory when not training the text encoder
- Image augmentations include resize, crop, and optional horizontal flip
- The dataset returns paired pixel values and text encodings for both instance and class samples
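The pairing and batching logic can be sketched at the index level, with file names standing in for image tensors and prompt strings standing in for text encodings. `DreamBoothIndex` and `collate` are hypothetical names; the wrap-around indexing and the "instance first, then class" batch layout are assumptions chosen to match the batch split described in Step 6.

```python
class DreamBoothIndex:
    """Index-level sketch of a DreamBooth dataset (hypothetical class)."""

    def __init__(self, instance_images, class_images, instance_prompt, class_prompt):
        if not instance_images:
            raise ValueError("at least one instance image is required")
        self.instance_images = list(instance_images)
        self.class_images = list(class_images)
        self.instance_prompt = instance_prompt
        self.class_prompt = class_prompt

    def __len__(self):
        # Cover the larger set; the smaller one wraps around via modulo.
        return max(len(self.instance_images), len(self.class_images))

    def __getitem__(self, index):
        example = {
            "instance_image": self.instance_images[index % len(self.instance_images)],
            "instance_prompt": self.instance_prompt,
        }
        if self.class_images:
            example["class_image"] = self.class_images[index % len(self.class_images)]
            example["class_prompt"] = self.class_prompt
        return example


def collate(examples, with_prior_preservation=True):
    """Place all instance samples first, then all class samples."""
    images = [e["instance_image"] for e in examples]
    prompts = [e["instance_prompt"] for e in examples]
    if with_prior_preservation:
        images += [e["class_image"] for e in examples]
        prompts += [e["class_prompt"] for e in examples]
    return {"images": images, "prompts": prompts}


dataset = DreamBoothIndex(
    instance_images=[f"inst_{i}.jpg" for i in range(3)],
    class_images=[f"class_{i}.jpg" for i in range(5)],
    instance_prompt="a photo of sks dog",
    class_prompt="a photo of a dog",
)
batch = collate([dataset[0], dataset[1]])
print(len(dataset))     # 5
print(batch["images"])  # ['inst_0.jpg', 'inst_1.jpg', 'class_0.jpg', 'class_1.jpg']
```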
Step 6: Training Loop with Prior Preservation
Execute the training loop with the combined instance and regularization loss. For each batch: encode images to latents, sample noise and timesteps, add the noise to the latents according to the scheduler, predict the noise with the adapted model, then compute separate losses for instance and class samples. The total loss is a weighted combination that balances concept learning with prior preservation.
Key considerations:
- Instance loss teaches the model the new concept
- Prior preservation loss prevents catastrophic forgetting of the class concept
- The prior loss weight (default 1.0) balances concept learning against preservation
- The batch is split: first half contains instance samples, second half contains class samples
- Training typically requires 400-1200 steps depending on the number of instance images
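The weighted loss can be illustrated in plain Python, with nested lists standing in for prediction and target tensors. This is a sketch assuming a mean-squared-error objective and the batch layout described above (instance samples in the first half, class samples in the second); `mse` and `dreambooth_loss` are hypothetical helpers, and a real training loop would use tensor operations instead.

```python
def mse(pred, target):
    """Mean squared error over two equal-length flat lists."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def dreambooth_loss(pred, target, prior_loss_weight=1.0):
    """Instance loss plus weighted prior preservation loss.

    pred and target hold one list of floats per sample, with instance
    samples in the first half and class samples in the second half.
    """
    if len(pred) != len(target) or len(pred) % 2 != 0:
        raise ValueError("batch must have an even number of samples")
    half = len(pred) // 2

    def flatten(rows):
        return [value for row in rows for value in row]

    instance_loss = mse(flatten(pred[:half]), flatten(target[:half]))
    prior_loss = mse(flatten(pred[half:]), flatten(target[half:]))
    return instance_loss + prior_loss_weight * prior_loss


pred = [[1.0, 2.0], [0.0, 0.0]]    # one instance sample, one class sample
target = [[0.0, 2.0], [0.0, 2.0]]
print(dreambooth_loss(pred, target))                         # 2.5
print(dreambooth_loss(pred, target, prior_loss_weight=0.5))  # 1.5
```

Here the instance loss is 0.5 and the prior loss is 2.0, so lowering the prior loss weight shifts the total toward fitting the new concept.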
Step 7: Validation and Export
Generate validation images using the trained adapter to verify concept learning. Save both UNet and text encoder LoRA weights in the diffusers-compatible format. Optionally push the trained adapter to the Hugging Face Hub.
Key considerations:
- Validation images can be conditioned on both prompts and reference images
- Both UNet and text encoder adapter weights are saved together
- The combined adapter file is typically a few MB
- The adapter can be loaded into any compatible base model pipeline
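A back-of-the-envelope estimate shows why the adapter file stays small. For a linear layer with input size d_in and output size d_out, a rank-r LoRA adapter adds r × (d_in + d_out) parameters. The layer shapes below (48 projections of 768 × 768) are a hypothetical example, not a description of any particular model, and the helpers are illustrative.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA pair: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def adapter_size_bytes(layer_shapes, rank: int, bytes_per_param: int = 4) -> int:
    """Approximate adapter size, ignoring file-format overhead."""
    total = sum(lora_param_count(d_in, d_out, rank) for d_in, d_out in layer_shapes)
    return total * bytes_per_param


# Hypothetical: 12 blocks with 4 adapted 768x768 projections each.
shapes = [(768, 768)] * (12 * 4)
size = adapter_size_bytes(shapes, rank=4)  # float32 weights
print(size)                # 1179648
print(size / (1024 ** 2))  # 1.125
```

The estimate grows linearly with rank and with the number of adapted layers, so higher ranks or training both the UNet and the text encoder produce proportionally larger files.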