Principle: Hugging Face Diffusers Dual LoRA Configuration
| Knowledge Sources | |
|---|---|
| Domains | |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
A design principle for configuring Low-Rank Adaptation (LoRA) adapters on both the UNet denoising network and the text encoder simultaneously. Dual LoRA configuration enables stronger concept binding during DreamBooth personalization by allowing the model to learn subject-specific representations in both the visual and textual pathways.
Description
Standard DreamBooth LoRA applies adapters only to the UNet's cross-attention layers, which modify how the denoising network responds to text conditioning. However, for challenging subjects -- especially faces, artistic styles, or abstract concepts -- training adapters on both the UNet and the text encoder yields significantly better results.
The dual LoRA configuration involves:
- UNet LoRA -- Adapters are injected into the UNet's attention projection layers: `to_k`, `to_q`, `to_v`, and `to_out.0` (covering both self-attention and cross-attention), plus `add_k_proj` and `add_v_proj` (the added key/value cross-attention projections present in certain architectures).
- Text encoder LoRA -- Adapters are injected into the text encoder's self-attention layers: `q_proj`, `k_proj`, `v_proj`, and `out_proj`.
Each LoRA adapter introduces a pair of low-rank matrices A and B that modify the original weight matrix: W' = W + alpha/r * B @ A, where r is the rank and alpha is the scaling factor.
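The update rule can be sketched numerically. A minimal sketch using NumPy; the layer dimensions below are illustrative, not taken from any specific model:

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in = 320, 768   # illustrative projection-layer shape
r, alpha = 4, 4          # rank and scaling factor, so alpha/r = 1.0

W = rng.standard_normal((d_out, d_in))      # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection (r x d_in)
B = np.zeros((d_out, r))                    # trainable up-projection (d_out x r)

# Low-rank update: W' = W + (alpha / r) * B @ A
W_prime = W + (alpha / r) * B @ A

# The delta B @ A has rank at most r; with B = 0 the adapter starts as a no-op.
assert np.allclose(W_prime, W)

# Trainable parameters per adapted layer: r * (d_in + d_out), far fewer
# than the d_in * d_out parameters of the frozen base weight.
lora_params = A.size + B.size   # 4 * (768 + 320) = 4352
full_params = W.size            # 320 * 768 = 245760
```

The key design point is that only A and B receive gradients; the base weight W stays frozen throughout training.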
Usage
Configure dual LoRA when:
- Fine-grained concept binding is needed -- the text encoder LoRA helps the model associate the identifier token with the subject's visual features more strongly.
- Text encoder training is enabled via `--train_text_encoder`.
- The subject has distinctive visual characteristics that benefit from modified text embeddings (e.g., specific faces, unique art styles).
When text encoder LoRA is not used, the text encoder remains fully frozen and only the UNet adapters are trained.
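The dual configuration can be sketched with a hand-rolled wrapper rather than the actual Diffusers/PEFT machinery; the `LoRALinear` class and the toy modules below are illustrative, and only the projection names mirror the real target modules:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank adapter (illustrative)."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: int = 4):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # freeze the original weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.scaling = alpha / r
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.T) @ self.lora_B.T

# Toy stand-ins for one UNet attention block and one text-encoder
# self-attention block (nn.ModuleDict keys cannot contain ".", so
# "to_out.0" is omitted from this sketch).
unet_attn = nn.ModuleDict({n: nn.Linear(64, 64) for n in ["to_q", "to_k", "to_v"]})
text_attn = nn.ModuleDict({n: nn.Linear(32, 32)
                           for n in ["q_proj", "k_proj", "v_proj", "out_proj"]})

# Dual configuration: inject adapters into BOTH components.
for attn in (unet_attn, text_attn):
    for name in list(attn.keys()):
        attn[name] = LoRALinear(attn[name])

trainable = [p for m in (unet_attn, text_attn)
             for p in m.parameters() if p.requires_grad]
# Only the A/B pairs are trainable: 2 matrices per wrapped projection.
assert len(trainable) == 2 * (3 + 4)
```

Because `lora_B` is zero-initialized, every wrapped layer reproduces its base layer's output exactly at step 0, so training starts from the pretrained model's behavior.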
Theoretical Basis
Dual LoRA for DreamBooth extends the standard LoRA formulation to a multi-component adaptation setting:
SINGLE-COMPONENT LoRA (UNet only):
W'_unet = W_unet + (alpha/r) * B_unet @ A_unet
Trainable params: { A_unet, B_unet } for each target module
DUAL-COMPONENT LoRA (UNet + Text Encoder):
W'_unet = W_unet + (alpha/r) * B_unet @ A_unet
W'_text = W_text + (alpha/r) * B_text @ A_text
Trainable params: { A_unet, B_unet, A_text, B_text } for each target module
TARGET MODULE SELECTION:
UNet targets: ["to_k", "to_q", "to_v", "to_out.0", "add_k_proj", "add_v_proj"]
Text encoder targets: ["q_proj", "k_proj", "v_proj", "out_proj"]
PARAMETER COUNT (rank r=4):
UNet LoRA: ~1.6M trainable params (out of ~860M total)
Text encoder LoRA: ~0.3M trainable params (out of ~123M total)
Total: ~1.9M trainable params (~0.2% of full model)
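The text-encoder figure can be sanity-checked by hand. Assuming the CLIP ViT-L/14 text encoder used by Stable Diffusion v1.x (12 layers, hidden size 768, four attention projections per layer -- an assumption about the base model, since the document does not name one), each rank-4 adapter pair adds r * (d_in + d_out) parameters:

```python
r = 4
layers, hidden, projections = 12, 768, 4   # CLIP ViT-L/14 text encoder (assumed)

# A is (r x d_in) and B is (d_out x r); here d_in = d_out = hidden.
params_per_adapter = r * (hidden + hidden)
text_encoder_lora = layers * projections * params_per_adapter
print(text_encoder_lora)   # 294912, i.e. ~0.3M, matching the estimate above
```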
Key theoretical properties:
- Text encoder concept binding -- LoRA on the text encoder modifies how the identifier token is embedded, creating a more subject-specific text representation that propagates through cross-attention to the UNet.
- Target module selection -- Only attention projection layers are targeted because these are the critical interaction points between text and visual features. Feed-forward and normalization layers are left frozen, as they encode more general transformations.
- Rank and alpha coupling -- In the DreamBooth implementation, `lora_alpha` is set equal to `rank`, so the effective scaling factor is alpha/r = 1.0. This avoids the need for separate alpha tuning.
- Gaussian initialization -- LoRA weights are initialized with `init_lora_weights="gaussian"`, which draws the down-projection matrix A from a small Gaussian instead of the default Kaiming-uniform initialization (the up-projection B remains zero-initialized, so the adapter still starts as a no-op); this has been found to work better for DreamBooth personalization.
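A numerical sketch of this initialization scheme, following PEFT's convention for `init_lora_weights="gaussian"` (A drawn from a Gaussian with standard deviation 1/r, B left at zero -- stated here as an assumption about the library's behavior):

```python
import numpy as np

rng = np.random.default_rng(0)
r, d_in, d_out = 4, 768, 768

# "gaussian" init: A ~ N(0, (1/r)^2); B stays zero so training starts from W' = W.
A = rng.normal(0.0, 1.0 / r, size=(r, d_in))
B = np.zeros((d_out, r))

delta = B @ A
# The adapter contributes nothing at step 0, yet A already holds small
# nonzero values, giving B a non-degenerate gradient direction immediately.
assert np.allclose(delta, 0.0)
```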