Principle: Hugging Face Diffusers LoRA Adapter Injection
| Knowledge Sources | |
|---|---|
| Domains | Diffusion_Models, Parameter_Efficient_Finetuning, LoRA |
| Last Updated | 2026-02-13 21:00 GMT |
Overview
Injecting low-rank adapter layers into frozen model weights enables parameter-efficient fine-tuning by training only a small number of additional parameters while preserving the pretrained model's capabilities.
Description
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that leaves the original pretrained weights untouched. Instead of updating them, it injects pairs of small trainable matrices into specific layers of the model. The original weight matrix remains frozen, and only the low-rank decomposition is trained.
For a pretrained weight matrix W of shape (d, k), LoRA adds a parallel path through two smaller matrices: a down-projection A of shape (r, k) and an up-projection B of shape (d, r), where r (the rank) is much smaller than both d and k. During the forward pass, the output becomes h = Wx + BAx. During training, only A and B are updated.
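A minimal PyTorch sketch of this parallel path (the class name and the init details are illustrative, not a diffusers API):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank bypass (illustrative)."""

    def __init__(self, base: nn.Linear, r: int = 4, lora_alpha: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # W_0 stays frozen
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # down-projection, shape (r, k)
        self.B = nn.Parameter(torch.zeros(d, r))         # up-projection, shape (d, r)
        self.scaling = lora_alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W_0 x + (lora_alpha / r) * B A x
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```

Wrapping an nn.Linear(768, 768) this way with r=4 trains 6,144 parameters instead of 589,824.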
In diffusion models, LoRA is typically applied to the attention layers of the UNet, specifically the query, key, value, and output projection matrices. This covers both the cross-attention layers (which condition on text) and the self-attention layers (which model spatial relationships), allowing the model to learn new visual concepts or styles while retaining its general generation ability.
The lora_alpha parameter controls the scaling of the adapter output. The effective scaling factor is lora_alpha / r, which determines how much influence the adapter has relative to the frozen weights. A common practice is to set lora_alpha = r, yielding a scaling factor of 1.0.
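With the PEFT library, rank and scaling are set together on a LoraConfig; a minimal sketch, assuming a recent peft release:

```python
from peft import LoraConfig

# lora_alpha = r gives an effective adapter scaling of 1.0
config = LoraConfig(r=8, lora_alpha=8)
print(config.lora_alpha / config.r)  # 1.0; lora_alpha=16 with r=8 would give 2.0
```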
Usage
Use LoRA adapter injection when:
- Fine-tuning diffusion models on custom datasets with limited GPU memory
- Training personalized models (e.g., DreamBooth-style with LoRA)
- Producing small, shareable adapter files (typically 3-50 MB) rather than full model checkpoints (2-7 GB)
- Combining multiple fine-tuned behaviors by loading and merging several adapters (see the sketch after this list)
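A sketch of the multi-adapter workflow in diffusers, assuming a recent release with the PEFT backend enabled; the checkpoint paths, adapter names, and weights are hypothetical:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load two adapter files (hypothetical paths) under distinct names
pipe.load_lora_weights("path/to/style_lora", adapter_name="style")
pipe.load_lora_weights("path/to/subject_lora", adapter_name="subject")

# Blend both behaviors with per-adapter weights
pipe.set_adapters(["style", "subject"], adapter_weights=[0.8, 0.6])
image = pipe("a watercolor portrait of sks person").images[0]
```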
Theoretical Basis
Low-Rank Decomposition
The core insight of LoRA is that the weight update during fine-tuning has low intrinsic rank. For a pretrained weight matrix W_0:
W = W_0 + delta_W
delta_W = B * A where B in R^{d x r}, A in R^{r x k}, r << min(d, k)
Forward pass:
h = W_0 * x + (lora_alpha / r) * B * A * x
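Because the adapter path is linear, it is equivalent to adding delta_W = (lora_alpha / r) * B * A directly into the frozen weight, which is why trained adapters can be fused at inference time. A small numerical check (dimensions arbitrary; B is made nonzero here to exercise the path):

```python
import torch

d, k, r, lora_alpha = 64, 48, 4, 4
W0 = torch.randn(d, k)
A = torch.randn(r, k) * 0.01
B = torch.randn(d, r)
x = torch.randn(k)

scaling = lora_alpha / r
h_adapter = W0 @ x + scaling * (B @ (A @ x))  # parallel-path forward
h_merged = (W0 + scaling * (B @ A)) @ x       # merged-weight forward
assert torch.allclose(h_adapter, h_merged, atol=1e-5)
```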
The number of trainable parameters per adapted layer is:
params_lora = r * (d + k)
params_full = d * k
Compression ratio = params_lora / params_full = r * (d + k) / (d * k)
Example: d=k=1024, r=4:
params_lora = 4 * 2048 = 8,192
params_full = 1,048,576
Compression ratio = 0.78% (128x fewer parameters)
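The arithmetic can be reproduced in a few lines:

```python
d = k = 1024
r = 4

params_lora = r * (d + k)   # 8,192
params_full = d * k         # 1,048,576
ratio = params_lora / params_full
print(f"ratio = {ratio:.4%}")                          # ratio = 0.7812%
print(f"{params_full / params_lora:.0f}x fewer params")  # 128x fewer params
```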
Initialization
Matrix A is initialized from a Gaussian distribution (or Kaiming uniform), and matrix B is initialized to zero. This ensures that at the start of training, delta_W = B * A = 0, so the model output is identical to the pretrained model:
A ~ N(0, sigma^2) (or Kaiming uniform)
B = 0
At initialization: delta_W = 0 * A = 0
Therefore: W = W_0 + 0 = W_0
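A quick tensor-level check of this property (sizes arbitrary):

```python
import torch

d, k, r = 32, 16, 4
A = torch.randn(r, k) * 0.01  # Gaussian init
B = torch.zeros(d, r)         # zero init

delta_W = B @ A
assert torch.count_nonzero(delta_W) == 0  # W = W_0 at step 0
```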
Target Modules in UNet
For Stable Diffusion's UNet, LoRA is typically applied to the attention projection matrices:
Target modules: ["to_k", "to_q", "to_v", "to_out.0"]
These correspond to:
- to_q: Query projection in self/cross-attention
- to_k: Key projection in self/cross-attention
- to_v: Value projection in self/cross-attention
- to_out.0: Output projection in self/cross-attention
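A sketch that lists the matching submodules in a Stable Diffusion UNet, assuming diffusers is installed and the runwayml/stable-diffusion-v1-5 checkpoint is available:

```python
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

targets = ("to_q", "to_k", "to_v", "to_out.0")
matches = [name for name, _ in unet.named_modules() if name.endswith(targets)]
print(len(matches), matches[:2])  # count of adapted layers plus two sample names
```

These module names are what gets passed as target_modules in a peft LoraConfig when injecting adapters into the UNet.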