Principle: Hugging Face Diffusers LoRA Adapter Injection
| Knowledge Sources | |
|---|---|
| Domains | Diffusion_Models, Parameter_Efficient_Finetuning, LoRA |
| Last Updated | 2026-02-13 21:00 GMT |
Overview
Injecting low-rank adapter layers into frozen model weights enables parameter-efficient fine-tuning by training only a small number of additional parameters while preserving the pretrained model's capabilities.
Description
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that leaves the original pretrained weights untouched. Instead of updating them, it injects pairs of small trainable matrices into specific layers of the model. The original weight matrix remains frozen, and only the low-rank decomposition is trained.
For a pretrained weight matrix W of shape (d, k), LoRA adds a parallel path through two smaller matrices: a down-projection A of shape (r, k) and an up-projection B of shape (d, r), where r (the rank) is much smaller than both d and k. During the forward pass, the output becomes h = Wx + BAx. During training, only A and B are updated.
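A minimal PyTorch sketch of this parallel path (the class name and the init details are illustrative, not a diffusers API):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank bypass (illustrative)."""

    def __init__(self, base: nn.Linear, r: int = 4, lora_alpha: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # W_0 stays frozen
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # down-projection, shape (r, k)
        self.B = nn.Parameter(torch.zeros(d, r))         # up-projection, shape (d, r)
        self.scaling = lora_alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W_0 x + (lora_alpha / r) * B A x
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```

Wrapping an nn.Linear(768, 768) this way with r=4 trains 6,144 parameters instead of 589,824.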
In diffusion models, LoRA is typically applied to the attention layers of the UNet, specifically the query, key, value, and output projection matrices. This covers both the cross-attention layers (which condition on text) and the self-attention layers (which model spatial relationships), allowing the model to learn new visual concepts or styles while retaining its general generation ability.
The lora_alpha parameter controls the scaling of the adapter output. The effective scaling factor is lora_alpha / r, which determines how much influence the adapter has relative to the frozen weights. A common practice is to set lora_alpha = r, yielding a scaling factor of 1.0.
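With the PEFT library, rank and scaling are set together on a LoraConfig; a minimal sketch, assuming a recent peft release:

```python
from peft import LoraConfig

# lora_alpha = r gives an effective adapter scaling of 1.0
config = LoraConfig(r=8, lora_alpha=8)
print(config.lora_alpha / config.r)  # 1.0; lora_alpha=16 with r=8 would give 2.0
```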
Usage
Use LoRA adapter injection when:
- Fine-tuning diffusion models on custom datasets with limited GPU memory
- Training personalized models (e.g., DreamBooth-style with LoRA)
- Producing small, shareable adapter files (typically 3-50 MB) rather than full model checkpoints (2-7 GB)
- Combining multiple fine-tuned behaviors by loading and merging several adapters (see the sketch after this list)
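A sketch of the multi-adapter workflow in diffusers, assuming a recent release with the PEFT backend enabled; the checkpoint paths, adapter names, and weights are hypothetical:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load two adapter files (hypothetical paths) under distinct names
pipe.load_lora_weights("path/to/style_lora", adapter_name="style")
pipe.load_lora_weights("path/to/subject_lora", adapter_name="subject")

# Blend both behaviors with per-adapter weights
pipe.set_adapters(["style", "subject"], adapter_weights=[0.8, 0.6])
image = pipe("a watercolor portrait of sks person").images[0]
```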
Theoretical Basis
Low-Rank Decomposition
The core insight of LoRA is that the weight update during fine-tuning has low intrinsic rank. For a pretrained weight matrix W_0:
W = W_0 + delta_W
delta_W = B * A where B in R^{d x r}, A in R^{r x k}, r << min(d, k)
Forward pass:
h = W_0 * x + (lora_alpha / r) * B * A * x
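Because the adapter path is linear, it is equivalent to adding delta_W = (lora_alpha / r) * B * A directly into the frozen weight, which is why trained adapters can be fused at inference time. A small numerical check (dimensions arbitrary; B is made nonzero here to exercise the path):

```python
import torch

d, k, r, lora_alpha = 64, 48, 4, 4
W0 = torch.randn(d, k)
A = torch.randn(r, k) * 0.01
B = torch.randn(d, r)
x = torch.randn(k)

scaling = lora_alpha / r
h_adapter = W0 @ x + scaling * (B @ (A @ x))  # parallel-path forward
h_merged = (W0 + scaling * (B @ A)) @ x       # merged-weight forward
assert torch.allclose(h_adapter, h_merged, atol=1e-5)
```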
The number of trainable parameters per adapted layer is:
params_lora = r * (d + k)
params_full = d * k
Compression ratio = params_lora / params_full = r * (d + k) / (d * k)
Example: d=k=1024, r=4:
params_lora = 4 * 2048 = 8,192
params_full = 1,048,576
Compression ratio = 0.78% (128x fewer parameters)
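The arithmetic can be reproduced in a few lines:

```python
d = k = 1024
r = 4

params_lora = r * (d + k)   # 8,192
params_full = d * k         # 1,048,576
ratio = params_lora / params_full
print(f"ratio = {ratio:.4%}")                          # ratio = 0.7812%
print(f"{params_full / params_lora:.0f}x fewer params")  # 128x fewer params
```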
Initialization
Matrix A is initialized from a Gaussian distribution (or Kaiming uniform), and matrix B is initialized to zero. This ensures that at the start of training, delta_W = B * A = 0, so the model output is identical to the pretrained model:
A ~ N(0, sigma^2) (or Kaiming uniform)
B = 0
At initialization: delta_W = 0 * A = 0
Therefore: W = W_0 + 0 = W_0
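A quick tensor-level check of this property (sizes arbitrary):

```python
import torch

d, k, r = 32, 16, 4
A = torch.randn(r, k) * 0.01  # Gaussian init
B = torch.zeros(d, r)         # zero init

delta_W = B @ A
assert torch.count_nonzero(delta_W) == 0  # W = W_0 at step 0
```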
Target Modules in UNet
For Stable Diffusion's UNet, LoRA is typically applied to the attention projection matrices:
Target modules: ["to_k", "to_q", "to_v", "to_out.0"]
These correspond to:
- to_q: Query projection in self/cross-attention
- to_k: Key projection in self/cross-attention
- to_v: Value projection in self/cross-attention
- to_out.0: Output projection in self/cross-attention
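A sketch that lists the matching submodules in a Stable Diffusion UNet, assuming diffusers is installed and the runwayml/stable-diffusion-v1-5 checkpoint is available:

```python
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

targets = ("to_q", "to_k", "to_v", "to_out.0")
matches = [name for name, _ in unet.named_modules() if name.endswith(targets)]
print(len(matches), matches[:2])  # count of adapted layers plus two sample names
```

These module names are what gets passed as target_modules in a peft LoraConfig when injecting adapters into the UNet.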