Principle:Huggingface Diffusers Model Freezing
| Knowledge Sources | |
|---|---|
| Domains | |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
A design principle for loading pretrained diffusion model components and freezing their weights before injecting parameter-efficient adapters. Model freezing is the prerequisite step that enables LoRA-based fine-tuning by ensuring that only the newly added adapter parameters receive gradients during training.
Description
In DreamBooth LoRA training, the full diffusion pipeline consists of several large pretrained components:
- UNet2DConditionModel -- The denoising network (typically hundreds of millions of parameters).
- AutoencoderKL (VAE) -- The variational autoencoder for encoding/decoding between pixel and latent space.
- Text encoder -- The CLIP or similar text encoder that converts prompts to embeddings.
- DDPMScheduler -- The noise scheduler (no trainable parameters).
The model freezing principle dictates that all pretrained weights are loaded in evaluation mode and their gradients are disabled via requires_grad_(False) before any adapter layers are added. This achieves several goals:
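- Memory efficiency -- Frozen parameters store no gradient buffers and need no optimizer state, so the training-time overhead on top of the weights themselves scales with the adapter size rather than the full model size. The exact savings depend on the optimizer, the precision, and how much memory activations take.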
- Training stability -- Only the small set of adapter parameters is updated, which prevents catastrophic changes to the pretrained weights.
- Mixed precision compatibility -- Frozen weights can be cast to half-precision (fp16/bf16) for inference, while adapter weights remain in full precision for training stability.
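The effect of freezing can be checked by counting trainable parameters before and after the call. The following is a minimal sketch, not the training script itself: the toy nn.Sequential is only a stand-in for a pretrained component, and count_parameters is a hypothetical helper introduced for illustration.

```python
from torch import nn


def count_parameters(module: nn.Module) -> tuple[int, int]:
    """Return (trainable, total) parameter counts for a module."""
    total = sum(p.numel() for p in module.parameters())
    trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
    return trainable, total


# Toy stand-in for a pretrained component such as the UNet or text encoder.
toy_model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 32))

before = count_parameters(toy_model)
toy_model.requires_grad_(False)  # freeze every parameter in place
toy_model.eval()                 # evaluation mode (affects layers such as dropout)
after = count_parameters(toy_model)

print(f"trainable before freezing: {before[0]} of {before[1]}")
print(f"trainable after freezing:  {after[0]} of {after[1]}")
```

After the call, the total parameter count is unchanged while the trainable count drops to zero, which is the state the base model should be in before any adapter is attached.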
Usage
Apply model freezing immediately after loading pretrained components and before adding LoRA adapters:
- Load all model components with from_pretrained().
- Freeze all parameters with model.requires_grad_(False).
- Cast frozen models to the inference dtype (fp16/bf16).
- Move all models to the target device.
- Then add LoRA adapters (which will have requires_grad=True by default).
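The ordering of the steps above can be sketched on a toy layer. This is an illustration under stated assumptions: a single nn.Linear stands in for a component returned by from_pretrained(), bf16 is assumed as the inference dtype, and LoRALinear is a hypothetical hand-written wrapper, not the adapter mechanism used by Diffusers (the training scripts attach adapters through the PEFT integration instead).

```python
import torch
from torch import nn


class LoRALinear(nn.Module):
    """Hypothetical minimal LoRA wrapper: frozen base layer plus a low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base  # expected to be frozen and cast by the caller
        self.lora_down = nn.Linear(base.in_features, rank, bias=False)
        self.lora_up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_up.weight)  # the low-rank update starts at zero

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        frozen_out = self.base(x.to(self.base.weight.dtype))
        delta = self.lora_up(self.lora_down(x.to(self.lora_down.weight.dtype)))
        return frozen_out.to(delta.dtype) + delta


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
weight_dtype = torch.bfloat16  # assumed inference dtype

# 1. Load (a toy layer stands in for a from_pretrained() call).
base = nn.Linear(16, 16)
# 2. Freeze all parameters.
base.requires_grad_(False)
# 3-4. Cast to the inference dtype and move to the target device.
base.to(device, dtype=weight_dtype)
# 5. Add the adapter; its new parameters are fp32 and trainable by default.
layer = LoRALinear(base, rank=4).to(device)

trainable = [name for name, p in layer.named_parameters() if p.requires_grad]
print(trainable)
```

Because the adapter is created after the cast, moving the wrapper to the device leaves the base weights in bf16 and the adapter weights in fp32, and only the two adapter matrices are listed as trainable.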
Theoretical Basis
Model freezing is the foundation of transfer learning and parameter-efficient fine-tuning (PEFT). The core insight is that a model pretrained on a large dataset has already learned rich feature representations, and adapting it to a new task requires modifying only a small subset of parameters.
FREEZE-THEN-ADAPT:
    theta_pretrained = load_pretrained(model_id)
    For all p in theta_pretrained:
        p.requires_grad = False    # Freeze base model
    theta_adapter = initialize_adapter(theta_pretrained)
    # Only theta_adapter receives gradients
    theta_total = theta_pretrained + theta_adapter
    # Forward pass uses both; backward pass updates only theta_adapter
MEMORY ANALYSIS:
    Base model: ~860M params (UNet) + ~123M (text_encoder) + ~83M (VAE)
    LoRA adapters: ~1-4M params (rank 4, targeting attention layers)
    Gradient memory: proportional to |theta_adapter| only
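To put rough numbers on the memory analysis, the arithmetic below compares the gradient and optimizer-state footprint of a fully trainable UNet with that of the adapters alone. It is a back-of-the-envelope sketch: training_state_bytes is a hypothetical helper, fp32 values and an Adam-style optimizer with two state tensors per parameter are assumptions, and weights and activations are deliberately left out.

```python
BYTES_PER_FP32 = 4


def training_state_bytes(n_trainable: int, optimizer_slots: int = 2) -> int:
    """Bytes for fp32 gradients plus optimizer state of the trainable parameters.

    Assumes one gradient value and `optimizer_slots` state values per parameter
    (two for an Adam-style optimizer). Weights and activations are not counted.
    """
    return n_trainable * BYTES_PER_FP32 * (1 + optimizer_slots)


unet_params = 860_000_000   # UNet size quoted in the memory analysis
adapter_params = 4_000_000  # upper end of the quoted LoRA range

full_finetune = training_state_bytes(unet_params)
lora_only = training_state_bytes(adapter_params)

print(f"UNet fully trainable: {full_finetune / 1e9:.2f} GB of gradient and optimizer state")
print(f"LoRA adapters only:   {lora_only / 1e6:.0f} MB of gradient and optimizer state")
```

Under these assumptions the trainable-state footprint shrinks from roughly ten gigabytes to tens of megabytes, which is why the frozen weights and the activations dominate memory in LoRA training.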
Key theoretical properties:
- Gradient flow control -- requires_grad_(False) prevents gradient computation and storage for frozen parameters, but the frozen weights still participate in the forward pass. Gradients flow through frozen layers to reach adapter parameters via the chain rule.
- Weight preservation -- Frozen weights remain at their pretrained values throughout training, preserving the model's general capabilities.
- Dtype separation -- Frozen weights can use reduced precision (fp16/bf16) because they are never updated and so do not accumulate rounding error across optimizer steps, while trainable adapter weights stay in full precision (fp32) for numerical stability during gradient updates.
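The gradient flow and weight preservation properties can be observed on a two-layer toy example. In this sketch both layers are stand-ins chosen for illustration (a trainable nn.Linear plays the adapter, a frozen nn.Linear placed after it plays the pretrained layer), and plain SGD is an arbitrary choice of optimizer.

```python
import torch
from torch import nn

torch.manual_seed(0)

adapter = nn.Linear(8, 8)  # trainable stand-in for adapter parameters
frozen = nn.Linear(8, 1)   # stand-in for a pretrained layer
frozen.requires_grad_(False)
weights_before = frozen.weight.clone()

# Only the adapter parameters are handed to the optimizer.
optimizer = torch.optim.SGD(adapter.parameters(), lr=0.1)

x = torch.randn(4, 8)
loss = frozen(adapter(x)).pow(2).mean()  # the frozen layer sits after the adapter
loss.backward()
optimizer.step()

print(frozen.weight.grad)                          # None: nothing stored for frozen weights
print(adapter.weight.grad is not None)             # True: gradient passed through the frozen layer
print(torch.equal(frozen.weight, weights_before))  # True: frozen weights are unchanged
```

The loss depends on the adapter only through the frozen layer, so the adapter gradient exists only because backpropagation passes through the frozen weights without storing a gradient for them.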