Principle: Huggingface Diffusers Training Model Loading
| Knowledge Sources | |
|---|---|
| Domains | Diffusion_Models, Model_Loading, Transfer_Learning |
| Last Updated | 2026-02-13 21:00 GMT |
Overview
Loading model components individually for fine-tuning workflows enables selective weight freezing and targeted parameter-efficient training of specific subnetworks within a diffusion pipeline.
Description
A diffusion pipeline consists of multiple distinct components: a variational autoencoder (VAE) for encoding images to and from latent space, a denoising network (typically a UNet) that predicts noise, a text encoder that produces conditioning embeddings, a tokenizer that converts text to tokens, and a noise scheduler that controls the diffusion process. For fine-tuning, these components must be loaded individually rather than through the pipeline interface so that each can be independently configured.
Component-level loading means instantiating each model class separately using from_pretrained with the appropriate subfolder argument. A Stable Diffusion checkpoint on the Hub is organized as a directory with subfolders: unet/, vae/, text_encoder/, tokenizer/, and scheduler/. Each component class knows how to load its own weights from the corresponding subfolder.
Weight freezing is the practice of disabling gradient computation for model parameters that should not be updated during training. In LoRA fine-tuning, all original model weights are frozen (requires_grad_(False)) and only the newly injected LoRA adapter parameters are trainable. This dramatically reduces memory requirements and prevents catastrophic forgetting of the pretrained knowledge.
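A minimal self-contained sketch of this freeze-then-adapt pattern, using a hand-rolled LoRA layer on a toy `nn.Linear` (a stand-in for a UNet attention projection; this illustrates the idea, it is not the `peft` implementation):

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank adapter."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        self.lora_down = nn.Linear(base.in_features, rank, bias=False)
        self.lora_up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_up.weight)  # adapter starts as a zero delta

    def forward(self, x):
        return self.base(x) + self.lora_up(self.lora_down(x))

base = nn.Linear(16, 16)
base.requires_grad_(False)  # freeze the pretrained weights
layer = LoRALinear(base)

# Only the injected adapter parameters remain trainable.
trainable = [name for name, p in layer.named_parameters() if p.requires_grad]
```

Since the original weights never receive gradients, the optimizer state covers only the small adapter matrices, which is where the memory savings come from.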
Subfolder-based organization allows a single model repository to contain all components of a pipeline. Each component is stored in its own subfolder with its own config and weight files, enabling independent versioning and loading.
Usage
Use component-level model loading when:
- Fine-tuning only specific components (e.g., UNet for LoRA, text encoder for textual inversion)
- You need to freeze certain components while training others
- Loading models with specific dtype or device placement for memory efficiency
- Using different revisions or variants for different components
Theoretical Basis
Transfer Learning and Weight Freezing
In transfer learning, a pretrained model provides a strong initialization. The key decision is which parameters to update:
Full fine-tuning: theta_new = theta_pretrained - lr * grad(L, theta_pretrained)
(all parameters updated, high memory, risk of forgetting)
Frozen + adapters: theta_pretrained is fixed (requires_grad = False)
theta_adapter = theta_adapter - lr * grad(L, theta_adapter)
(only adapter parameters updated, low memory, preserves knowledge)
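The two update rules can be checked numerically with scalar stand-ins for the pretrained and adapter parameters: backpropagation produces a gradient only for the trainable adapter, so the frozen weight is never updated.

```python
import torch

theta_pre = torch.tensor([2.0], requires_grad=False)  # frozen pretrained weight
theta_ad = torch.tensor([0.0], requires_grad=True)    # trainable adapter weight

x = torch.tensor([3.0])
# loss = ((theta_pre + theta_ad) * x)^2 = (6)^2 = 36 at initialization
loss = ((theta_pre + theta_ad) * x).sum() ** 2
loss.backward()

# No gradient flows to the frozen parameter; the adapter gets
# d(loss)/d(theta_ad) = 2 * 6 * 3 = 36.
```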
Component Architecture of Stable Diffusion
The Stable Diffusion pipeline operates in a latent space:
Text Input --> [Tokenizer] --> [Text Encoder] --> text_embeddings
Image Input --> [VAE Encoder] --> latent (z)
Training: z + noise --> [UNet(z_noisy, t, text_embeddings)] --> noise_pred
loss = MSE(noise_pred, noise)
Inference: random noise --> [iterative UNet denoising] --> z_clean
z_clean --> [VAE Decoder] --> generated image
For LoRA training, only the UNet receives adapter layers. The VAE and text encoder remain frozen and are cast to half precision to save memory.
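The freeze-and-cast setup described above can be sketched with tiny stand-in modules in place of the real VAE, text encoder, and UNet; the pattern is the same for the actual diffusers components.

```python
import torch
import torch.nn as nn

# Stand-ins for the real pipeline components (illustration only).
vae, text_encoder, unet = nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4)

# Freeze everything; the LoRA adapter layers injected into the UNet later
# would be the only trainable parameters.
for model in (vae, text_encoder, unet):
    model.requires_grad_(False)

# Cast the frozen, inference-only components to half precision to save memory.
weight_dtype = torch.float16
vae.to(dtype=weight_dtype)
text_encoder.to(dtype=weight_dtype)
# The UNet stays in fp32 so the adapters can train in full precision.
```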