Principle: Huggingface Diffusers Training Model Loading
| Knowledge Sources | |
|---|---|
| Domains | Diffusion_Models, Model_Loading, Transfer_Learning |
| Last Updated | 2026-02-13 21:00 GMT |
Overview
Loading model components individually for fine-tuning workflows enables selective weight freezing and targeted parameter-efficient training of specific subnetworks within a diffusion pipeline.
Description
A diffusion pipeline consists of multiple distinct components: a variational autoencoder (VAE) for encoding images to and from latent space, a denoising network (typically a UNet) that predicts noise, a text encoder that produces conditioning embeddings, a tokenizer that converts text to tokens, and a noise scheduler that controls the diffusion process. For fine-tuning, these components must be loaded individually rather than through the pipeline interface so that each can be independently configured.
Component-level loading means instantiating each model class separately using from_pretrained with the appropriate subfolder argument. A Stable Diffusion checkpoint on the Hub is organized as a directory with subfolders: unet/, vae/, text_encoder/, tokenizer/, and scheduler/. Each component class knows how to load its own weights from the corresponding subfolder.
Weight freezing is the practice of disabling gradient computation for model parameters that should not be updated during training. In LoRA fine-tuning, all original model weights are frozen (requires_grad_(False)) and only the newly injected LoRA adapter parameters are trainable. This dramatically reduces memory requirements and prevents catastrophic forgetting of the pretrained knowledge.
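A minimal self-contained sketch of this freeze-then-adapt pattern, using a hand-rolled LoRA layer on a toy `nn.Linear` (a stand-in for a UNet attention projection; this illustrates the idea, it is not the `peft` implementation):

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank adapter."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        self.lora_down = nn.Linear(base.in_features, rank, bias=False)
        self.lora_up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_up.weight)  # adapter starts as a zero delta

    def forward(self, x):
        return self.base(x) + self.lora_up(self.lora_down(x))

base = nn.Linear(16, 16)
base.requires_grad_(False)  # freeze the pretrained weights
layer = LoRALinear(base)

# Only the injected adapter parameters remain trainable.
trainable = [name for name, p in layer.named_parameters() if p.requires_grad]
```

Since the original weights never receive gradients, the optimizer state covers only the small adapter matrices, which is where the memory savings come from.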
Subfolder-based organization allows a single model repository to contain all components of a pipeline. Each component is stored in its own subfolder with its own config and weight files, enabling independent versioning and loading.
Usage
Use component-level model loading when:
- Fine-tuning only specific components (e.g., UNet for LoRA, text encoder for textual inversion)
- You need to freeze certain components while training others
- Loading models with specific dtype or device placement for memory efficiency
- Using different revisions or variants for different components
Theoretical Basis
Transfer Learning and Weight Freezing
In transfer learning, a pretrained model provides a strong initialization. The key decision is which parameters to update:
Full fine-tuning: theta_new = theta_pretrained - lr * grad(L, theta_pretrained)
(all parameters updated, high memory, risk of forgetting)
Frozen + adapters: theta_pretrained is fixed (requires_grad = False)
theta_adapter = theta_adapter - lr * grad(L, theta_adapter)
(only adapter parameters updated, low memory, preserves knowledge)
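The two update rules can be checked numerically with scalar stand-ins for the pretrained and adapter parameters: backpropagation produces a gradient only for the trainable adapter, so the frozen weight is never updated.

```python
import torch

theta_pre = torch.tensor([2.0], requires_grad=False)  # frozen pretrained weight
theta_ad = torch.tensor([0.0], requires_grad=True)    # trainable adapter weight

x = torch.tensor([3.0])
# loss = ((theta_pre + theta_ad) * x)^2 = (6)^2 = 36 at initialization
loss = ((theta_pre + theta_ad) * x).sum() ** 2
loss.backward()

# No gradient flows to the frozen parameter; the adapter gets
# d(loss)/d(theta_ad) = 2 * 6 * 3 = 36.
```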
Component Architecture of Stable Diffusion
The Stable Diffusion pipeline operates in a latent space:
Text Input --> [Tokenizer] --> [Text Encoder] --> text_embeddings
Image Input --> [VAE Encoder] --> latent (z)
Training: z + noise --> [UNet(z_noisy, t, text_embeddings)] --> noise_pred
loss = MSE(noise_pred, noise)
Inference: random noise --> [iterative UNet denoising] --> z_clean
z_clean --> [VAE Decoder] --> generated image
For LoRA training, only the UNet receives adapter layers. The VAE and text encoder remain frozen and are cast to half precision to save memory.
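The freeze-and-cast setup described above can be sketched with tiny stand-in modules in place of the real VAE, text encoder, and UNet; the pattern is the same for the actual diffusers components.

```python
import torch
import torch.nn as nn

# Stand-ins for the real pipeline components (illustration only).
vae, text_encoder, unet = nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4)

# Freeze everything; the LoRA adapter layers injected into the UNet later
# would be the only trainable parameters.
for model in (vae, text_encoder, unet):
    model.requires_grad_(False)

# Cast the frozen, inference-only components to half precision to save memory.
weight_dtype = torch.float16
vae.to(dtype=weight_dtype)
text_encoder.to(dtype=weight_dtype)
# The UNet stays in fp32 so the adapters can train in full precision.
```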