Principle: AUTOMATIC1111 Stable Diffusion WebUI Model and VAE Loading
| Knowledge Sources | |
|---|---|
| Domains | Diffusion Models, Model Management, Deep Learning Infrastructure |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Model and VAE loading is the process of instantiating and initializing the multi-component Stable Diffusion architecture from serialized checkpoint files, configuring each sub-network (UNet, CLIP, VAE) for inference.
Description
A Stable Diffusion model is not a single neural network but a composite of three major components:
- UNet (Denoising Network) -- The core diffusion model that iteratively removes noise from latent representations. It uses cross-attention layers to incorporate text conditioning from CLIP. For SD1.x, this is approximately 860M parameters.
- CLIP Text Encoder -- A transformer-based text encoder (from OpenAI's CLIP) that converts text prompts into embedding vectors. SD1.x uses CLIP ViT-L/14 (768-dimensional embeddings), SD2.x uses OpenCLIP ViT-H/14 (1024-dimensional), and SDXL uses a dual encoder (CLIP ViT-L at 768 dimensions plus OpenCLIP ViT-bigG at 1280 dimensions, concatenated into 2048-dimensional embeddings).
- VAE (Variational Autoencoder) -- Compresses pixel-space images (3 channels, RGB) into a compact latent space (4 channels) with a spatial downsampling factor of 8. The encoder is used for img2img; the decoder converts generated latents back to pixels.
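The shape relationships above (3 RGB channels to 4 latent channels, spatial downsampling by 8) can be sketched as simple arithmetic; the helper below is illustrative, not part of the webui codebase:

```python
# Latent-space shape arithmetic for Stable Diffusion's VAE:
# a (3, H, W) pixel image maps to a (4, H/8, W/8) latent.

def latent_shape(height: int, width: int) -> tuple:
    """Return the latent tensor shape for a pixel-space image of (3, H, W)."""
    assert height % 8 == 0 and width % 8 == 0, "dimensions must be multiples of 8"
    return (4, height // 8, width // 8)

print(latent_shape(512, 512))  # -> (4, 64, 64)
```

This is why generation dimensions must be multiples of 8: the latent grid has no way to represent fractional pixels.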
Checkpoint Formats
Models are distributed in two primary formats:
- .safetensors -- A safe, efficient binary format that does not allow arbitrary code execution. Preferred for security.
- .ckpt -- A PyTorch pickle-based format that can contain arbitrary Python objects. Requires trust in the source.
The loading process must detect which model variant (SD1.x, SD2.x, SDXL, SDXL Refiner) is contained in the checkpoint by examining the state dictionary keys and shapes.
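A minimal sketch of format-aware loading, assuming the `torch` and `safetensors` packages are available; the nested `"state_dict"` key handling reflects the common convention that .ckpt files sometimes wrap their weights:

```python
def checkpoint_format(path: str) -> str:
    """Classify a checkpoint file by its extension."""
    return "safetensors" if path.endswith(".safetensors") else "ckpt"

def load_state_dict(path: str) -> dict:
    if checkpoint_format(path) == "safetensors":
        from safetensors.torch import load_file   # no pickle, no code execution
        return load_file(path, device="cpu")
    import torch
    ckpt = torch.load(path, map_location="cpu")   # pickle-based: trust required
    return ckpt.get("state_dict", ckpt)           # some checkpoints nest weights
```

Loading to CPU first keeps VRAM free until the model is actually configured and moved to the GPU.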
VAE as a Separate Component
The VAE can be loaded independently from the main checkpoint, allowing users to swap VAEs without reloading the entire model. This is useful because:
- Some community-trained VAEs produce better color reproduction or fewer artifacts
- The VAE can be run at a different precision than the rest of the model (float32 or bfloat16 instead of float16) to avoid NaN outputs, which manifest as black images
- VAE files are relatively small (~300MB) compared to the full checkpoint (~2-7GB)
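Swapping the VAE amounts to overwriting one subtree of the model's state dictionary. The sketch below assumes the CompVis/LDM naming convention, where VAE weights live under the "first_stage_model." prefix; the function name is illustrative:

```python
# Graft a standalone VAE state dict onto a loaded LDM-style checkpoint.
# Assumption: the checkpoint stores its VAE under "first_stage_model.".

def apply_vae(model_sd: dict, vae_sd: dict) -> dict:
    merged = dict(model_sd)
    for key, tensor in vae_sd.items():
        merged["first_stage_model." + key] = tensor  # overwrite baked-in VAE
    return merged
```

Because only the VAE keys are touched, the UNet and text encoder weights stay resident, which is what makes VAE swapping cheap relative to a full reload.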
Usage
Model loading occurs at application startup, when the user selects a different checkpoint from the UI, or when a generation request specifies a different model. The process must handle:
- Memory management (unloading the previous model to free VRAM)
- Automatic configuration detection
- Half-precision conversion for efficiency
- Model hijacking for custom embedding support
- Textual inversion embedding reloading
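The memory-management step can be sketched as follows. The ordering matters: the old model must be released before the new one is loaded, so both never occupy VRAM at once. `load_model` and the `state` dict are hypothetical placeholders, not webui APIs:

```python
# Sketch of a checkpoint swap: free the old model, then load the new one.
import gc

def swap_checkpoint(state: dict, path: str, load_model):
    old = state.pop("model", None)
    if old is not None:
        del old                       # drop the last reference to the old weights
        gc.collect()                  # reclaim host memory
        try:
            import torch
            torch.cuda.empty_cache()  # return freed blocks to the CUDA allocator
        except ImportError:
            pass                      # CPU-only environment
    state["model"] = load_model(path)
    return state["model"]
```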
Theoretical Basis
Latent Diffusion Architecture
The Latent Diffusion Model (LDM) operates in a compressed latent space rather than pixel space:
Image (H, W, 3) --[VAE Encoder]--> Latent (H/8, W/8, 4) --[UNet Denoising]--> Denoised Latent --[VAE Decoder]--> Image (H, W, 3)
The UNet receives the noisy latent, a timestep embedding, and cross-attention conditioning from CLIP:
epsilon_theta(z_t, t, c) where:
z_t = noisy latent at timestep t
t = noise level / timestep
c = CLIP text embedding
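To see how a sampler consumes the noise prediction epsilon_theta, here is a deterministic DDIM update reduced to scalar arithmetic (one sampler among several; shown purely as illustration of the formula above):

```python
import math

def ddim_step(z_t: float, eps: float, alpha_t: float, alpha_prev: float) -> float:
    """One deterministic DDIM update (eta = 0) on a single latent value.

    z_t        -- noisy latent at timestep t
    eps        -- the model's noise prediction epsilon_theta(z_t, t, c)
    alpha_t    -- cumulative noise schedule product at t
    alpha_prev -- the same quantity at the previous (less noisy) timestep
    """
    # Predict the clean latent implied by the noise estimate ...
    z0_pred = (z_t - math.sqrt(1 - alpha_t) * eps) / math.sqrt(alpha_t)
    # ... then re-noise it to the previous timestep's noise level.
    return math.sqrt(alpha_prev) * z0_pred + math.sqrt(1 - alpha_prev) * eps
```

If the model predicted the added noise exactly, a single step with alpha_prev = 1 recovers the clean latent, which is a useful sanity check on the arithmetic.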
Configuration Detection
The model variant is identified by examining specific weight tensor keys in the state dictionary:
- SD1 CLIP: presence of cond_stage_model.transformer.text_model.embeddings.token_embedding.weight
- SD2 CLIP: presence of the OpenCLIP weight key variant
- SDXL: presence of conditioner.embedders.1.model.transformer keys
- SDXL Refiner: SDXL-style keys without the first conditioner embedder
The corresponding YAML configuration file is selected based on this detection, and the model is instantiated from that configuration.
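A sketch of that detection logic, checking the most specific key patterns first. The SD1 and SDXL key strings come from the list above; the refiner prefix and the SD2 fall-through are assumptions about the layout, not verified against the webui source:

```python
# Identify the model variant from state-dict keys (illustrative sketch).
SD1_KEY = "cond_stage_model.transformer.text_model.embeddings.token_embedding.weight"
SDXL_PREFIX = "conditioner.embedders.1.model.transformer"
REFINER_PREFIX = "conditioner.embedders.0.model.transformer"  # assumption

def detect_variant(state_dict: dict) -> str:
    keys = state_dict.keys()
    if any(k.startswith(SDXL_PREFIX) for k in keys):
        return "sdxl"
    if any(k.startswith(REFINER_PREFIX) for k in keys):
        return "sdxl-refiner"   # SDXL layout minus the first embedder
    if SD1_KEY in keys:
        return "sd1"
    return "sd2"                # fall through: assume the OpenCLIP layout
```

Each detected variant would then map to its YAML configuration file, from which the model classes are instantiated before the state dict is loaded into them.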