Principle: AUTOMATIC1111 Stable Diffusion WebUI Model and VAE Loading
| Knowledge Sources | |
|---|---|
| Domains | Diffusion Models, Model Management, Deep Learning Infrastructure |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Model and VAE loading is the process of instantiating and initializing the multi-component Stable Diffusion architecture from serialized checkpoint files, configuring each sub-network (UNet, CLIP, VAE) for inference.
Description
A Stable Diffusion model is not a single neural network but a composite of three major components:
- UNet (Denoising Network) -- The core diffusion model that iteratively removes noise from latent representations. It uses cross-attention layers to incorporate text conditioning from CLIP. For SD1.x, this is approximately 860M parameters.
- CLIP Text Encoder -- A transformer-based text encoder (from OpenAI's CLIP) that converts text prompts into embedding vectors. SD1.x uses CLIP ViT-L/14 (768-dimensional embeddings), SD2.x uses OpenCLIP ViT-H/14 (1024-dimensional), and SDXL uses a dual encoder (CLIP ViT-L at 768 dimensions plus OpenCLIP ViT-bigG at 1280 dimensions, concatenated into 2048-dimensional embeddings).
- VAE (Variational Autoencoder) -- Compresses pixel-space images (3 channels, RGB) into a compact latent space (4 channels) with a spatial downsampling factor of 8. The encoder is used for img2img; the decoder converts generated latents back to pixels.
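The shape relationships above (3 RGB channels to 4 latent channels, spatial downsampling by 8) can be sketched as simple arithmetic; the helper below is illustrative, not part of the webui codebase:

```python
# Latent-space shape arithmetic for Stable Diffusion's VAE:
# a (3, H, W) pixel image maps to a (4, H/8, W/8) latent.

def latent_shape(height: int, width: int) -> tuple:
    """Return the latent tensor shape for a pixel-space image of (3, H, W)."""
    assert height % 8 == 0 and width % 8 == 0, "dimensions must be multiples of 8"
    return (4, height // 8, width // 8)

print(latent_shape(512, 512))  # -> (4, 64, 64)
```

This is why generation dimensions must be multiples of 8: the latent grid has no way to represent fractional pixels.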
Checkpoint Formats
Models are distributed in two primary formats:
- .safetensors -- A safe, efficient binary format that does not allow arbitrary code execution. Preferred for security.
- .ckpt -- A PyTorch pickle-based format that can contain arbitrary Python objects. Requires trust in the source.
The loading process must detect which model variant (SD1.x, SD2.x, SDXL, SDXL Refiner) is contained in the checkpoint by examining the state dictionary keys and shapes.
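A minimal sketch of format-aware loading, assuming the `torch` and `safetensors` packages are available; the nested `"state_dict"` key handling reflects the common convention that .ckpt files sometimes wrap their weights:

```python
def checkpoint_format(path: str) -> str:
    """Classify a checkpoint file by its extension."""
    return "safetensors" if path.endswith(".safetensors") else "ckpt"

def load_state_dict(path: str) -> dict:
    if checkpoint_format(path) == "safetensors":
        from safetensors.torch import load_file   # no pickle, no code execution
        return load_file(path, device="cpu")
    import torch
    ckpt = torch.load(path, map_location="cpu")   # pickle-based: trust required
    return ckpt.get("state_dict", ckpt)           # some checkpoints nest weights
```

Loading to CPU first keeps VRAM free until the model is actually configured and moved to the GPU.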
VAE as a Separate Component
The VAE can be loaded independently from the main checkpoint, allowing users to swap VAEs without reloading the entire model. This is useful because:
- Some community-trained VAEs produce better color reproduction or fewer artifacts
- The VAE can be run at a different precision than the rest of the model (float32 or bfloat16 instead of float16) to avoid NaN outputs, which manifest as black images
- VAE files are relatively small (~300MB) compared to the full checkpoint (~2-7GB)
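Swapping the VAE amounts to overwriting one subtree of the model's state dictionary. The sketch below assumes the CompVis/LDM naming convention, where VAE weights live under the "first_stage_model." prefix; the function name is illustrative:

```python
# Graft a standalone VAE state dict onto a loaded LDM-style checkpoint.
# Assumption: the checkpoint stores its VAE under "first_stage_model.".

def apply_vae(model_sd: dict, vae_sd: dict) -> dict:
    merged = dict(model_sd)
    for key, tensor in vae_sd.items():
        merged["first_stage_model." + key] = tensor  # overwrite baked-in VAE
    return merged
```

Because only the VAE keys are touched, the UNet and text encoder weights stay resident, which is what makes VAE swapping cheap relative to a full reload.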
Usage
Model loading occurs at application startup, when the user selects a different checkpoint from the UI, or when a generation request specifies a different model. The process must handle:
- Memory management (unloading the previous model to free VRAM)
- Automatic configuration detection
- Half-precision conversion for efficiency
- Model hijacking for custom embedding support
- Textual inversion embedding reloading
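The memory-management step can be sketched as follows. The ordering matters: the old model must be released before the new one is loaded, so both never occupy VRAM at once. `load_model` and the `state` dict are hypothetical placeholders, not webui APIs:

```python
# Sketch of a checkpoint swap: free the old model, then load the new one.
import gc

def swap_checkpoint(state: dict, path: str, load_model):
    old = state.pop("model", None)
    if old is not None:
        del old                       # drop the last reference to the old weights
        gc.collect()                  # reclaim host memory
        try:
            import torch
            torch.cuda.empty_cache()  # return freed blocks to the CUDA allocator
        except ImportError:
            pass                      # CPU-only environment
    state["model"] = load_model(path)
    return state["model"]
```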
Theoretical Basis
Latent Diffusion Architecture
The Latent Diffusion Model (LDM) operates in a compressed latent space rather than pixel space:
Image (H, W, 3) --[VAE Encoder]--> Latent (H/8, W/8, 4) --[UNet Denoising]--> Denoised Latent --[VAE Decoder]--> Image (H, W, 3)
The UNet receives the noisy latent, a timestep embedding, and cross-attention conditioning from CLIP:
epsilon_theta(z_t, t, c) where:
z_t = noisy latent at timestep t
t = noise level / timestep
c = CLIP text embedding
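To see how a sampler consumes the noise prediction epsilon_theta, here is a deterministic DDIM update reduced to scalar arithmetic (one sampler among several; shown purely as illustration of the formula above):

```python
import math

def ddim_step(z_t: float, eps: float, alpha_t: float, alpha_prev: float) -> float:
    """One deterministic DDIM update (eta = 0) on a single latent value.

    z_t        -- noisy latent at timestep t
    eps        -- the model's noise prediction epsilon_theta(z_t, t, c)
    alpha_t    -- cumulative noise schedule product at t
    alpha_prev -- the same quantity at the previous (less noisy) timestep
    """
    # Predict the clean latent implied by the noise estimate ...
    z0_pred = (z_t - math.sqrt(1 - alpha_t) * eps) / math.sqrt(alpha_t)
    # ... then re-noise it to the previous timestep's noise level.
    return math.sqrt(alpha_prev) * z0_pred + math.sqrt(1 - alpha_prev) * eps
```

If the model predicted the added noise exactly, a single step with alpha_prev = 1 recovers the clean latent, which is a useful sanity check on the arithmetic.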
Configuration Detection
The model variant is identified by examining specific weight tensor keys in the state dictionary:
- SD1 CLIP: presence of cond_stage_model.transformer.text_model.embeddings.token_embedding.weight
- SD2 CLIP: presence of the OpenCLIP weight key variant
- SDXL: presence of conditioner.embedders.1.model.transformer keys
- SDXL Refiner: SDXL-style keys without the first conditioner embedder
The corresponding YAML configuration file is selected based on this detection, and the model is instantiated from that configuration.
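A sketch of that detection logic, checking the most specific key patterns first. The SD1 and SDXL key strings come from the list above; the refiner prefix and the SD2 fall-through are assumptions about the layout, not verified against the webui source:

```python
# Identify the model variant from state-dict keys (illustrative sketch).
SD1_KEY = "cond_stage_model.transformer.text_model.embeddings.token_embedding.weight"
SDXL_PREFIX = "conditioner.embedders.1.model.transformer"
REFINER_PREFIX = "conditioner.embedders.0.model.transformer"  # assumption

def detect_variant(state_dict: dict) -> str:
    keys = state_dict.keys()
    if any(k.startswith(SDXL_PREFIX) for k in keys):
        return "sdxl"
    if any(k.startswith(REFINER_PREFIX) for k in keys):
        return "sdxl-refiner"   # SDXL layout minus the first embedder
    if SD1_KEY in keys:
        return "sd1"
    return "sd2"                # fall through: assume the OpenCLIP layout
```

Each detected variant would then map to its YAML configuration file, from which the model classes are instantiated before the state dict is loaded into them.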