Principle:LaurentMazare Tch rs Stable Diffusion
| Knowledge Sources | |
|---|---|
| Domains | Generative Models, Diffusion Models, Text-to-Image, Computer Vision |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Latent diffusion models perform iterative denoising in a compressed latent space rather than pixel space, combining a text encoder for conditioning, a variational autoencoder for compression, and a UNet for noise prediction to generate high-resolution images efficiently.
Description
Diffusion models generate data by learning to reverse a gradual noising process. Starting from pure Gaussian noise, the model iteratively removes small amounts of noise until a clean sample emerges. While powerful, running this process directly in pixel space is computationally expensive for high-resolution images.
Stable Diffusion, an instance of Latent Diffusion Models (LDMs), addresses this by operating in a compressed latent space. The architecture consists of four major components:
1. CLIP Text Encoder
The CLIP (Contrastive Language-Image Pre-training) text encoder converts a text prompt into a sequence of embedding vectors. These embeddings capture the semantic content of the prompt and serve as the conditioning signal that guides the diffusion process toward generating images matching the description.
The text is first tokenized into integer token IDs, then processed through a transformer encoder. The output is a sequence of hidden states (one per token), which are injected into the UNet via cross-attention at each denoising step.
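As a rough illustration of how per-token hidden states can condition latent positions, the sketch below implements single-head scaled dot-product cross-attention in plain Python. It is a simplification and not the UNet's actual layer: the learned query, key, and value projections and the multiple heads are omitted, and `cross_attention`, `text_states`, and `latent_queries` are hypothetical names chosen for this example.

```python
import math


def softmax(scores):
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, text_states):
    """Each query (one latent position) attends over all text tokens.

    Assumption: learned projections are omitted, so the text hidden states
    act directly as both keys and values.
    """
    dim = len(text_states[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(dim)
            for key in text_states
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * value[j] for w, value in zip(weights, text_states))
            for j in range(dim)
        ]
        outputs.append(mixed)
    return outputs


# Three toy "token" states and two toy latent queries (made-up numbers).
text_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latent_queries = [[4.0, 0.0], [0.0, 0.0]]
attended = cross_attention(latent_queries, text_states)
```

A zero query scores every token equally, so its output is the plain average of the text states; a query aligned with a particular token pulls the output toward that token.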
2. Variational Autoencoder (VAE)
The VAE provides the bridge between pixel space and latent space:
- Encoder - Compresses an RGB image into a latent representation, with a spatial compression factor of 8 along each of height and width (for example, a 512x512 image becomes a 64x64 latent). This dramatically reduces the dimensionality the diffusion model must handle.
- Decoder - Reconstructs the full-resolution image from the latent representation after denoising is complete.
The VAE is trained separately and held fixed during diffusion model training. It learns a perceptually meaningful latent space where small changes in latent codes produce small perceptual changes in images.
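To make the size reduction concrete, the following sketch computes the latent shape and the overall element-count ratio for a given image size. The 4 latent channels are an assumption taken from the `[1, 4, 64, 64]` latent shape used in the inference pipeline at the end of this page; `latent_shape` and `compression_ratio` are hypothetical helper names.

```python
def latent_shape(height, width, latent_channels=4, factor=8):
    """Return (channels, height, width) of the latent for an image of the given size."""
    if height % factor or width % factor:
        raise ValueError("height and width must be multiples of the compression factor")
    return (latent_channels, height // factor, width // factor)


def compression_ratio(height, width, image_channels=3):
    """Ratio of image elements to latent elements."""
    c, h, w = latent_shape(height, width)
    return (image_channels * height * width) / (c * h * w)


shape = latent_shape(512, 512)       # (4, 64, 64)
ratio = compression_ratio(512, 512)  # 48.0
```

Under these assumptions, a 512x512 RGB image has 48 times as many values as its latent, which is where most of the efficiency gain comes from.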
3. UNet Denoiser
The UNet is the core of the diffusion model. It predicts the noise component in a noisy latent, conditioned on:
- The noisy latent $z_t$ at the current timestep.
- The timestep $t$, encoded as a sinusoidal embedding.
- The text embeddings from CLIP, incorporated via cross-attention layers.
The UNet follows an encoder-decoder architecture with skip connections, residual blocks, self-attention layers (for spatial coherence), and cross-attention layers (for text conditioning).
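The sinusoidal timestep encoding mentioned above can be sketched as follows. This is one common formulation, not necessarily the one used by any particular implementation: the ordering of the sine and cosine halves, the `max_period` of 10000, and the function name `timestep_embedding` are assumptions.

```python
import math


def timestep_embedding(t, dim, max_period=10000.0):
    """Map a scalar timestep to a vector of sines and cosines at geometric frequencies."""
    if dim % 2:
        raise ValueError("dim must be even")
    half = dim // 2
    # Frequencies decay geometrically from 1 toward 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


emb_start = timestep_embedding(0, 8)
emb_late = timestep_embedding(500, 8)
```

Because nearby timesteps produce nearby vectors while distant ones differ across several frequencies, the network can tell how much noise to expect without needing a separate set of weights per timestep.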
4. Noise Scheduler
The scheduler controls the noise schedule and the sampling process. Two common schedulers are:
- DDPM (Denoising Diffusion Probabilistic Models) - The original formulation requiring many steps (typically 1000) for high quality.
- DDIM (Denoising Diffusion Implicit Models) - A non-Markovian variant that allows skipping steps, producing good results in 20-50 steps.
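As an illustration of step skipping, the sketch below picks an evenly spaced, descending subset of the training timesteps. Real schedulers differ in how they space and offset these indices, so treat `sampling_timesteps` as a hypothetical helper showing one simple convention.

```python
def sampling_timesteps(num_train_steps, num_inference_steps):
    """Return evenly spaced timestep indices, from most noisy to least noisy."""
    if not 0 < num_inference_steps <= num_train_steps:
        raise ValueError("num_inference_steps must be in 1..num_train_steps")
    stride = num_train_steps // num_inference_steps
    return [i * stride for i in reversed(range(num_inference_steps))]


steps = sampling_timesteps(1000, 50)  # 980, 960, ..., 20, 0
```

With 1000 training steps and 50 sampling steps, the sampler visits every 20th timestep, so the UNet runs 50 times instead of 1000.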
Usage
Latent diffusion models are used for:
- Text-to-image generation - Generating images from natural language descriptions.
- Image-to-image translation - Modifying existing images guided by text prompts (by starting from a partially-noised encoding of the input image).
- Inpainting - Filling in masked regions of an image conditioned on surrounding context and text.
- Super-resolution - Enhancing low-resolution images to higher resolution.
- Controlled generation - Using techniques like classifier-free guidance to adjust the strength of text conditioning.
Theoretical Basis
Forward Diffusion Process
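The forward process gradually adds Gaussian noise to a data sample $z_0$ over $T$ timesteps according to a variance schedule $\beta_1, \dots, \beta_T$:
$$q(z_t \mid z_{t-1}) = \mathcal{N}\left(z_t;\ \sqrt{1 - \beta_t}\, z_{t-1},\ \beta_t I\right)$$
Using $\alpha_t = 1 - \beta_t$ and the cumulative product $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, any timestep can be sampled directly:
$$q(z_t \mid z_0) = \mathcal{N}\left(z_t;\ \sqrt{\bar{\alpha}_t}\, z_0,\ (1 - \bar{\alpha}_t) I\right)$$
Equivalently, $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.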
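The closed-form sampling of a noisy sample at an arbitrary timestep can be sketched in a few lines. The linear schedule from 1e-4 to 0.02 over 1000 steps is an assumption borrowed from the original DDPM setup (Stable Diffusion checkpoints typically use different values), and `linear_betas`, `cumulative_alphas`, and `q_sample` are hypothetical names. Timesteps are 0-indexed in the code.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Variance schedule that grows linearly from beta_start to beta_end."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Running product of (1 - beta): the fraction of signal variance kept at each step."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Jump straight to timestep t: sqrt(a_bar) * x0 + sqrt(1 - a_bar) * noise."""
    a_bar = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * n for x, n in zip(x0, noise)]
    return x_t, noise


betas = linear_betas(1000)
alpha_bars = cumulative_alphas(betas)
x_t, noise = q_sample([0.5, -1.0, 2.0], t=999, alpha_bars=alpha_bars, rng=random.Random(0))
```

At the final timestep the cumulative product is close to zero, so the sample is almost indistinguishable from the noise that was drawn.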
Reverse Process (Denoising)
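The model learns the reverse transition:
$$p_\theta(z_{t-1} \mid z_t, c) = \mathcal{N}\left(z_{t-1};\ \mu_\theta(z_t, t, c),\ \sigma_t^2 I\right)$$
Rather than predicting $\mu_\theta$ directly, the model predicts the noise $\epsilon_\theta(z_t, t, c)$ (where $c$ is the text conditioning), and the mean is computed as:
$$\mu_\theta(z_t, t, c) = \frac{1}{\sqrt{\alpha_t}} \left( z_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(z_t, t, c) \right)$$
where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.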
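A minimal sketch of turning a noise prediction into the reverse-step mean, following the standard DDPM parameterization. The function name `reverse_mean` and the numeric inputs are made up for illustration; in practice the noise prediction would come from the UNet.

```python
import math


def reverse_mean(x_t, eps_pred, beta_t, alpha_bar_t):
    """Mean of the reverse step: (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)."""
    alpha_t = 1.0 - beta_t
    coef = beta_t / math.sqrt(1.0 - alpha_bar_t)
    return [(x - coef * e) / math.sqrt(alpha_t) for x, e in zip(x_t, eps_pred)]


# With a zero noise prediction, the mean is just x_t rescaled by 1 / sqrt(alpha_t).
mean_zero_noise = reverse_mean([1.0, -2.0], [0.0, 0.0], beta_t=0.02, alpha_bar_t=0.5)
# A positive noise prediction pulls the mean downward.
mean_with_noise = reverse_mean([1.0, -2.0], [1.0, 1.0], beta_t=0.02, alpha_bar_t=0.5)
```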
Training Objective
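The simplified training loss is:
$$L_{\text{simple}} = \mathbb{E}_{z_0,\, c,\, \epsilon,\, t} \left[ \left\| \epsilon - \epsilon_\theta(z_t, t, c) \right\|^2 \right]$$
where $t$ is sampled uniformly from $\{1, \dots, T\}$, $\epsilon \sim \mathcal{N}(0, I)$, and $z_t$ is constructed from $z_0$ and $\epsilon$ using the forward process formula.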
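The objective for a single training example can be sketched as below. The zero-output predictor is a stand-in for an untrained network, the value of `alpha_bar_t` is arbitrary, and the loss is averaged over elements (which differs from the squared norm only by a constant factor). All names are hypothetical.

```python
import math
import random


def noisy_sample(x0, eps, alpha_bar_t):
    """Forward-process sample built from clean data x0 and known noise eps."""
    return [
        math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
        for x, e in zip(x0, eps)
    ]


def simple_loss(eps, eps_pred):
    """Mean squared error between the true noise and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)


def zero_predictor(x_t):
    # Stand-in for an untrained model: always predicts zero noise.
    return [0.0] * len(x_t)


rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]
alpha_bar_t = 0.3  # arbitrary value for illustration
eps = [rng.gauss(0.0, 1.0) for _ in x0]
x_t = noisy_sample(x0, eps, alpha_bar_t)

loss_untrained = simple_loss(eps, zero_predictor(x_t))
loss_perfect = simple_loss(eps, eps)  # a model that recovers the noise exactly
```

The training target is the noise itself, which is known exactly because the training code generated it; no per-step ground truth from a slow sampling chain is needed.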
DDIM Sampling
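DDIM allows deterministic sampling with fewer steps. Given a subsequence of timesteps $\tau_1 < \tau_2 < \dots < \tau_S$ where $S \ll T$, each update first estimates the clean sample and then re-noises it to the previous timestep in the subsequence:
$$z_{\tau_{i-1}} = \sqrt{\bar{\alpha}_{\tau_{i-1}}} \left( \frac{z_{\tau_i} - \sqrt{1 - \bar{\alpha}_{\tau_i}}\, \epsilon_\theta(z_{\tau_i}, \tau_i, c)}{\sqrt{\bar{\alpha}_{\tau_i}}} \right) + \sqrt{1 - \bar{\alpha}_{\tau_{i-1}}}\, \epsilon_\theta(z_{\tau_i}, \tau_i, c)$$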
This is deterministic (no added noise), making the generation process reproducible for a given initial noise sample.
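A single deterministic DDIM update can be sketched as follows, taking the cumulative products for the current and previous timesteps as plain numbers. `ddim_step` is a hypothetical name, and the example feeds in the true noise instead of a model prediction so the result can be checked by hand.

```python
import math


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update (no noise added)."""
    sqrt_a_t = math.sqrt(alpha_bar_t)
    sqrt_one_minus_a_t = math.sqrt(1.0 - alpha_bar_t)
    # Step 1: estimate the clean sample implied by the noise prediction.
    x0_pred = [(x - sqrt_one_minus_a_t * e) / sqrt_a_t for x, e in zip(x_t, eps_pred)]
    # Step 2: move that estimate to the noise level of the previous timestep.
    return [
        math.sqrt(alpha_bar_prev) * x0 + math.sqrt(1.0 - alpha_bar_prev) * e
        for x0, e in zip(x0_pred, eps_pred)
    ]


x0 = [0.5, -1.0]
eps = [0.3, 0.7]
alpha_bar_t = 0.4
x_t = [math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e for x, e in zip(x0, eps)]

# With a perfect noise prediction and a target noise level of zero
# (alpha_bar_prev = 1), one step recovers the clean sample.
recovered = ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev=1.0)
```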
Classifier-Free Guidance
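To strengthen text conditioning, classifier-free guidance combines the unconditional and conditional noise predictions, pushing the result away from the unconditional prediction and toward the conditional one:
$$\tilde{\epsilon}_\theta(z_t, t, c) = \epsilon_\theta(z_t, t, \varnothing) + s \left( \epsilon_\theta(z_t, t, c) - \epsilon_\theta(z_t, t, \varnothing) \right)$$
where $s$ is the guidance scale (typically 7-8.5), $c$ is the text conditioning, and $\varnothing$ represents unconditional (empty prompt) generation. A scale of 1 recovers the conditional prediction, and larger values extrapolate beyond it. Higher guidance scales produce images that more closely match the text but with less diversity.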
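The guidance combination is a one-line element-wise operation, matching the `noise_uncond + guidance_scale * (noise_cond - noise_uncond)` line in the inference pipeline on this page. The function name `guided_noise` and the toy vectors are made up for this sketch.

```python
def guided_noise(noise_uncond, noise_cond, guidance_scale):
    """Move from the unconditional prediction toward (and past) the conditional one."""
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]


uncond = [0.0, 1.0]
cond = [1.0, 3.0]

same_as_uncond = guided_noise(uncond, cond, 0.0)  # scale 0: text is ignored
same_as_cond = guided_noise(uncond, cond, 1.0)    # scale 1: plain conditional prediction
amplified = guided_noise(uncond, cond, 7.5)       # scale > 1: extrapolates past cond
```

Each denoising step therefore needs two UNet evaluations, one with the prompt embeddings and one with the empty-prompt embeddings.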
Latent Space Encoding/Decoding
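The VAE encoder $\mathcal{E}$ maps an image $x$ to latent parameters, from which a latent is sampled:
$$(\mu, \sigma) = \mathcal{E}(x), \qquad z_0 = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
During inference (after denoising), the decoder $\mathcal{D}$ reconstructs the image:
$$\hat{x} = \mathcal{D}(z_0)$$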
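Sampling a latent from encoder-produced parameters is usually done with the reparameterization trick, sketched below under the assumption that the encoder yields a per-element mean and standard deviation (many implementations output a log-variance instead). `sample_latent` and the numbers are hypothetical.

```python
import random


def sample_latent(mean, std, rng):
    """Reparameterization: latent = mean + std * standard normal noise."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]


rng = random.Random(0)
mean = [0.2, -0.4, 1.1]

# Zero standard deviation collapses the sample onto the mean.
deterministic = sample_latent(mean, [0.0, 0.0, 0.0], rng)
# Non-zero standard deviation gives a stochastic latent around the mean.
stochastic = sample_latent(mean, [0.1, 0.1, 0.1], rng)
```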
Complete Inference Pipeline
INPUT: text_prompt, num_steps, guidance_scale
// Text encoding
tokens = tokenize(text_prompt)
text_embeddings = clip_encoder(tokens)
uncond_embeddings = clip_encoder(tokenize(""))
// Initialize from noise
latents = random_normal(shape=[1, 4, 64, 64])
timesteps = scheduler.get_timesteps(num_steps)
// Iterative denoising
FOR t IN timesteps:
    // Classifier-free guidance: predict noise for both
    noise_cond = unet(latents, t, text_embeddings)
    noise_uncond = unet(latents, t, uncond_embeddings)
    noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
    // Scheduler step
    latents = scheduler.step(noise_pred, t, latents)
// Decode to pixel space
image = vae_decoder(latents)
image = postprocess(image) // Scale to [0, 255], convert to uint8
RETURN image
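The final `postprocess` step can be sketched as below. It assumes the decoder emits values nominally in [-1, 1], which is a common convention but is not stated on this page; out-of-range values are clamped before scaling. `to_uint8` is a hypothetical helper.

```python
def to_uint8(value):
    """Map a decoder output in [-1, 1] to an integer pixel value in [0, 255]."""
    clamped = min(max(value, -1.0), 1.0)
    scaled = (clamped + 1.0) / 2.0   # now in [0, 1]
    return int(scaled * 255.0 + 0.5)  # round to nearest integer


def postprocess(values):
    return [to_uint8(v) for v in values]


pixels = postprocess([-1.0, 0.0, 1.0, 2.5])
```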