Principle:LaurentMazare Tch rs Stable Diffusion

Knowledge Sources	LaurentMazare_Tch_rs; High-Resolution Image Synthesis with Latent Diffusion Models (Rombach et al., 2022); Denoising Diffusion Probabilistic Models (Ho et al., 2020); Denoising Diffusion Implicit Models (Song et al., 2021); Learning Transferable Visual Models From Natural Language Supervision (Radford et al., 2021)
Domains	Generative Models, Diffusion Models, Text-to-Image, Computer Vision
Last Updated	2026-02-08 00:00 GMT

Overview

Latent diffusion models perform iterative denoising in a compressed latent space rather than pixel space, combining a text encoder for conditioning, a variational autoencoder for compression, and a UNet for noise prediction to generate high-resolution images efficiently.

Description

Diffusion models generate data by learning to reverse a gradual noising process. Starting from pure Gaussian noise, the model iteratively removes small amounts of noise until a clean sample emerges. While powerful, running this process directly in pixel space is computationally expensive for high-resolution images.

Stable Diffusion, an instance of Latent Diffusion Models (LDMs), addresses this by running the diffusion process in a compressed latent space. The architecture consists of four major components:

1. CLIP Text Encoder

The CLIP (Contrastive Language-Image Pre-training) text encoder converts a text prompt into a sequence of embedding vectors. These embeddings capture the semantic content of the prompt and serve as the conditioning signal that guides the diffusion process toward generating images matching the description.

The text is first tokenized into integer token IDs, then processed through a transformer encoder. The output is a sequence of hidden states (one per token), which are injected into the UNet via cross-attention at each denoising step.
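How the per-token hidden states enter the UNet can be illustrated with a toy, pure-Python sketch of cross-attention. The function names, the tiny dimensions, and the absence of learned query/key/value projections and multiple heads are simplifying assumptions made here, not details taken from the source.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections.

    queries: one vector per latent position (from the UNet feature map)
    keys/values: one vector per text token (from the text encoder)
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the token value vectors
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        out.append(mixed)
    return out

# Two latent positions attend over three token embeddings (toy numbers).
latent_queries = [[1.0, 0.0], [0.0, 1.0]]
token_states = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
attended = cross_attention(latent_queries, token_states, token_states)
```

Each latent position ends up with a mixture of token vectors weighted toward the tokens it matches best, which is how the prompt can influence different image regions differently.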

2. Variational Autoencoder (VAE)

The VAE provides the bridge between pixel space and latent space:

Encoder - Compresses a $512 \times 512 \times 3$ RGB image into a $64 \times 64 \times 4$ latent representation (a spatial compression factor of 8x in each dimension). This dramatically reduces the dimensionality the diffusion model must handle.
Decoder - Reconstructs the full-resolution image from the latent representation after denoising is complete.

The VAE is trained separately and held fixed during diffusion model training. It learns a perceptually meaningful latent space where small changes in latent codes produce small perceptual changes in images.
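The size of the saving can be checked with simple arithmetic. The helper name `num_values` below is introduced only for illustration.

```python
def num_values(height, width, channels):
    """Number of scalar values in a height x width x channels tensor."""
    return height * width * channels

pixel_values = num_values(512, 512, 3)    # RGB image
latent_values = num_values(64, 64, 4)     # VAE latent
spatial_factor = 512 // 64                # per-dimension downsampling
reduction = pixel_values / latent_values  # overall reduction in element count

print(pixel_values, latent_values, spatial_factor, reduction)
# prints: 786432 16384 8 48.0
```

So the UNet processes 48 times fewer values per sample than it would in pixel space at this resolution.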

3. UNet Denoiser

The UNet is the core of the diffusion model. It predicts the noise component in a noisy latent, conditioned on:

The noisy latent $z_{t}$ at the current timestep.
The timestep $t$, encoded as a sinusoidal embedding.
The text embeddings from CLIP, incorporated via cross-attention layers.

The UNet follows an encoder-decoder architecture with skip connections, residual blocks, self-attention layers (for spatial coherence), and cross-attention layers (for text conditioning).
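A sinusoidal timestep embedding can be sketched as follows. The function name, the `max_period` value of 10000, and the sine-then-cosine ordering are assumptions for illustration; implementations differ in ordering and usually pass the result through a small learned MLP.

```python
import math

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep into `dim` values (dim even)."""
    half = dim // 2
    # Geometrically spaced frequencies, starting at 1 and decaying toward 1/max_period
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    args = [t * f for f in freqs]
    # Sine/cosine ordering is a convention; some implementations put cosines first
    return [math.sin(a) for a in args] + [math.cos(a) for a in args]

emb0 = timestep_embedding(0, 8)
emb500 = timestep_embedding(500, 8)
```

Because every component is bounded and varies at a different rate with $t$, nearby timesteps get similar embeddings while distant ones remain distinguishable.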

4. Noise Scheduler

The scheduler defines the noise schedule (the variance added at each timestep) and the update rule applied at each sampling step. Two common schedulers are:

DDPM (Denoising Diffusion Probabilistic Models) - The original formulation requiring many steps (typically 1000) for high quality.
DDIM (Denoising Diffusion Implicit Models) - A non-Markovian variant that allows skipping steps, producing good results in 20-50 steps.
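Step skipping amounts to picking a short, evenly spaced subsequence of the training timesteps. The sketch below is one plausible way to do this; the function name and the `offset` parameter are hypothetical, and specific schedulers may space or offset their timesteps differently.

```python
def inference_timesteps(train_steps, inference_steps, offset=0):
    """Evenly spaced timesteps in descending order (most noisy first)."""
    step_ratio = train_steps // inference_steps
    ascending = [i * step_ratio + offset for i in range(inference_steps)]
    return ascending[::-1]

timesteps = inference_timesteps(1000, 50)
# timesteps starts at 980 and ends at 0, in steps of 20
```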

Usage

Latent diffusion models are used for:

Text-to-image generation - Generating images from natural language descriptions.
Image-to-image translation - Modifying existing images guided by text prompts (by starting from a partially-noised encoding of the input image).
Inpainting - Filling in masked regions of an image conditioned on surrounding context and text.
Super-resolution - Enhancing low-resolution images to higher resolution.
Controlled generation - Using techniques like classifier-free guidance to adjust the strength of text conditioning.

Theoretical Basis

Forward Diffusion Process

The forward process gradually adds Gaussian noise to a data sample $z_{0}$ over $T$ timesteps according to a variance schedule $β_{1}, β_{2}, \dots, β_{T}$:

$q (z_{t} | z_{t - 1}) = 𝒩 (z_{t}; \sqrt{1 - β_{t}} z_{t - 1}, β_{t} I)$

Using the cumulative product ${\bar{α}}_{t} = \prod_{s = 1}^{t} (1 - β_{s})$, any timestep can be sampled directly from $z_{0}$:

$z_{t} = \sqrt{{\bar{α}}_{t}} z_{0} + \sqrt{1 - {\bar{α}}_{t}} ϵ, \quad ϵ \sim 𝒩 (0, I)$
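The closed-form sampling above can be sketched in plain Python. The linear schedule endpoints (1e-4 to 0.02 over 1000 steps) are the values used in Ho et al., 2020 and are an assumption here; Stable Diffusion checkpoints typically ship their own schedule. All function names are illustrative.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variance schedule beta_1..beta_T (endpoints assumed)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s); list index 0 holds t = 1."""
    out, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        out.append(running)
    return out

def q_sample(z0, t, alpha_bars, rng):
    """Draw z_t directly from z_0 (t is 1-based); returns (z_t, eps)."""
    a = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]
    return zt, eps

betas = linear_betas(1000)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)
z0 = [1.0, -0.5, 0.25]
z_t, eps = q_sample(z0, t=500, alpha_bars=alpha_bars, rng=rng)
```

As $t$ grows, ${\bar{α}}_{t}$ shrinks toward zero, so $z_{t}$ carries less of $z_{0}$ and approaches pure Gaussian noise.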

Reverse Process (Denoising)

The model learns the reverse transition:

$p_{θ} (z_{t - 1} | z_{t}) = 𝒩 (z_{t - 1}; μ_{θ} (z_{t}, t), σ_{t}^{2} I)$

Rather than predicting $μ_{θ}$ directly, the model predicts the noise $ϵ_{θ} (z_{t}, t, c)$ (where $c$ is the text conditioning), and the mean is computed as:

$μ_{θ} (z_{t}, t) = \frac{1}{\sqrt{1 - β_{t}}} (z_{t} - \frac{β_{t}}{\sqrt{1 - {\bar{α}}_{t}}} ϵ_{θ} (z_{t}, t, c))$
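As a sanity check on the mean formula, the sketch below compares it with the posterior mean of $q(z_{t-1} | z_{t}, z_{0})$ from Ho et al., 2020 when the noise prediction is exact. The scalar setting, the toy schedule values, and the function names are assumptions chosen for illustration.

```python
import math

def reverse_mean(z_t, eps_pred, beta_t, alpha_bar_t):
    """mu_theta computed from a noise prediction, following the formula above."""
    scaled_noise = beta_t / math.sqrt(1.0 - alpha_bar_t) * eps_pred
    return (z_t - scaled_noise) / math.sqrt(1.0 - beta_t)

def posterior_mean(z0, z_t, beta_t, alpha_bar_t, alpha_bar_prev):
    """Mean of q(z_{t-1} | z_t, z_0), used here only for comparison."""
    coef_z0 = math.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t)
    coef_zt = math.sqrt(1.0 - beta_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    return coef_z0 * z0 + coef_zt * z_t

# Toy numbers (assumed): a single step of some schedule
beta_t = 0.02
alpha_bar_prev = 0.5
alpha_bar_t = alpha_bar_prev * (1.0 - beta_t)

z0, eps = 0.8, -0.3
z_t = math.sqrt(alpha_bar_t) * z0 + math.sqrt(1.0 - alpha_bar_t) * eps

mu_from_eps = reverse_mean(z_t, eps, beta_t, alpha_bar_t)
mu_posterior = posterior_mean(z0, z_t, beta_t, alpha_bar_t, alpha_bar_prev)
```

The two values agree up to floating-point error, which is why predicting the noise is equivalent to predicting the mean.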

Training Objective

The simplified training loss is:

$L = 𝔼_{z_{0}, ϵ, t} [‖ ϵ - ϵ_{θ} (z_{t}, t, c) ‖^{2}]$

where $t$ is sampled uniformly from $\{1, \dots, T\}$, $ϵ \sim 𝒩 (0, I)$, and $z_{t}$ is constructed from $z_{0}$ and $ϵ$ using the forward process formula.
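One Monte Carlo sample of this loss could be computed as in the sketch below. The `eps_model` callable is a hypothetical stand-in for the UNet, and the four-step list of cumulative products is a toy schedule; neither comes from the source.

```python
import math
import random

def training_loss_sample(z0, alpha_bars, eps_model, cond, rng):
    """One sample of the simplified loss ||eps - eps_theta(z_t, t, c)||^2."""
    t = rng.randint(1, len(alpha_bars))          # uniform over {1, ..., T}
    a = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    z_t = [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]
    pred = eps_model(z_t, t, cond)
    return sum((e - p) ** 2 for e, p in zip(eps, pred))

# Hypothetical stand-in for the UNet: always predicts zero noise.
def zero_model(z_t, t, cond):
    return [0.0 for _ in z_t]

alpha_bars = [0.9, 0.7, 0.4, 0.1]   # toy cumulative products, T = 4
rng = random.Random(0)
loss = training_loss_sample([1.0, -1.0], alpha_bars, zero_model, cond=None, rng=rng)
```

In actual training this quantity would be averaged over a batch and minimized by gradient descent on the UNet parameters.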

DDIM Sampling

DDIM allows deterministic sampling with fewer steps. Given a subsequence of timesteps $τ_{1} < τ_{2} < \dots < τ_{S}$ where $S ≪ T$, each update moves from $τ_{i}$ to $τ_{i - 1}$ (with ${\bar{α}}_{τ_{0}} = 1$ for the final step):

$z_{τ_{i - 1}} = \sqrt{{\bar{α}}_{τ_{i - 1}}} (\frac{z_{τ_{i}} - \sqrt{1 - {\bar{α}}_{τ_{i}}} ϵ_{θ} (z_{τ_{i}}, τ_{i}, c)}{\sqrt{{\bar{α}}_{τ_{i}}}}) + \sqrt{1 - {\bar{α}}_{τ_{i - 1}}} ϵ_{θ} (z_{τ_{i}}, τ_{i}, c)$

This update is deterministic: it is the $η = 0$ case of DDIM, in which no fresh noise is added at each step, so generation is reproducible for a given initial noise sample.
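The update can be read as two stages: estimate the clean latent, then re-noise it to the earlier timestep with the same predicted noise. The scalar sketch below follows that reading; the function name and the toy values are assumptions.

```python
import math

def ddim_step(z_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update for a scalar latent."""
    # Estimate of the clean latent z_0 implied by the noise prediction
    z0_pred = (z_t - math.sqrt(1.0 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)
    # Re-noise that estimate to the earlier timestep using the same predicted noise
    return math.sqrt(alpha_bar_prev) * z0_pred + math.sqrt(1.0 - alpha_bar_prev) * eps_pred

# Toy check with a perfect noise prediction (assumed values).
z0, eps = 0.8, -0.3
alpha_bar_t, alpha_bar_prev = 0.2, 0.6
z_t = math.sqrt(alpha_bar_t) * z0 + math.sqrt(1.0 - alpha_bar_t) * eps
z_prev = ddim_step(z_t, eps, alpha_bar_t, alpha_bar_prev)
expected = math.sqrt(alpha_bar_prev) * z0 + math.sqrt(1.0 - alpha_bar_prev) * eps
```

With an exact noise prediction, `z_prev` matches the latent that the forward process would have produced at the earlier timestep from the same $z_{0}$ and $ϵ$.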

Classifier-Free Guidance

To strengthen text conditioning, classifier-free guidance combines unconditional and conditional noise predictions, extrapolating from the unconditional prediction in the direction of the conditional one:

${\hat{ϵ}}_{θ} (z_{t}, t, c) = ϵ_{θ} (z_{t}, t, \emptyset) + w \cdot (ϵ_{θ} (z_{t}, t, c) - ϵ_{θ} (z_{t}, t, \emptyset))$

where $w > 1$ is the guidance scale (typically 7-8.5), $c$ is the text conditioning, and $\emptyset$ represents unconditional (empty prompt) generation. Higher guidance scales produce images that more closely match the text but with less diversity.
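The effect of the guidance scale is easy to see on toy vectors. The function name `apply_guidance` and the numbers below are illustrative only.

```python
def apply_guidance(noise_uncond, noise_cond, guidance_scale):
    """Combine unconditional and conditional noise predictions elementwise."""
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]

noise_uncond = [0.0, 1.0]
noise_cond = [1.0, 3.0]

same_as_uncond = apply_guidance(noise_uncond, noise_cond, 0.0)   # w = 0
same_as_cond = apply_guidance(noise_uncond, noise_cond, 1.0)     # w = 1
guided = apply_guidance(noise_uncond, noise_cond, 7.5)           # w > 1 extrapolates
# guided == [7.5, 16.0]
```

A scale of 0 ignores the prompt, a scale of 1 gives the plain conditional prediction, and larger values push past it along the same direction.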

Latent Space Encoding/Decoding

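The VAE encoder maps an image $x$ to the parameters of a Gaussian over latents (a mean and a standard deviation), from which a latent is sampled using the reparameterization trick:

$z = μ_{enc} (x) + σ_{enc} (x) ⊙ ϵ, \quad ϵ \sim 𝒩 (0, I)$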
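The sampling step can be sketched on plain lists. The function name and the toy mean and standard deviation values are assumptions; a real encoder would produce them from the image.

```python
import random

def sample_latent(mean, std, rng):
    """Reparameterized sample z = mean + std * eps with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]

rng = random.Random(0)
mean = [0.5, -1.0, 2.0]
z_random = sample_latent(mean, [0.1, 0.1, 0.1], rng)
z_mode = sample_latent(mean, [0.0, 0.0, 0.0], rng)   # zero std returns the mean
```

Writing the sample as a deterministic function of the mean, the standard deviation, and independent noise is what lets gradients flow through the encoder during VAE training.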

During inference (after denoising), the decoder reconstructs the image:

$\hat{x} = Dec (z_{0})$

Complete Inference Pipeline

INPUT: text_prompt, num_steps, guidance_scale

// Text encoding
tokens = tokenize(text_prompt)
text_embeddings = clip_encoder(tokens)
uncond_embeddings = clip_encoder(tokenize(""))

// Initialize from noise
latents = random_normal(shape=[1, 4, 64, 64])
timesteps = scheduler.get_timesteps(num_steps)

// Iterative denoising
FOR t IN timesteps:
    // Classifier-free guidance: predict noise for both
    noise_cond = unet(latents, t, text_embeddings)
    noise_uncond = unet(latents, t, uncond_embeddings)
    noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)

    // Scheduler step
    latents = scheduler.step(noise_pred, t, latents)

// Decode to pixel space (Stable Diffusion v1.x implementations first divide the latents by the VAE scaling factor 0.18215)
image = vae_decoder(latents)
image = postprocess(image)  // Scale to [0, 255], convert to uint8

RETURN image
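The control flow of the pipeline can be exercised end to end with a toy, pure-Python version. Everything here is a stand-in: the "oracle" denoisers are hypothetical replacements for the UNet that already know their target latent, the text encoder and VAE are omitted, and the cosine-shaped schedule is an assumption. It is meant only to show how guidance and DDIM steps fit together in the loop.

```python
import math
import random

def make_oracle(target):
    """Hypothetical UNet stand-in: returns the noise that maps z_t back to `target`."""
    def eps_model(z_t, alpha_bar_t):
        return [(z - math.sqrt(alpha_bar_t) * x) / math.sqrt(1.0 - alpha_bar_t)
                for z, x in zip(z_t, target)]
    return eps_model

def ddim_step(z_t, eps, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update, applied elementwise."""
    out = []
    for z, e in zip(z_t, eps):
        z0 = (z - math.sqrt(1.0 - alpha_bar_t) * e) / math.sqrt(alpha_bar_t)
        out.append(math.sqrt(alpha_bar_prev) * z0 + math.sqrt(1.0 - alpha_bar_prev) * e)
    return out

def generate(cond_model, uncond_model, alpha_bars, timesteps, guidance_scale, rng):
    latents = [rng.gauss(0.0, 1.0) for _ in range(4)]       # start from pure noise
    for i, t in enumerate(timesteps):                        # descending timesteps
        a_t = alpha_bars[t]
        # After the last step no noise remains, so alpha_bar is taken as 1
        a_prev = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        noise_cond = cond_model(latents, a_t)
        noise_uncond = uncond_model(latents, a_t)
        noise_pred = [u + guidance_scale * (c - u)
                      for u, c in zip(noise_uncond, noise_cond)]
        latents = ddim_step(latents, noise_pred, a_t, a_prev)
    return latents

# Toy schedule: alpha_bar decreases from near 1 to near 0 over 100 training steps.
alpha_bars = [math.cos(0.5 * math.pi * (t + 1) / 101) ** 2 for t in range(100)]
timesteps = list(range(95, -1, -5))                           # 20 inference steps
cond_target = [1.0, 1.0, 0.0, 0.0]
uncond_target = [0.0, 0.0, 0.0, 0.0]
result = generate(make_oracle(cond_target), make_oracle(uncond_target),
                  alpha_bars, timesteps, guidance_scale=1.0, rng=random.Random(0))
```

With a guidance scale of 1 the loop lands on the conditional target; with a larger scale it lands on a point extrapolated past that target, away from the unconditional one, mirroring how stronger guidance pushes samples toward the prompt.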

Related Pages

Implementation:LaurentMazare_Tch_rs_Stable_Diffusion_Pipeline
