

Principle:LaurentMazare Tch rs Stable Diffusion

From Leeroopedia
Revision as of 17:47, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/LaurentMazare_Tch_rs_Stable_Diffusion.md)


Knowledge Sources
Domains Generative Models, Diffusion Models, Text-to-Image, Computer Vision
Last Updated 2026-02-08 00:00 GMT

Overview

Latent diffusion models perform iterative denoising in a compressed latent space rather than pixel space, combining a text encoder for conditioning, a variational autoencoder for compression, and a UNet for noise prediction to generate high-resolution images efficiently.

Description

Diffusion models generate data by learning to reverse a gradual noising process. Starting from pure Gaussian noise, the model iteratively removes small amounts of noise until a clean sample emerges. While powerful, running this process directly in pixel space is computationally expensive for high-resolution images.

Stable Diffusion, an instance of Latent Diffusion Models (LDMs), addresses this by operating in a compressed latent space. The architecture consists of four major components:

1. CLIP Text Encoder

The CLIP (Contrastive Language-Image Pre-training) text encoder converts a text prompt into a sequence of embedding vectors. These embeddings capture the semantic content of the prompt and serve as the conditioning signal that guides the diffusion process toward generating images matching the description.

The text is first tokenized into integer token IDs, then processed through a transformer encoder. The output is a sequence of hidden states (one per token), which are injected into the UNet via cross-attention at each denoising step.

2. Variational Autoencoder (VAE)

The VAE provides the bridge between pixel space and latent space:

  • Encoder - Compresses a 512×512×3 RGB image into a 64×64×4 latent representation (a spatial compression factor of 8x in each dimension). This dramatically reduces the dimensionality the diffusion model must handle.
  • Decoder - Reconstructs the full-resolution image from the latent representation after denoising is complete.

The VAE is trained separately and held fixed during diffusion model training. It learns a perceptually meaningful latent space where small changes in latent codes produce small perceptual changes in images.
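
As a quick check of the compression figures quoted above, the arithmetic can be sketched in a few lines. The shapes are the ones given in the text; the helper name `num_elements` is only illustrative.

```python
# Element counts for the shapes quoted above (height, width, channels).
pixel_shape = (512, 512, 3)
latent_shape = (64, 64, 4)

def num_elements(shape):
    """Multiply out the dimensions of a shape tuple."""
    total = 1
    for dim in shape:
        total *= dim
    return total

pixel_count = num_elements(pixel_shape)     # values in the RGB image
latent_count = num_elements(latent_shape)   # values in the latent

# Compression per spatial dimension and over all values.
spatial_factor = pixel_shape[0] // latent_shape[0]
overall_ratio = pixel_count / latent_count

print(pixel_count, latent_count, spatial_factor, overall_ratio)
```

So the 8x reduction per spatial dimension corresponds to a 48x reduction in the number of values the diffusion model has to process.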

3. UNet Denoiser

The UNet is the core of the diffusion model. It predicts the noise component in a noisy latent, conditioned on:

  • The noisy latent $z_t$ at the current timestep.
  • The timestep $t$, encoded as a sinusoidal embedding.
  • The text embeddings from CLIP, incorporated via cross-attention layers.

The UNet follows an encoder-decoder architecture with skip connections, residual blocks, self-attention layers (for spatial coherence), and cross-attention layers (for text conditioning).
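
The sinusoidal timestep embedding mentioned above can be illustrated with a minimal sketch. The frequency spacing, the `max_period` value, and the sin-then-cos ordering are assumptions borrowed from common practice; actual implementations, including the one this page documents, may differ in those details.

```python
import math

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep t into `dim` values.

    Assumes `dim` is even. The first half holds sines and the second
    half cosines, evaluated at geometrically spaced frequencies.
    """
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    sines = [math.sin(t * f) for f in freqs]
    cosines = [math.cos(t * f) for f in freqs]
    return sines + cosines

emb = timestep_embedding(t=10, dim=8)
print(emb)
```

In a real UNet this vector would typically be passed through a small MLP before being added to the residual blocks; that part is omitted here.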

4. Noise Scheduler

The scheduler controls the noise schedule and the sampling process. Two common schedulers are:

  • DDPM (Denoising Diffusion Probabilistic Models) - The original formulation requiring many steps (typically 1000) for high quality.
  • DDIM (Denoising Diffusion Implicit Models) - A non-Markovian variant that allows skipping steps, producing good results in 20-50 steps.
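
To make the idea of skipping steps concrete, here is a small sketch of how a reduced set of sampling timesteps might be chosen. The even stride and the 0-indexed timesteps are assumptions for illustration; real schedulers offer several spacing strategies.

```python
def inference_timesteps(num_train_steps, num_inference_steps):
    """Evenly strided timesteps, ordered from the noisiest to the cleanest."""
    stride = num_train_steps // num_inference_steps
    steps = [i * stride for i in range(num_inference_steps)]
    return list(reversed(steps))

# A model trained with 1000 steps, sampled with only 50.
timesteps = inference_timesteps(1000, 50)
print(timesteps[:3], timesteps[-3:])
```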

Usage

Latent diffusion models are used for:

  • Text-to-image generation - Generating images from natural language descriptions.
  • Image-to-image translation - Modifying existing images guided by text prompts (by starting from a partially-noised encoding of the input image).
  • Inpainting - Filling in masked regions of an image conditioned on surrounding context and text.
  • Super-resolution - Enhancing low-resolution images to higher resolution.
  • Controlled generation - Using techniques like classifier-free guidance to adjust the strength of text conditioning.

Theoretical Basis

Forward Diffusion Process

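The forward process gradually adds Gaussian noise to a data sample $z_0$ over $T$ timesteps according to a variance schedule $\beta_1, \beta_2, \ldots, \beta_T$:

$$q(z_t \mid z_{t-1}) = \mathcal{N}\left(z_t;\ \sqrt{1-\beta_t}\, z_{t-1},\ \beta_t I\right)$$

Using the cumulative product $\bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s)$, any timestep can be sampled directly:

$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$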
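
The closed-form sampling above can be sketched in plain Python. The linear schedule and its endpoint values (1e-4 to 0.02 over 1000 steps) are assumptions taken from common DDPM settings, not necessarily the schedule used by Stable Diffusion; the function names are illustrative.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variance schedule beta_1..beta_T (assumed values)."""
    span = beta_end - beta_start
    return [beta_start + span * i / (num_steps - 1) for i in range(num_steps)]

def cumulative_alpha_bars(betas):
    """abar_t = prod_{s<=t} (1 - beta_s); list entry t-1 holds abar_t."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def q_sample(z0, t, alpha_bars, eps):
    """Draw z_t directly from z_0 with the closed form; t is 1-indexed."""
    abar = alpha_bars[t - 1]
    return [math.sqrt(abar) * z + math.sqrt(1.0 - abar) * e
            for z, e in zip(z0, eps)]

rng = random.Random(0)
betas = linear_betas(1000)
alpha_bars = cumulative_alpha_bars(betas)
z0 = [1.0, -2.0, 0.5, 0.0]
eps = [rng.gauss(0.0, 1.0) for _ in z0]

z_early = q_sample(z0, 1, alpha_bars, eps)     # still close to z0
z_late = q_sample(z0, 1000, alpha_bars, eps)   # almost pure noise
```

With this schedule, $\bar{\alpha}_t$ decays from just under 1 to nearly 0, so early timesteps stay close to the data and late timesteps are dominated by the noise term.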

Reverse Process (Denoising)

The model learns the reverse transition:

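$$p_\theta(z_{t-1} \mid z_t) = \mathcal{N}\left(z_{t-1};\ \mu_\theta(z_t, t),\ \sigma_t^2 I\right)$$

Rather than predicting $\mu_\theta$ directly, the model predicts the noise $\epsilon_\theta(z_t, t, c)$ (where $c$ is the text conditioning), and the mean is computed as:

$$\mu_\theta(z_t, t) = \frac{1}{\sqrt{1-\beta_t}} \left( z_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(z_t, t, c) \right)$$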
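
A minimal sketch of the mean computation follows, with a sanity check at the first timestep. At $t = 1$ we have $\bar{\alpha}_1 = 1 - \beta_1$, so feeding in the exact noise should return $z_0$. The numeric values and the function name are illustrative assumptions.

```python
import math

def reverse_mean(z_t, eps_pred, beta_t, abar_t):
    """Mean of p(z_{t-1} | z_t) computed from a noise prediction."""
    coef = beta_t / math.sqrt(1.0 - abar_t)
    scale = 1.0 / math.sqrt(1.0 - beta_t)
    return [scale * (z - coef * e) for z, e in zip(z_t, eps_pred)]

# Sanity check at t = 1, where abar_1 = 1 - beta_1.
beta_1 = 1e-4
abar_1 = 1.0 - beta_1
z0 = [0.5, -1.0, 2.0]
eps = [0.3, -0.7, 1.1]
z1 = [math.sqrt(abar_1) * a + math.sqrt(1.0 - abar_1) * e
      for a, e in zip(z0, eps)]

# With the exact noise as the "prediction", the mean lands on z0.
mu = reverse_mean(z1, eps, beta_1, abar_1)
```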

Training Objective

The simplified training loss is:

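$$L = \mathbb{E}_{z_0, \epsilon, t}\left[ \left\| \epsilon - \epsilon_\theta(z_t, t, c) \right\|^2 \right]$$

where $t$ is sampled uniformly from $\{1, \ldots, T\}$, $\epsilon \sim \mathcal{N}(0, I)$, and $z_t$ is constructed from $z_0$ and $\epsilon$ using the forward process formula.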
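
The loss for a single training sample can be sketched as below. The two stand-in "models" are hypothetical: one always predicts zero, the other is an oracle that knows $z_0$ and recovers the noise exactly. A real model would be a neural network, and the expectation would be approximated by averaging over a batch.

```python
import math
import random

def simple_loss(z0, eps, abar_t, t, c, eps_model):
    """Squared error between true and predicted noise for one sample."""
    z_t = [math.sqrt(abar_t) * a + math.sqrt(1.0 - abar_t) * e
           for a, e in zip(z0, eps)]
    pred = eps_model(z_t, t, c)
    return sum((e - p) ** 2 for e, p in zip(eps, pred))

rng = random.Random(0)
z0 = [rng.gauss(0.0, 1.0) for _ in range(4)]
eps = [rng.gauss(0.0, 1.0) for _ in range(4)]
abar_t = 0.5  # illustrative value of the cumulative product at step t

def zero_model(z_t, t, c):
    """Hypothetical model that always predicts zero noise."""
    return [0.0] * len(z_t)

def oracle_model(z_t, t, c):
    """Hypothetical model that inverts the forward formula using z0."""
    return [(z - math.sqrt(abar_t) * a) / math.sqrt(1.0 - abar_t)
            for z, a in zip(z_t, z0)]

loss_zero = simple_loss(z0, eps, abar_t, 500, None, zero_model)
loss_oracle = simple_loss(z0, eps, abar_t, 500, None, oracle_model)
```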

DDIM Sampling

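DDIM allows deterministic sampling with fewer steps. Given a subsequence of timesteps $\tau_1 < \tau_2 < \cdots < \tau_S$ where $S \le T$:

$$z_{\tau_{i-1}} = \sqrt{\bar{\alpha}_{\tau_{i-1}}} \left( \frac{z_{\tau_i} - \sqrt{1-\bar{\alpha}_{\tau_i}}\, \epsilon_\theta(z_{\tau_i}, \tau_i, c)}{\sqrt{\bar{\alpha}_{\tau_i}}} \right) + \sqrt{1-\bar{\alpha}_{\tau_{i-1}}}\, \epsilon_\theta(z_{\tau_i}, \tau_i, c)$$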

This is deterministic (no added noise), making the generation process reproducible for a given initial noise sample.
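
A single DDIM update can be sketched as follows. The check uses the exact noise in place of a model prediction, in which case the step should land on the forward-process sample at the earlier timestep. The $\bar{\alpha}$ values and the function name are illustrative assumptions.

```python
import math

def ddim_step(z_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update from timestep tau_i to tau_{i-1}."""
    # Estimate of the clean latent implied by the noise prediction.
    z0_pred = [(z - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
               for z, e in zip(z_t, eps_pred)]
    # Re-noise that estimate to the earlier (less noisy) timestep.
    return [math.sqrt(abar_prev) * z0 + math.sqrt(1.0 - abar_prev) * e
            for z0, e in zip(z0_pred, eps_pred)]

z0 = [0.5, -1.0]
eps = [0.2, -0.4]
abar_t, abar_prev = 0.3, 0.6   # abar grows as the timestep decreases

z_t = [math.sqrt(abar_t) * a + math.sqrt(1.0 - abar_t) * e
       for a, e in zip(z0, eps)]
z_prev = ddim_step(z_t, eps, abar_t, abar_prev)
expected = [math.sqrt(abar_prev) * a + math.sqrt(1.0 - abar_prev) * e
            for a, e in zip(z0, eps)]
```

Because no random term appears in the update, calling `ddim_step` twice with the same inputs gives identical outputs.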

Classifier-Free Guidance

To strengthen text conditioning, classifier-free guidance combines unconditional and conditional noise predictions, extrapolating from the unconditional prediction in the direction of the conditional one:

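$$\hat{\epsilon}_\theta(z_t, t, c) = \epsilon_\theta(z_t, t, \varnothing) + w \left( \epsilon_\theta(z_t, t, c) - \epsilon_\theta(z_t, t, \varnothing) \right)$$

where $w > 1$ is the guidance scale (typically 7 to 8.5), $c$ is the text conditioning, and $\varnothing$ represents unconditional (empty prompt) generation. Higher guidance scales produce images that more closely match the text but with less diversity.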
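
The guidance formula reduces to a one-line elementwise combination. This sketch uses plain lists in place of tensors; the function name and the sample values are illustrative.

```python
def classifier_free_guidance(eps_uncond, eps_cond, w):
    """Combine unconditional and conditional noise predictions."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]

guided = classifier_free_guidance(eps_uncond, eps_cond, 7.5)
print(guided)
```

Setting `w` to 1 returns the conditional prediction unchanged and setting it to 0 returns the unconditional one, so values above 1 push past the conditional prediction.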

Latent Space Encoding/Decoding

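The VAE encoder maps an image $x$ to latent parameters:

$$z = \mu_{\text{enc}}(x) + \sigma_{\text{enc}}(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

During inference (after denoising), the decoder reconstructs the image:

$$\hat{x} = \text{Dec}(z_0)$$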
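
The sampling step of the encoder (the reparameterization above) can be sketched as below. The mean and standard deviation values stand in for encoder outputs and are purely illustrative, as is the function name.

```python
import random

def sample_latent(mu, sigma, rng):
    """Draw z = mu + sigma * eps elementwise, with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

mu = [0.2, -0.5, 1.0]        # stand-in for the encoder's predicted mean
sigma = [0.1, 0.3, 0.05]     # stand-in for the encoder's predicted std

z_a = sample_latent(mu, sigma, random.Random(42))
z_b = sample_latent(mu, sigma, random.Random(42))
z_mean = sample_latent(mu, [0.0, 0.0, 0.0], random.Random(42))
```

Seeding the generator makes the draw reproducible, and a zero standard deviation collapses the sample onto the mean.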

Complete Inference Pipeline

INPUT: text_prompt, num_steps, guidance_scale
// Text encoding
tokens = tokenize(text_prompt)
text_embeddings = clip_encoder(tokens)
uncond_embeddings = clip_encoder(tokenize(""))
// Initialize from noise
latents = random_normal(shape=[1, 4, 64, 64])
timesteps = scheduler.get_timesteps(num_steps)
// Iterative denoising
FOR t IN timesteps:
    // Classifier-free guidance: predict noise for both
    noise_cond = unet(latents, t, text_embeddings)
    noise_uncond = unet(latents, t, uncond_embeddings)
    noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
    // Scheduler step
    latents = scheduler.step(noise_pred, t, latents)
// Decode to pixel space
image = vae_decoder(latents)
image = postprocess(image)  // Scale to [0, 255], convert to uint8
RETURN image
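
The final `postprocess` step in the pseudocode above can be sketched as follows. It assumes the decoder emits values nominally in [-1, 1], which is a common convention but is not stated in this page; the function operates on a flat list rather than an image tensor.

```python
def postprocess(values):
    """Map decoder outputs from an assumed [-1, 1] range to 0..255 integers."""
    out = []
    for v in values:
        unit = min(max(v / 2.0 + 0.5, 0.0), 1.0)  # shift to [0, 1] and clamp
        out.append(int(round(unit * 255.0)))
    return out

pixels = postprocess([-1.0, 1.0, 2.0, -3.0, 0.2])
print(pixels)
```

Values outside the assumed range are clamped rather than wrapped, which avoids overflow artifacts when converting to 8-bit pixels.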
