Implementation:AUTOMATIC1111 Stable diffusion webui SD3 Implementations

Knowledge Sources	AUTOMATIC1111_Stable_diffusion_webui
Domains	Diffusion Models, VAE, Stable Diffusion 3, Sampling
Last Updated	2025-05-15 00:00 GMT

Overview

Implements the core SD3 diffusion model wrapper, VAE (encoder/decoder), latent format handling, CFG denoiser, and Euler sampling for Stable Diffusion 3's discrete flow matching framework.

Description

This module provides the core of the SD3 diffusion pipeline (model wrapper, guidance, sampling, latent format, and VAE) through the following components:

Model Wrapping:

ModelSamplingDiscreteFlow: Handles sigma/timestep scheduling for discrete flow matching models. Converts between timesteps and sigmas using a configurable shift parameter, and provides methods for calculating denoised outputs and noise scaling.
BaseModel: Wraps the MM-DiT backbone, automatically inferring model configuration (patch size, depth, num_patches, adm_in_channels) from state_dict tensor shapes. Routes forward calls through the diffusion model with proper dtype casting.
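The sigma/timestep handling described for ModelSamplingDiscreteFlow can be sketched with plain floats. This is a minimal illustration rather than the module's code: it assumes the commonly used shifted schedule sigma = shift * t / (1 + (shift - 1) * t) with 1000 training timesteps, and the function names below are hypothetical stand-ins for the class methods, which operate on tensors.

```python
def sigma_from_timestep(timestep, shift=1.0, num_timesteps=1000):
    """Map a timestep in [0, num_timesteps] to a sigma in [0, 1].

    Assumption: the shift formula below is the usual discrete-flow
    schedule; verify it against sd3_impls.py before relying on it.
    """
    t = timestep / num_timesteps
    if shift == 1.0:
        return t
    return shift * t / (1.0 + (shift - 1.0) * t)


def noise_scaling(sigma, noise, latent):
    """Blend noise and clean latent: sigma=1 is pure noise, sigma=0 is clean."""
    return sigma * noise + (1.0 - sigma) * latent


def calculate_denoised(sigma, model_output, model_input):
    """Recover the clean-latent estimate from a velocity-style model output."""
    return model_input - model_output * sigma


# With shift=3.0, the midpoint timestep maps to a higher noise level.
mid_sigma = sigma_from_timestep(500, shift=3.0)
```

With a larger shift, more of the schedule is spent at high noise levels, which is why the same timestep yields a larger sigma than in the unshifted case.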

CFG and Sampling:

CFGDenoiser: Applies classifier-free guidance by batching conditional and unconditional model passes together and computing the scaled difference.
sample_euler: Implements the Euler sampling method (Algorithm 2 from Karras et al. 2022) for discrete flow matching, with float16 autocast.
append_dims and to_d: Utility functions for Karras ODE derivative computation.
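The guidance and sampling steps listed above can be illustrated on plain Python lists. This is a sketch under stated assumptions, not the module's implementation: the real CFGDenoiser batches both passes through the model and works on tensors under autocast, and toy_denoiser is a hypothetical stand-in for the wrapped model. The function names mirror the ones described here but take simplified arguments.

```python
def cfg_combine(cond_out, uncond_out, cond_scale):
    """Classifier-free guidance: push the unconditional output toward the
    conditional one by cond_scale times their difference."""
    return [u + (c - u) * cond_scale for c, u in zip(cond_out, uncond_out)]


def to_d(x, sigma, denoised):
    """Karras ODE derivative: direction from the denoised estimate to x."""
    return [(xi - di) / sigma for xi, di in zip(x, denoised)]


def sample_euler(denoise_fn, x, sigmas):
    """Plain Euler steps along a decreasing sigma schedule."""
    for i in range(len(sigmas) - 1):
        denoised = denoise_fn(x, sigmas[i])
        d = to_d(x, sigmas[i], denoised)
        dt = sigmas[i + 1] - sigmas[i]  # negative: noise level decreases
        x = [xi + di * dt for xi, di in zip(x, d)]
    return x


# Hypothetical denoiser that always predicts the same clean sample.
TARGET = [2.0, -1.0]


def toy_denoiser(x, sigma):
    return list(TARGET)


result = sample_euler(toy_denoiser, [0.3, 0.7], [1.0, 0.75, 0.5, 0.25, 0.0])
```

Because the final step moves from the last nonzero sigma to zero, the sample lands on the denoiser's last prediction, so result matches TARGET up to rounding in this toy setup.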

Latent Format:

SD3LatentFormat: Handles the scale/shift correction for SD3's 16-channel latent space (scale_factor=1.5305, shift_factor=0.0609). Includes an RGB preview decoder using a fixed 16x3 factor matrix for quick latent visualization.
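The scale/shift correction can be sketched as a pair of inverse functions. The two constants come from the description above; the order of operations (subtract the shift, then multiply by the scale on the way in) is an assumption to check against the source. The preview helper uses a placeholder 16x3 matrix, since the actual factor values are not reproduced here.

```python
SCALE_FACTOR = 1.5305
SHIFT_FACTOR = 0.0609


def process_in(latent):
    """VAE-space latent -> model-space latent (assumed order: shift, then scale)."""
    return [(v - SHIFT_FACTOR) * SCALE_FACTOR for v in latent]


def process_out(latent):
    """Model-space latent -> VAE-space latent (inverse of process_in)."""
    return [v / SCALE_FACTOR + SHIFT_FACTOR for v in latent]


def latent_pixel_to_rgb(channels, factors):
    """Project the 16 latent channels of one pixel to RGB with a 16x3 matrix."""
    return [sum(c * row[k] for c, row in zip(channels, factors)) for k in range(3)]


# Placeholder factors for illustration only; NOT the values used by the module.
PLACEHOLDER_FACTORS = [[0.01 * (i + 1), 0.0, -0.01 * (i + 1)] for i in range(16)]

round_trip = process_out(process_in([0.5, -1.25, 3.0]))
preview_rgb = latent_pixel_to_rgb([1.0] * 16, PLACEHOLDER_FACTORS)
```

The linear projection is only a cheap approximation for previews; a faithful image still requires a full VAE decode.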

VAE:

ResnetBlock, AttnBlock, Downsample, Upsample: Building blocks for the VAE architecture.
VAEEncoder: Convolutional encoder with downsampling blocks and mid-block attention that produces the mean and log-variance (logvar) of the 16-channel latent distribution.
VAEDecoder: Mirror architecture with upsampling blocks for decoding latents back to pixel space.
SDVAE: Top-level VAE combining encoder and decoder with float16 autocast for both encode and decode operations.
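How an encoder's mean and logvar become a sampled latent can be sketched with the standard reparameterization step. This is an illustration under assumptions, not the module's code: the clamp range of [-30, 20] is a convention borrowed from common latent-diffusion VAE code, and sample_latent is a hypothetical helper that works on flat lists rather than tensors.

```python
import math
import random


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def sample_latent(mean, logvar, rng):
    """Draw latent = mean + std * eps, with std = exp(0.5 * logvar)."""
    sampled = []
    for m, lv in zip(mean, logvar):
        lv = clamp(lv, -30.0, 20.0)  # assumed range; keeps exp() well behaved
        std = math.exp(0.5 * lv)
        sampled.append(m + std * rng.gauss(0.0, 1.0))
    return sampled


# A seeded generator makes the draw reproducible.
rng = random.Random(0)
latent = sample_latent([1.0, -2.0], [-30.0, -30.0], rng)
```

With a very negative logvar the standard deviation is close to zero, so the sampled latent is effectively the mean; larger logvar values add proportionally more noise.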

Usage

Use this module as the core diffusion and VAE infrastructure for Stable Diffusion 3. The BaseModel wraps the MM-DiT for denoising, the VAE handles encoding/decoding between pixel and latent space, and the sampling utilities provide the generation loop.

Code Reference

Source Location

Repository: AUTOMATIC1111_Stable_diffusion_webui
File: modules/models/sd3/sd3_impls.py
Lines: 1-374

Signature

class BaseModel(torch.nn.Module):
    def __init__(self, shift=1.0, device=None, dtype=torch.float32,
                 state_dict=None, prefix=""):
    def apply_model(self, x, sigma, c_crossattn=None, y=None):

class CFGDenoiser(torch.nn.Module):
    def __init__(self, model):
    def forward(self, x, timestep, cond, uncond, cond_scale):

class SDVAE(torch.nn.Module):
    def __init__(self, dtype=torch.float32, device=None):
    def decode(self, latent):
    def encode(self, image):

class ModelSamplingDiscreteFlow(torch.nn.Module):
    def __init__(self, shift=1.0):
    def sigma(self, timestep: torch.Tensor):
    def calculate_denoised(self, sigma, model_output, model_input):

class SD3LatentFormat:
    def __init__(self):
    def process_in(self, latent):
    def process_out(self, latent):
    def decode_latent_to_preview(self, x0):

Import

from modules.models.sd3.sd3_impls import BaseModel, CFGDenoiser, SDVAE, SD3LatentFormat

I/O Contract

Inputs

Name	Type	Required	Description
x	torch.Tensor	Yes	Noisy latent tensor (N, 16, H, W)
sigma	torch.Tensor	Yes	Current noise level sigma for the denoising step
c_crossattn	torch.Tensor	No	Cross-attention conditioning from text encoders
y	torch.Tensor	No	Pooled vector conditioning
image	torch.Tensor	No	Pixel-space image for VAE encoding (N, 3, H, W)
latent	torch.Tensor	No	Latent tensor for VAE decoding (N, 16, H, W)

Outputs

Name	Type	Description
denoised	torch.Tensor	Denoised latent prediction from the model
decoded	torch.Tensor	Pixel-space image output from VAE decoder (N, 3, H, W)
encoded	torch.Tensor	Sampled latent encoding from VAE encoder (N, 16, H, W)

Usage Examples

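import torch

from modules.models.sd3.sd3_impls import BaseModel, CFGDenoiser, SDVAE, SD3LatentFormat

# Assumes state_dict, x_noisy, timestep, cond_dict, uncond_dict, image_tensor,
# raw_latent, and model_latent have already been prepared by the caller.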

# Create the base model from a state dict
base_model = BaseModel(shift=3.0, device="cuda", dtype=torch.float16,
                       state_dict=state_dict, prefix="model.diffusion_model.")

# Create CFG denoiser wrapper
cfg_model = CFGDenoiser(base_model)

# Apply CFG denoising
denoised = cfg_model(x_noisy, timestep, cond=cond_dict, uncond=uncond_dict,
                     cond_scale=7.0)

# VAE encode/decode
vae = SDVAE(dtype=torch.float32, device="cuda")
latent = vae.encode(image_tensor)
reconstructed = vae.decode(latent)

# Latent format correction
fmt = SD3LatentFormat()
latent_in = fmt.process_in(raw_latent)
latent_out = fmt.process_out(model_latent)

Related Pages

Principle:AUTOMATIC1111_Stable_diffusion_webui_SD3_Diffusion_Pipeline
