Implementation:AUTOMATIC1111 Stable diffusion webui SD3 Implementations
| Knowledge Sources | |
|---|---|
| Domains | Diffusion Models, VAE, Stable Diffusion 3, Sampling |
| Last Updated | 2025-05-15 00:00 GMT |
Overview
Implements the core SD3 diffusion model wrapper, VAE (encoder/decoder), latent format handling, CFG denoiser, and Euler sampling for Stable Diffusion 3's discrete flow matching framework.
Description
This module provides the complete SD3 diffusion pipeline through the following components:
Model Wrapping:
- ModelSamplingDiscreteFlow: Handles sigma/timestep scheduling for discrete flow matching models. Converts between timesteps and sigmas using a configurable shift parameter, and provides methods for calculating denoised outputs and noise scaling.
- BaseModel: Wraps the MM-DiT backbone, automatically inferring model configuration (patch size, depth, num_patches, adm_in_channels) from state_dict tensor shapes. Routes forward calls through the diffusion model with proper dtype casting.
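The timestep/sigma conversion and the denoised-output calculation can be illustrated with plain Python scalars. This is a minimal sketch, not the module's tensor code: the shift formula, the 1000-step timestep range, and the function names `sigma_from_timestep`, `calculate_denoised`, and `noise_scaling` as free functions are assumptions based on how shifted flow-matching schedules are commonly written.

```python
def sigma_from_timestep(timestep: float, shift: float = 1.0,
                        num_timesteps: int = 1000) -> float:
    """Map a discrete timestep to a sigma in [0, 1] (assumed shift formula)."""
    t = timestep / num_timesteps
    if shift == 1.0:
        return t
    # A shift > 1 pushes sigmas toward the noisy end of the schedule.
    return shift * t / (1.0 + (shift - 1.0) * t)


def noise_scaling(sigma: float, noise: float, latent: float) -> float:
    """Blend clean latent and noise linearly (assumed flow-matching form)."""
    return sigma * noise + (1.0 - sigma) * latent


def calculate_denoised(sigma: float, model_output: float,
                       model_input: float) -> float:
    """Recover the clean estimate, treating model_output as a velocity."""
    return model_input - model_output * sigma


sigma = sigma_from_timestep(500, shift=3.0)
noisy = noise_scaling(sigma, noise=-1.0, latent=2.0)
# An ideal velocity prediction equals (noise - latent).
denoised = calculate_denoised(sigma, model_output=-1.0 - 2.0, model_input=noisy)
```

Under these assumptions, a perfect velocity prediction recovers the clean latent exactly, which is why the denoised estimate is a simple linear expression in sigma.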
CFG and Sampling:
- CFGDenoiser: Applies classifier-free guidance by batching conditional and unconditional model passes together and computing the scaled difference.
- sample_euler: Implements the Euler sampling method (Algorithm 2 from Karras et al. 2022) for discrete flow matching, with float16 autocast.
- append_dims and to_d: Utility functions for Karras ODE derivative computation.
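The guidance formula and the Euler loop can be sketched with scalars instead of tensors. This is an illustration under stated assumptions, not the module's code: `cfg_combine` and `perfect_denoiser` are hypothetical names, the batching and autocast are omitted, and the derivative follows the usual Karras form `(x - denoised) / sigma`.

```python
def cfg_combine(cond_out: float, uncond_out: float, cond_scale: float) -> float:
    """Classifier-free guidance: move from uncond toward cond by cond_scale."""
    return uncond_out + (cond_out - uncond_out) * cond_scale


def to_d(x: float, sigma: float, denoised: float) -> float:
    """ODE derivative dx/dsigma for a given denoised estimate."""
    return (x - denoised) / sigma


def sample_euler(denoise_fn, x: float, sigmas: list) -> float:
    """Plain Euler integration over a decreasing sigma schedule."""
    for i in range(len(sigmas) - 1):
        sigma = sigmas[i]
        denoised = denoise_fn(x, sigma)
        d = to_d(x, sigma, denoised)
        dt = sigmas[i + 1] - sigma  # negative, since sigmas decrease
        x = x + d * dt
    return x


def perfect_denoiser(x: float, sigma: float) -> float:
    """Hypothetical stand-in for the model: always predicts the target 2.0."""
    return 2.0


result = sample_euler(perfect_denoiser, x=5.0, sigmas=[1.0, 0.5, 0.0])
guided = cfg_combine(cond_out=1.0, uncond_out=0.0, cond_scale=7.0)
```

The final sigma of 0.0 is only ever a step target, never a divisor, so the loop does not divide by zero. With `cond_scale=1.0` the guidance reduces to the conditional output.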
Latent Format:
SD3LatentFormat: Handles the scale/shift correction for SD3's 16-channel latent space (scale_factor=1.5305, shift_factor=0.0609). Includes an RGB preview decoder using a fixed 16x3 factor matrix for quick latent visualization.
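A scalar sketch of the scale/shift correction follows. The two constants come from the description above; the direction of each operation (subtract the shift and multiply on the way in, invert on the way out) and the `latent_to_rgb` helper with its placeholder factor matrix are assumptions for illustration, not the module's actual preview factors.

```python
SCALE_FACTOR = 1.5305
SHIFT_FACTOR = 0.0609


def process_in(latent: float) -> float:
    """VAE latent -> model latent (assumed: remove shift, then scale)."""
    return (latent - SHIFT_FACTOR) * SCALE_FACTOR


def process_out(latent: float) -> float:
    """Model latent -> VAE latent (assumed inverse of process_in)."""
    return latent / SCALE_FACTOR + SHIFT_FACTOR


def latent_to_rgb(pixel: list, factors: list) -> list:
    """Project one 16-channel latent pixel to RGB with a 16x3 matrix."""
    return [sum(pixel[c] * factors[c][k] for c in range(16)) for k in range(3)]


# Placeholder 16x3 matrix: only the first three channels contribute.
placeholder_factors = [[1.0 if c == k else 0.0 for k in range(3)]
                       for c in range(16)]
rgb = latent_to_rgb([0.1 * c for c in range(16)], placeholder_factors)
roundtrip = process_out(process_in(0.5))
```

Because the preview is a per-pixel linear projection, it avoids a full VAE decode and is cheap enough to run on intermediate sampling steps.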
VAE:
- ResnetBlock, AttnBlock, Downsample, Upsample: Building blocks for the VAE architecture.
- VAEEncoder: Convolutional encoder with downsampling blocks and mid-block attention, producing mean and logvar for the 16-channel latent space.
- VAEDecoder: Mirror architecture with upsampling blocks for decoding latents back to pixel space.
- SDVAE: Top-level VAE combining encoder and decoder, with float16 autocast for both encode and decode operations.
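Since the encoder produces a mean and a logvar and the encode path returns a sampled latent, the sampling step can be sketched with the standard VAE reparameterization. This is a scalar illustration under that assumption. `sample_latent` is a hypothetical helper, and the module may compute this step differently.

```python
import math
import random


def sample_latent(mean: float, logvar: float, rng: random.Random) -> float:
    """Draw z ~ N(mean, exp(logvar)) via the reparameterization trick."""
    std = math.exp(0.5 * logvar)
    return mean + std * rng.gauss(0.0, 1.0)


rng = random.Random(0)
# A very negative logvar gives a near-zero std, so the sample collapses to the mean.
z_tight = sample_latent(mean=0.25, logvar=-60.0, rng=rng)
# A logvar of 0.0 corresponds to unit variance around the mean.
z_wide = sample_latent(mean=0.25, logvar=0.0, rng=rng)
```

Seeding the generator makes the draw reproducible, which helps when comparing encode outputs across runs.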
Usage
Use this module as the core diffusion and VAE infrastructure for Stable Diffusion 3. The BaseModel wraps the MM-DiT for denoising, the VAE handles encoding/decoding between pixel and latent space, and the sampling utilities provide the generation loop.
Code Reference
Source Location
- Repository: AUTOMATIC1111_Stable_diffusion_webui
- File: modules/models/sd3/sd3_impls.py
- Lines: 1-374
Signature
class BaseModel(torch.nn.Module):
    def __init__(self, shift=1.0, device=None, dtype=torch.float32,
                 state_dict=None, prefix=""):
    def apply_model(self, x, sigma, c_crossattn=None, y=None):

class CFGDenoiser(torch.nn.Module):
    def __init__(self, model):
    def forward(self, x, timestep, cond, uncond, cond_scale):

class SDVAE(torch.nn.Module):
    def __init__(self, dtype=torch.float32, device=None):
    def decode(self, latent):
    def encode(self, image):

class ModelSamplingDiscreteFlow(torch.nn.Module):
    def __init__(self, shift=1.0):
    def sigma(self, timestep: torch.Tensor):
    def calculate_denoised(self, sigma, model_output, model_input):

class SD3LatentFormat:
    def __init__(self):
    def process_in(self, latent):
    def process_out(self, latent):
    def decode_latent_to_preview(self, x0):
Import
from modules.models.sd3.sd3_impls import BaseModel, CFGDenoiser, SDVAE, SD3LatentFormat
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| x | torch.Tensor | Yes | Noisy latent tensor (N, 16, H, W) |
| sigma | torch.Tensor | Yes | Current noise level sigma for the denoising step |
| c_crossattn | torch.Tensor | No | Cross-attention conditioning from text encoders |
| y | torch.Tensor | No | Pooled vector conditioning |
| image | torch.Tensor | No | Pixel-space image for VAE encoding (N, 3, H, W) |
| latent | torch.Tensor | No | Latent tensor for VAE decoding (N, 16, H, W) |
Outputs
| Name | Type | Description |
|---|---|---|
| denoised | torch.Tensor | Denoised latent prediction from the model |
| decoded | torch.Tensor | Pixel-space image output from VAE decoder (N, 3, H, W) |
| encoded | torch.Tensor | Sampled latent encoding from VAE encoder (N, 16, H, W) |
Usage Examples
import torch

from modules.models.sd3.sd3_impls import BaseModel, CFGDenoiser, SDVAE, SD3LatentFormat

# state_dict, x_noisy, timestep, cond_dict, uncond_dict, image_tensor,
# raw_latent and model_latent are assumed to be prepared by the caller.

# Create the base model from a state dict
base_model = BaseModel(shift=3.0, device="cuda", dtype=torch.float16,
                       state_dict=state_dict, prefix="model.diffusion_model.")

# Create CFG denoiser wrapper
cfg_model = CFGDenoiser(base_model)

# Apply CFG denoising
denoised = cfg_model(x_noisy, timestep, cond=cond_dict, uncond=uncond_dict,
                     cond_scale=7.0)

# VAE encode/decode
vae = SDVAE(dtype=torch.float32, device="cuda")
latent = vae.encode(image_tensor)
reconstructed = vae.decode(latent)

# Latent format correction
fmt = SD3LatentFormat()
latent_in = fmt.process_in(raw_latent)
latent_out = fmt.process_out(model_latent)