Implementation:AUTOMATIC1111 Stable diffusion webui SD3 Conditioning
| Knowledge Sources | |
|---|---|
| Domains | Text Conditioning, Stable Diffusion 3, Model Loading |
| Last Updated | 2025-05-15 00:00 GMT |
Overview
Implements the SD3 text conditioning pipeline that combines CLIP-L, CLIP-G, and T5-XXL text encoders to produce the dual conditioning signals (cross-attention context and pooled vector) required by the MM-DiT denoiser.
Description
This module orchestrates the triple text encoder architecture for Stable Diffusion 3:
- SafetensorsMapping: A helper class implementing typing.Mapping to lazily read tensors from safetensors files, enabling efficient weight loading without materializing the entire file into memory.
- Sd3ClipLG: Wraps CLIP-L and CLIP-G encoders into a unified interface extending sd_hijack_clip.TextConditionalModel. It tokenizes prompts, runs both CLIP models, concatenates their hidden states (padding to 4096 dimensions for cross-attention), and concatenates their pooled outputs for the vector condition. Handles end-of-sequence token masking for CLIP-G.
- Sd3T5: Wraps the T5-XXL encoder with prompt attention weight parsing support. Falls back to zero tensors when T5 is disabled via shared.opts.sd3_enable_t5. Supports emphasis/weight syntax through prompt_parser.parse_prompt_attention.
- SD3Cond: The top-level conditioning module that initializes all three text encoders, manages automatic downloading of missing encoder weights from HuggingFace URLs, and orchestrates the forward pass. Its forward method produces a dictionary with 'crossattn' (concatenated LG+T5 tokens) and 'vector' (pooled CLIP output) keys. Also provides before_load_weights to handle separate encoder weight loading and medvram_modules for memory optimization.
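The end-of-sequence masking mentioned for Sd3ClipLG can be pictured with a small list-based sketch. It assumes that masking means replacing every token after the first end-of-sequence token with a fill value before the ids are handed to CLIP-G; the helper name mask_after_eos and the fill value 0 are hypothetical choices for illustration, not taken from the module.

```python
def mask_after_eos(tokens, id_end, fill=0):
    """Return a copy of each row with every token after the first EOS set to `fill`."""
    masked = []
    for row in tokens:
        row = list(row)  # copy so the caller's token ids stay untouched
        if id_end in row:
            first_eos = row.index(id_end)
            row[first_eos + 1:] = [fill] * (len(row) - first_eos - 1)
        masked.append(row)
    return masked


# Illustrative ids: 49406 and 49407 are the usual CLIP start and end-of-text tokens.
batch = [[49406, 320, 1125, 49407, 49407, 49407]]
clip_g_tokens = mask_after_eos(batch, id_end=49407)
```

Working on a copy matters here because, under this reading, the CLIP-L branch would still receive the original, unmasked token ids.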
Encoder weight URLs point to the AUTOMATIC1111 HuggingFace repository for clip_l, clip_g, and t5xxl_fp16 safetensors files.
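The lazy-reading idea behind SafetensorsMapping can be sketched without any tensor library. The class below is a minimal illustration, not the module's code: LazyTensorMapping and FakeTensorFile are hypothetical names, and the only assumption about the wrapped file object is that it exposes keys() and get_tensor(name), as handles returned by safetensors.safe_open do.

```python
from collections.abc import Mapping


class LazyTensorMapping(Mapping):
    """Read-only mapping that fetches each tensor from `file` only when indexed."""

    def __init__(self, file):
        self.file = file  # any object exposing keys() and get_tensor(name)

    def __len__(self):
        return len(self.file.keys())

    def __iter__(self):
        return iter(self.file.keys())

    def __getitem__(self, key):
        return self.file.get_tensor(key)


class FakeTensorFile:
    """Hypothetical stand-in for an open safetensors file; records which tensors were read."""

    def __init__(self, tensors):
        self._tensors = tensors
        self.reads = []

    def keys(self):
        return list(self._tensors)

    def get_tensor(self, name):
        self.reads.append(name)
        return self._tensors[name]


fake = FakeTensorFile({"a.weight": [1.0, 2.0], "b.weight": [3.0]})
weights = LazyTensorMapping(fake)
names = list(weights)        # listing keys reads no tensor data
first = weights["a.weight"]  # only this tensor is fetched
```

Because the wrapper satisfies the Mapping interface, it can be passed wherever a state dict is expected, and each tensor is pulled from disk only when the consumer asks for it.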
Usage
Use this module as the text conditioning component for Stable Diffusion 3. It is instantiated as part of the SD3 model pipeline and called during each generation to convert text prompts into the conditioning tensors consumed by the MM-DiT denoiser.
Code Reference
Source Location
- Repository: AUTOMATIC1111_Stable_diffusion_webui
- File: modules/models/sd3/sd3_cond.py
- Lines: 1-222
Signature
class SD3Cond(torch.nn.Module):
    def __init__(self, *args, **kwargs):
    def forward(self, prompts: list[str]):
    def before_load_weights(self, state_dict):
    def medvram_modules(self):
    def get_token_count(self, text):
    def get_target_prompt_token_count(self, token_count):

class Sd3ClipLG(sd_hijack_clip.TextConditionalModel):
    def __init__(self, clip_l, clip_g):
    def encode_with_transformers(self, tokens):

class Sd3T5(torch.nn.Module):
    def __init__(self, t5xxl):
    def forward(self, texts, *, token_count):
Import
from modules.models.sd3.sd3_cond import SD3Cond
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| prompts | list[str] | Yes | List of text prompt strings to encode |
Outputs
| Name | Type | Description |
|---|---|---|
| result | dict | Dictionary with 'crossattn' (torch.Tensor of concatenated LG+T5 encodings) and 'vector' (torch.Tensor of pooled CLIP output) |
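The shape arithmetic behind these two outputs can be worked through in a few lines. This is a sketch under stated assumptions: 768 and 1280 are the usual CLIP-L and CLIP-G hidden sizes, the T5 branch is assumed to emit the same number of tokens as the CLIP branch (zeros of that shape when T5 is disabled), and the two branches are assumed to be joined along the token axis. The function name sd3_cond_shapes is hypothetical.

```python
CLIP_L_DIM = 768    # CLIP-L hidden size
CLIP_G_DIM = 1280   # CLIP-G hidden size
CONTEXT_DIM = 4096  # width the CLIP features are padded to


def sd3_cond_shapes(batch, n_tokens):
    """Return the expected shapes of the two conditioning tensors."""
    lg_dim = CLIP_L_DIM + CLIP_G_DIM     # 2048 after feature-wise concatenation
    pad = CONTEXT_DIM - lg_dim           # 2048 padding features appended
    lg = (batch, n_tokens, lg_dim + pad)
    t5 = (batch, n_tokens, CONTEXT_DIM)  # assumed shape, with or without T5 enabled
    crossattn = (batch, lg[1] + t5[1], CONTEXT_DIM)  # joined along the token axis
    vector = (batch, CLIP_L_DIM + CLIP_G_DIM)        # pooled CLIP-L + CLIP-G
    return {"crossattn": crossattn, "vector": vector}


shapes = sd3_cond_shapes(batch=1, n_tokens=77)
```

Under these assumptions the cross-attention context carries twice as many tokens as the CLIP branch alone, while the pooled vector stays at 2048 features regardless of prompt length.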
Usage Examples
from modules.models.sd3.sd3_cond import SD3Cond

# SD3Cond is typically instantiated as part of the SD3 model loading pipeline,
# inside a running webui process (it depends on webui state such as shared.opts)
cond_module = SD3Cond()

# Before loading model weights, handle separate encoder downloads;
# model_state_dict is the state dict of the checkpoint being loaded
cond_module.before_load_weights(model_state_dict)

# Encode prompts into conditioning tensors
conditioning = cond_module(["a photo of a sunset over the ocean"])
# conditioning['crossattn'] -> (1, token_count, 4096) cross-attention context
# conditioning['vector'] -> (1, 2048) pooled CLIP vector

# Get token count for a prompt
token_count = cond_module.get_token_count("a photo of a sunset")
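The decision that before_load_weights has to make, namely which encoders are absent from the checkpoint and must be loaded separately, can be sketched as a plain key scan. Everything specific here is an assumption for illustration: the key prefixes, the helper name missing_encoders, and the rule that a disabled T5 is never fetched are not taken from the module's source.

```python
# Hypothetical key prefixes; the real checkpoint layout may differ.
ENCODER_PREFIXES = {
    "clip_l": "text_encoders.clip_l.",
    "clip_g": "text_encoders.clip_g.",
    "t5xxl": "text_encoders.t5xxl.",
}


def missing_encoders(state_dict, t5_enabled=True):
    """List encoders with no weights in `state_dict`, i.e. those to load separately."""
    missing = []
    for name, prefix in ENCODER_PREFIXES.items():
        if name == "t5xxl" and not t5_enabled:
            continue  # assumed: T5 is optional and skipped when disabled
        if not any(key.startswith(prefix) for key in state_dict):
            missing.append(name)
    return missing


checkpoint = {"text_encoders.clip_l.some.weight": None, "model.diffusion_model.x": None}
to_fetch = missing_encoders(checkpoint, t5_enabled=False)
```

Each name in the returned list would then correspond to one separate safetensors file to download and load into the matching encoder.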