Implementation:AUTOMATIC1111 Stable diffusion webui SD3 Conditioning

Knowledge Sources	AUTOMATIC1111_Stable_diffusion_webui
Domains	Text Conditioning, Stable Diffusion 3, Model Loading
Last Updated	2025-05-15 00:00 GMT

Overview

Implements the SD3 text conditioning pipeline that combines CLIP-L, CLIP-G, and T5-XXL text encoders to produce the dual conditioning signals (cross-attention context and pooled vector) required by the MM-DiT denoiser.

Description

This module orchestrates the triple text encoder architecture for Stable Diffusion 3:

SafetensorsMapping: A helper class implementing typing.Mapping to lazily read tensors from safetensors files, enabling efficient weight loading without materializing the entire file into memory.

Sd3ClipLG: Wraps the CLIP-L and CLIP-G encoders in a single interface that extends sd_hijack_clip.TextConditionalModel. It tokenizes prompts, runs both CLIP models, and concatenates their hidden states along the feature dimension, padding the result to 4096 features so that it matches the cross-attention width. It also concatenates the two pooled outputs to form the vector condition and handles end-of-sequence token masking for the CLIP-G input.

Sd3T5: Wraps the T5-XXL encoder and supports emphasis/weight syntax through prompt_parser.parse_prompt_attention. When T5 is disabled via shared.opts.sd3_enable_t5, it returns zero tensors instead of encoder output.

SD3Cond: The top-level conditioning module. It initializes all three text encoders, automatically downloads missing encoder weights from HuggingFace URLs, and orchestrates the forward pass. Its forward method returns a dictionary with the keys 'crossattn' (concatenated LG+T5 tokens) and 'vector' (pooled CLIP output). It also provides before_load_weights, which handles loading the encoder weights separately, and medvram_modules, which supports memory optimization.
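
The tensor handling described for Sd3ClipLG, Sd3T5, and SD3Cond can be illustrated with random stand-in tensors. The following is a minimal sketch, not the module's code: the helper names mask_after_eos and combine are hypothetical, the encoder widths (768 for CLIP-L, 1280 for CLIP-G) are assumptions chosen so that the pooled vector comes out at 2048 features, and the zero T5 fallback is assumed to use the same token count as the CLIP branch.

```python
import torch
import torch.nn.functional as F

# Assumed feature widths (not stated in the text above).
CLIP_L_DIM, CLIP_G_DIM, CONTEXT_DIM = 768, 1280, 4096


def mask_after_eos(tokens: torch.Tensor, id_end: int) -> torch.Tensor:
    """Return a copy of `tokens` with every position after the first EOS set to 0.

    Raises ValueError if a row contains no EOS token.
    """
    masked = tokens.clone()
    for row in range(masked.shape[0]):
        first_eos = masked[row].tolist().index(id_end)
        masked[row, first_eos + 1:] = 0
    return masked


def combine(l_out, g_out, l_pooled, g_pooled, t5_out=None):
    """Build the 'crossattn' and 'vector' tensors from per-encoder outputs."""
    # Join CLIP-L and CLIP-G hidden states feature-wise: (B, T, 768 + 1280).
    lg_out = torch.cat([l_out, g_out], dim=-1)
    # Zero-pad the feature dimension on the right up to the context width.
    lg_out = F.pad(lg_out, (0, CONTEXT_DIM - lg_out.shape[-1]))
    # Pooled outputs are joined the same way: (B, 2048).
    vector = torch.cat([l_pooled, g_pooled], dim=-1)
    if t5_out is None:
        # T5 disabled: substitute zeros (token count is an assumption here).
        t5_out = torch.zeros(
            lg_out.shape[0], lg_out.shape[1], CONTEXT_DIM, dtype=lg_out.dtype
        )
    # Append the T5 tokens after the CLIP tokens along the token dimension.
    crossattn = torch.cat([lg_out, t5_out], dim=-2)
    return {"crossattn": crossattn, "vector": vector}


batch, tokens = 2, 77
cond = combine(
    torch.randn(batch, tokens, CLIP_L_DIM),
    torch.randn(batch, tokens, CLIP_G_DIM),
    torch.randn(batch, CLIP_L_DIM),
    torch.randn(batch, CLIP_G_DIM),
)
print(cond["crossattn"].shape)  # torch.Size([2, 154, 4096])
print(cond["vector"].shape)     # torch.Size([2, 2048])
```

Padding rather than projecting keeps the CLIP features unchanged in the first 2048 channels, so the CLIP and T5 token sequences share one width and can be stacked along the token axis.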

Encoder weight URLs point to the AUTOMATIC1111 HuggingFace repository for clip_l, clip_g, and t5xxl_fp16 safetensors files.
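
Those safetensors files are read lazily through SafetensorsMapping. A rough sketch of the idea follows, under stated assumptions: the class name LazyTensorMapping and the FakeHandle stand-in are hypothetical, the wrapped handle is assumed to expose keys() and get_tensor(name) as the handle returned by safetensors.safe_open does, and collections.abc.Mapping is used here in place of typing.Mapping.

```python
from collections.abc import Mapping


class LazyTensorMapping(Mapping):
    """Read-only mapping that fetches one tensor per lookup from an open file handle."""

    def __init__(self, file):
        # `file` is assumed to provide keys() and get_tensor(name).
        self.file = file

    def __len__(self):
        return len(self.file.keys())

    def __iter__(self):
        # Iterating yields tensor names only; no tensor data is read.
        return iter(self.file.keys())

    def __getitem__(self, key):
        # The tensor is read only when it is actually requested.
        return self.file.get_tensor(key)


class FakeHandle:
    """Hypothetical stand-in for an open safetensors file; records which tensors were read."""

    def __init__(self, data):
        self._data = data
        self.reads = []

    def keys(self):
        return list(self._data)

    def get_tensor(self, name):
        self.reads.append(name)
        return self._data[name]


handle = FakeHandle({"a.weight": [1.0, 2.0], "b.weight": [3.0]})
weights = LazyTensorMapping(handle)

print(len(weights), list(weights))  # 2 ['a.weight', 'b.weight']
print(handle.reads)                 # []
print(weights["b.weight"])          # [3.0]
print(handle.reads)                 # ['b.weight']
```

Because the wrapper satisfies the Mapping interface, it can be handed to code that expects a state-dict-like object while only the requested tensors are pulled from disk.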

Usage

Use this module as the text conditioning component for Stable Diffusion 3. It is instantiated as part of the SD3 model pipeline and called during each generation to convert text prompts into the conditioning tensors consumed by the MM-DiT denoiser.

Code Reference

Source Location

Repository: AUTOMATIC1111_Stable_diffusion_webui
File: modules/models/sd3/sd3_cond.py
Lines: 1-222

Signature

class SD3Cond(torch.nn.Module):
    def __init__(self, *args, **kwargs):
    def forward(self, prompts: list[str]):
    def before_load_weights(self, state_dict):
    def medvram_modules(self):
    def get_token_count(self, text):
    def get_target_prompt_token_count(self, token_count):

class Sd3ClipLG(sd_hijack_clip.TextConditionalModel):
    def __init__(self, clip_l, clip_g):
    def encode_with_transformers(self, tokens):

class Sd3T5(torch.nn.Module):
    def __init__(self, t5xxl):
    def forward(self, texts, *, token_count):

Import

from modules.models.sd3.sd3_cond import SD3Cond

I/O Contract

Inputs

Name	Type	Required	Description
prompts	list[str]	Yes	List of text prompt strings to encode

Outputs

Name	Type	Description
result	dict	Dictionary with 'crossattn' (torch.Tensor of concatenated LG+T5 encodings) and 'vector' (torch.Tensor of pooled CLIP output)
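
The output contract can be checked mechanically. The helper below is a hypothetical sketch (check_conditioning is not part of the module) that relies only on a .shape attribute, with the 4096 and 2048 widths taken from the shapes quoted in the usage example on this page.

```python
from collections import namedtuple


def check_conditioning(cond, batch_size, context_dim=4096, vector_dim=2048):
    """Raise ValueError if `cond` does not match the documented layout.

    Returns the number of tokens in the cross-attention context.
    """
    missing = {"crossattn", "vector"} - set(cond)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")

    cross = tuple(cond["crossattn"].shape)
    vector = tuple(cond["vector"].shape)

    # Cross-attention context: (batch, tokens, context_dim).
    if len(cross) != 3 or cross[0] != batch_size or cross[2] != context_dim:
        raise ValueError(f"unexpected crossattn shape: {cross}")
    # Pooled vector: (batch, vector_dim).
    if vector != (batch_size, vector_dim):
        raise ValueError(f"unexpected vector shape: {vector}")
    return cross[1]


# Hypothetical stand-in for a tensor: anything with a .shape attribute works.
FakeTensor = namedtuple("FakeTensor", "shape")

cond = {"crossattn": FakeTensor((1, 154, 4096)), "vector": FakeTensor((1, 2048))}
print(check_conditioning(cond, batch_size=1))  # 154
```

With real outputs, the same function accepts the dictionary returned by the conditioning module, since torch.Tensor also exposes .shape.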

Usage Examples

from modules.models.sd3.sd3_cond import SD3Cond

# SD3Cond is typically instantiated as part of the SD3 model loading pipeline
cond_module = SD3Cond()

# Before loading model weights, handle separate encoder downloads.
# model_state_dict is the checkpoint state dict supplied by the model loader.
cond_module.before_load_weights(model_state_dict)

# Once the weights are loaded, encode prompts into conditioning tensors
conditioning = cond_module(["a photo of a sunset over the ocean"])
# conditioning['crossattn'] -> (1, token_count, 4096) cross-attention context
# conditioning['vector'] -> (1, 2048) pooled CLIP vector

# Get token count for a prompt
token_count = cond_module.get_token_count("a photo of a sunset")

Related Pages

Principle:AUTOMATIC1111_Stable_diffusion_webui_SD3_Text_Conditioning_Pipeline
