Implementation:AUTOMATIC1111 Stable diffusion webui SD3 Supporting Models

Knowledge Sources	AUTOMATIC1111_Stable_diffusion_webui
Domains	Text Encoders, CLIP, T5, Stable Diffusion 3
Last Updated	2025-05-15 00:00 GMT

Overview

Provides standalone implementations of the text encoders (CLIP-L, CLIP-G, and T5-XXL) and the tokenizer wrappers required by Stable Diffusion 3's triple text conditioning, without depending on the model classes of the HuggingFace transformers library.

Description

This module implements the complete text encoding pipeline for SD3 through the following components:

Core Utilities:

AutocastLinear: A custom linear layer that casts its parameters to the input dtype on every forward pass. This matters for T5, which produces near-zero outputs when run in pure float16.
attention: A convenience wrapper around scaled_dot_product_attention that handles head reshaping.
Mlp: A standard two-layer MLP used across CLIP and DiT models.
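
The casting behavior described for AutocastLinear can be illustrated with a minimal sketch. The class name AutocastLinearSketch is hypothetical, and float64 is used only so the example runs on any CPU build of PyTorch; it is not the module's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AutocastLinearSketch(nn.Linear):
    """Linear layer that casts its parameters to the dtype of the input."""

    def forward(self, x):
        # Cast parameters on the fly instead of converting the stored weights.
        weight = self.weight.to(x.dtype)
        bias = self.bias.to(x.dtype) if self.bias is not None else None
        return F.linear(x, weight, bias)


layer = AutocastLinearSketch(4, 2)          # parameters are stored as float32
x = torch.ones(3, 4, dtype=torch.float64)   # input arrives in a different dtype
out = layer(x)                              # output follows the input dtype
```

The stored parameters keep their original precision, so the same layer can serve inputs of different dtypes without a separate conversion step.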

CLIP Models:

CLIPAttention, CLIPLayer, CLIPEncoder, CLIPEmbeddings: Building blocks for the CLIP text encoder, supporting causal masking and intermediate output extraction.
CLIPTextModel_ and CLIPTextModel: Complete CLIP text model with embeddings, encoder stack, final layer norm, text projection, and pooled output extraction.
SDClipModel: Wraps CLIP into the SD interface with configurable layer extraction (last, pooled, hidden) and textual inversion support.
SDXLClipG: Specialized wrapper for the CLIP-G model used in SDXL and SD3.
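
Pooled output extraction can be sketched in plain Python, assuming the common CLIP convention that the pooled vector is the hidden state at the end-of-text position, located as the first maximal token ID (49407 is the highest ID in the OpenAI CLIP vocabulary). The helper names and toy values below are illustrative only.

```python
EOS_ID = 49407  # end-of-text ID in the OpenAI CLIP vocabulary


def pooled_index(token_ids):
    """Return the position of the first maximal token ID in one sequence."""
    return max(range(len(token_ids)), key=lambda i: token_ids[i])


def pool(hidden_states, token_ids):
    """Pick the hidden-state row at the end-of-text position."""
    return hidden_states[pooled_index(token_ids)]


# start token, two arbitrary word IDs, end token, then end-token padding
ids = [49406, 320, 1125, EOS_ID, EOS_ID, EOS_ID]
states = [[float(i)] * 2 for i in range(len(ids))]  # toy 2-dim hidden states
pooled = pool(states, ids)
```

Because the first maximal ID is selected, padding with the end token does not shift the pooled position.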

T5 Models:

T5Attention: Self-attention with relative position bias using bucketed distances.
T5Block, T5Stack, T5: The T5 encoder stack with gated dense feed-forward layers (T5DenseGatedActDense), custom layer norm, and embedding.
T5XXLModel: Wraps T5-XXL into the SDClipModel interface.
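
The bucketed distances behind the relative position bias can be sketched for a single offset. This is a scalar, pure-Python version of the standard bidirectional T5 bucketing scheme, assuming the usual defaults of 32 buckets and a maximum distance of 128; the function name is hypothetical and the module itself works on tensors.

```python
import math


def relative_position_bucket(relative_position, num_buckets=32, max_distance=128):
    """Map a key-minus-query offset to a bucket index (bidirectional case)."""
    num_buckets //= 2                      # half the buckets per direction
    bucket = num_buckets if relative_position > 0 else 0
    distance = abs(relative_position)

    max_exact = num_buckets // 2           # small distances get one bucket each
    if distance < max_exact:
        return bucket + distance

    # Larger distances share logarithmically sized buckets up to max_distance.
    log_ratio = math.log(distance / max_exact) / math.log(max_distance / max_exact)
    large = max_exact + int(log_ratio * (num_buckets - max_exact))
    return bucket + min(large, num_buckets - 1)


buckets = [relative_position_bucket(d) for d in (0, 3, -3, -200, 200)]
```

Each bucket index selects a learned bias per attention head, so nearby tokens get fine-grained biases while distant tokens share a few coarse ones.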

Tokenizers:

SDTokenizer, SDXLClipGTokenizer, T5XXLTokenizer, SD3Tokenizer: Tokenizer wrappers that handle tokenization with weight parsing, start/end tokens, and padding.
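
The start/end token handling and padding can be sketched as follows. The helper wrap_and_pad is hypothetical; the IDs 49406 and 49407 are the CLIP start and end tokens, while the uniform weight of 1.0 and the truncation rule are assumptions made for illustration.

```python
def wrap_and_pad(token_ids, start_id=49406, end_id=49407, pad_id=49407, max_length=77):
    """Return exactly max_length (token_id, weight) pairs for one prompt."""
    body = token_ids[: max_length - 2]     # leave room for start and end tokens
    pairs = [(start_id, 1.0)] + [(t, 1.0) for t in body] + [(end_id, 1.0)]
    pairs += [(pad_id, 1.0)] * (max_length - len(pairs))
    return pairs


batch = [wrap_and_pad([320, 1125])]        # a batch holding one sequence
ids_only = [[t for t, _ in seq] for seq in batch]  # drop weights before encoding
```

pad_id is kept as a separate parameter because not every encoder necessarily pads with its end token.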

Usage

Use these models as the text encoding backbone for Stable Diffusion 3. They are instantiated by the SD3Cond conditioning module to encode text prompts into the cross-attention and vector conditioning tensors required by the MM-DiT denoiser.
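
The shape bookkeeping behind the two conditioning tensors can be sketched as below. It assumes the layout of the SD3 reference pipeline: CLIP-L (768) and CLIP-G (1280) features are concatenated, zero-padded to the T5 width (4096), and joined with the T5 tokens along the sequence axis, while the vector conditioning concatenates the two pooled CLIP outputs. The function name and default sizes are illustrative, not part of this module.

```python
def combined_shapes(batch, clip_seq, t5_seq, l_dim=768, g_dim=1280, t5_dim=4096):
    """Return (cross-attention shape, vector shape, feature padding)."""
    lg_dim = l_dim + g_dim                 # CLIP-L and CLIP-G joined on the feature axis
    pad = t5_dim - lg_dim                  # zero padding needed to reach the T5 width
    if pad < 0:
        raise ValueError("CLIP features are wider than the T5 features")

    crossattn = (batch, clip_seq + t5_seq, t5_dim)  # joined on the sequence axis
    vector = (batch, lg_dim)               # pooled CLIP-L and CLIP-G, concatenated
    return crossattn, vector, pad


crossattn_shape, vector_shape, pad = combined_shapes(batch=1, clip_seq=77, t5_seq=77)
```

With 77 tokens per encoder, this yields a (1, 154, 4096) cross-attention tensor and a (1, 2048) vector.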

Code Reference

Source Location

Repository: AUTOMATIC1111_Stable_diffusion_webui
File: modules/models/sd3/other_impls.py
Lines: 1-510

Signature

class SDClipModel(torch.nn.Module, ClipTokenWeightEncoder):
    def __init__(self, device="cpu", max_length=77, layer="last",
                 layer_idx=None, textmodel_json_config=None, dtype=None,
                 model_class=CLIPTextModel, special_tokens=None,
                 layer_norm_hidden_state=True, return_projected_pooled=True):
    def forward(self, tokens):

class T5(torch.nn.Module):
    def __init__(self, config_dict, dtype, device):
    def forward(self, *args, **kwargs):

class SD3Tokenizer:
    def __init__(self):
    def tokenize_with_weights(self, text: str):

Import

from modules.models.sd3.other_impls import SDClipModel, SDXLClipG, T5XXLModel, SD3Tokenizer

I/O Contract

Inputs

Name	Type	Required	Description
tokens	list[list[int]]	Yes	Batch of token ID sequences for text encoding
text	str	Yes	Raw text string for tokenization (tokenizer input)

Outputs

Name	Type	Description
z	torch.Tensor	Text encoder hidden states (N, seq_len, hidden_dim)
pooled_output	torch.Tensor	Pooled text representation for vector conditioning

Usage Examples

import torch

from modules.models.sd3.other_impls import SDClipModel, SD3Tokenizer

# Create a CLIP-L text encoder
clip_config = {
    "hidden_act": "quick_gelu",
    "hidden_size": 768,
    "intermediate_size": 3072,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
}
clip_l = SDClipModel(
    layer="hidden", layer_idx=-2,
    device="cpu", dtype=torch.float16,
    textmodel_json_config=clip_config,
    layer_norm_hidden_state=False,
    return_projected_pooled=False
)

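# Tokenize text; the result maps encoder keys (here "l" for CLIP-L) to
# batches of (token_id, weight) pairs
tokenizer = SD3Tokenizer()
token_data = tokenizer.tokenize_with_weights("a photo of a cat")

# Encode tokens: drop the weights and pass a batch of token ID lists.
# The encoder weights must be loaded separately for meaningful outputs.
z, pooled = clip_l([[t[0] for t in token_data["l"][0]]])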

Related Pages

Principle:AUTOMATIC1111_Stable_diffusion_webui_Text_Encoder_Architecture
