Implementation:AUTOMATIC1111 Stable diffusion webui SD3 Supporting Models
| Knowledge Sources | |
|---|---|
| Domains | Text Encoders, CLIP, T5, Stable Diffusion 3 |
| Last Updated | 2025-05-15 00:00 GMT |
Overview
Provides standalone implementations of the text encoder models (CLIP-L, CLIP-G, and T5-XXL) and their tokenizers needed by Stable Diffusion 3's triple text conditioning system, independent of the HuggingFace transformers library model classes.
Description
This module implements the complete text encoding pipeline for SD3 through the following components:
Core Utilities:
- AutocastLinear: A custom linear layer that casts its weights to match the input dtype. This is critical for T5, which produces near-zero outputs in pure float16.
- attention: A convenience wrapper around scaled_dot_product_attention that handles head reshaping.
- Mlp: A standard two-layer MLP used across CLIP and DiT models.
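To make the first two utilities concrete, here is a minimal sketch, assuming PyTorch 2.0 or later. The names AutocastLinearSketch and attention_sketch are illustrative and are not the module's own code; they only show the dtype-casting idea and the head reshaping described above.

```python
import torch
import torch.nn.functional as F


class AutocastLinearSketch(torch.nn.Linear):
    """Linear layer that casts its parameters to the input dtype on each call."""

    def forward(self, x):
        weight = self.weight.to(x.dtype)
        bias = self.bias.to(x.dtype) if self.bias is not None else None
        return F.linear(x, weight, bias)


def attention_sketch(q, k, v, heads, mask=None):
    """Split the feature dim into heads, run SDPA, then merge the heads back."""
    b, _, dim = q.shape
    dim_head = dim // heads
    # (batch, seq, dim) -> (batch, heads, seq, dim_head)
    q, k, v = (t.view(b, -1, heads, dim_head).transpose(1, 2) for t in (q, k, v))
    out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
    # (batch, heads, seq, dim_head) -> (batch, seq, dim)
    return out.transpose(1, 2).reshape(b, -1, heads * dim_head)


# Parameters stored in float16, input in float32: the output follows the input.
layer = AutocastLinearSketch(4, 3).half()
y = layer(torch.ones(2, 4, dtype=torch.float32))

# Self-attention over a (batch=2, seq=5, dim=8) tensor with 2 heads.
x = torch.randn(2, 5, 8)
attn_out = attention_sketch(x, x, x, heads=2)
```

Unlike torch.autocast, this approach keeps the computation in the dtype of the input, so a float32 activation passing through float16 weights stays float32.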
CLIP Models:
- CLIPAttention, CLIPLayer, CLIPEncoder, CLIPEmbeddings: Building blocks for the CLIP text encoder, supporting causal masking and intermediate output extraction.
- CLIPTextModel_ and CLIPTextModel: The complete CLIP text model, with embeddings, encoder stack, final layer norm, text projection, and pooled output extraction.
- SDClipModel: Wraps CLIP into the SD interface with configurable layer extraction (last, pooled, hidden) and textual inversion support.
- SDXLClipG: Specialized wrapper for the CLIP-G model used in SDXL and SD3.
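Two ideas from the list above, causal masking and pooled output extraction, can be sketched without any framework. This is an illustration in plain Python, not the module's code: causal_mask and pooled_index are hypothetical helpers, and the pooling rule assumes, as in common CLIP implementations, that the end-of-text token has the largest ID in the vocabulary (49407 for CLIP).

```python
def causal_mask(seq_len):
    """Additive mask: 0.0 where attention is allowed, -inf for future positions."""
    neg_inf = float("-inf")
    return [[0.0 if key <= query else neg_inf for key in range(seq_len)]
            for query in range(seq_len)]


def pooled_index(token_ids):
    """Position of the first occurrence of the largest token ID.

    Assumption: the end-of-text token has the highest ID, so this position
    marks the end of the prompt even when padding repeats that token.
    """
    return max(range(len(token_ids)), key=lambda i: token_ids[i])


mask = causal_mask(3)

# Start token, two arbitrary word tokens, end token, one padding token.
ids = [49406, 320, 1125, 49407, 49407]
eos_position = pooled_index(ids)
```

The pooled output would then be the hidden state at eos_position for each sequence in the batch.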
T5 Models:
- T5Attention: Self-attention with a relative position bias computed from bucketed distances.
- T5Block, T5Stack, T5: The T5 encoder stack with gated dense feed-forward layers (T5DenseGatedActDense), custom layer norm, and embedding.
- T5XXLModel: Wraps T5-XXL into the SDClipModel interface.
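The bucketed distances used by T5Attention follow the scheme from the original T5 design: small offsets each get their own bucket, and larger ones share logarithmically sized buckets. The scalar function below is a sketch of the bidirectional (encoder) case. The name relative_position_bucket and the defaults of 32 buckets and a maximum distance of 128 are assumptions taken from common T5 configurations, not values confirmed by this page.

```python
import math


def relative_position_bucket(relative_position, num_buckets=32, max_distance=128):
    """Map a signed offset (key position minus query position) to a bucket index."""
    # Bidirectional case: half of the buckets serve positive offsets.
    num_buckets //= 2
    bucket = num_buckets if relative_position > 0 else 0
    distance = abs(relative_position)

    # The first half of the remaining buckets map one-to-one to small distances.
    max_exact = num_buckets // 2
    if distance < max_exact:
        return bucket + distance

    # Larger distances share logarithmically spaced buckets up to max_distance.
    log_ratio = math.log(distance / max_exact) / math.log(max_distance / max_exact)
    large = max_exact + int(log_ratio * (num_buckets - max_exact))
    return bucket + min(large, num_buckets - 1)


buckets = [relative_position_bucket(d) for d in (0, 3, -3, -20, -200, 200)]
```

Each bucket index would select a learned bias per attention head, which is added to the attention logits before the softmax.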
Tokenizers:
- SDTokenizer, SDXLClipGTokenizer, T5XXLTokenizer, SD3Tokenizer: Tokenizer wrappers that turn text into sequences of (token, weight) pairs and handle start/end tokens and padding.
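The start/end token and padding handling can be illustrated with a small plain-Python sketch. It is not the module's implementation: frame_tokens is a hypothetical helper, the truncation rule is an assumption, and the defaults (start token 49406, end token 49407, length 77) are the standard CLIP values. Every weight is set to 1.0 here for simplicity.

```python
def frame_tokens(token_ids, max_length=77, start_token=49406, end_token=49407,
                 pad_with_end=True):
    """Wrap raw token IDs with start/end tokens and pad to max_length."""
    pad_token = end_token if pad_with_end else 0
    # Assumption: overly long prompts are truncated to leave room for start/end.
    body = list(token_ids)[: max_length - 2]
    pairs = [(start_token, 1.0)]
    pairs += [(tok, 1.0) for tok in body]
    pairs.append((end_token, 1.0))
    pairs += [(pad_token, 1.0)] * (max_length - len(pairs))
    return [pairs]  # a batch containing one sequence


# Arbitrary example IDs standing in for a tokenized prompt.
batch = frame_tokens([320, 1125, 539, 320, 2368])
token_ids = [tok for tok, _ in batch[0]]
```

Passing pad_with_end=False pads with 0 instead, which models a tokenizer variant that uses a different padding convention.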
Usage
Use these models as the text encoding backbone for Stable Diffusion 3. They are instantiated by the SD3Cond conditioning module to encode text prompts into the cross-attention and vector conditioning tensors required by the MM-DiT denoiser.
Code Reference
Source Location
- Repository: AUTOMATIC1111_Stable_diffusion_webui
- File: modules/models/sd3/other_impls.py
- Lines: 1-510
Signature
```python
class SDClipModel(torch.nn.Module, ClipTokenWeightEncoder):
    def __init__(self, device="cpu", max_length=77, layer="last",
                 layer_idx=None, textmodel_json_config=None, dtype=None,
                 model_class=CLIPTextModel, special_tokens=None,
                 layer_norm_hidden_state=True, return_projected_pooled=True):
    def forward(self, tokens):

class T5(torch.nn.Module):
    def __init__(self, config_dict, dtype, device):
    def forward(self, *args, **kwargs):

class SD3Tokenizer:
    def __init__(self):
    def tokenize_with_weights(self, text: str):
```
Import
```python
from modules.models.sd3.other_impls import SDClipModel, SDXLClipG, T5XXLModel, SD3Tokenizer
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| tokens | list[list[int]] | Yes | Batch of token ID sequences for text encoding |
| text | str | Yes | Raw text string for tokenization (tokenizer input) |
Outputs
| Name | Type | Description |
|---|---|---|
| z | torch.Tensor | Text encoder hidden states (N, seq_len, hidden_dim) |
| pooled_output | torch.Tensor | Pooled text representation for vector conditioning |
Usage Examples
```python
import torch

from modules.models.sd3.other_impls import SDClipModel, SD3Tokenizer

# CLIP-L text encoder configuration
clip_config = {
    "hidden_act": "quick_gelu",
    "hidden_size": 768,
    "intermediate_size": 3072,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
}

# Build the CLIP-L encoder. The constructor only creates the architecture,
# so load checkpoint weights before relying on the outputs.
clip_l = SDClipModel(
    layer="hidden", layer_idx=-2,
    device="cpu", dtype=torch.float16,
    textmodel_json_config=clip_config,
    layer_norm_hidden_state=False,
    return_projected_pooled=False,
)

# Tokenize text; token_data["l"] holds the CLIP-L batches of (token, weight) pairs
tokenizer = SD3Tokenizer()
token_data = tokenizer.tokenize_with_weights("a photo of a cat")

# Encode: drop the weights and pass a batch of token ID sequences
z, pooled = clip_l([[t[0] for t in token_data["l"][0]]])
```