Implementation:AUTOMATIC1111 Stable diffusion webui SD3 Supporting Models

From Leeroopedia
Revision as of 14:04, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/AUTOMATIC1111_Stable_diffusion_webui_SD3_Supporting_Models.md)


Knowledge Sources
Domains: Text Encoders, CLIP, T5, Stable Diffusion 3
Last Updated: 2025-05-15 00:00 GMT

Overview

Provides standalone implementations of the text encoder models (CLIP-L, CLIP-G, and T5-XXL) and their tokenizers needed by Stable Diffusion 3's triple text conditioning system, independent of the HuggingFace transformers library model classes.

Description

This module implements the complete text encoding pipeline for SD3 through the following components:

Core Utilities:

  • AutocastLinear: A custom linear layer that casts its weights to the input dtype. This matters for T5, which produces near-zero outputs when run in pure float16.
  • attention: A convenience wrapper around scaled_dot_product_attention that handles head reshaping.
  • Mlp: A standard two-layer MLP used across CLIP and DiT models.
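
The weight-casting behaviour described for AutocastLinear can be illustrated with a minimal sketch. The class name AutocastLinearSketch is hypothetical and this is not the module's actual code; it only assumes that the layer casts its own parameters to the dtype of the incoming tensor before the matrix multiply.

```python
import torch
import torch.nn.functional as F


class AutocastLinearSketch(torch.nn.Linear):
    """Hypothetical stand-in: a linear layer that casts its parameters
    to the input dtype on every call."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weight = self.weight.to(x.dtype)
        bias = self.bias.to(x.dtype) if self.bias is not None else None
        return F.linear(x, weight, bias)


# Parameters are stored in float16, the input is float32:
# the multiply runs in float32 and the result stays float32.
layer = AutocastLinearSketch(8, 4).half()
out = layer(torch.randn(2, 8, dtype=torch.float32))
print(out.dtype)
```

Under this assumption, weights can be kept in float16 to save memory while the caller chooses the compute precision through the dtype of the input.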

CLIP Models:

  • CLIPAttention, CLIPLayer, CLIPEncoder, CLIPEmbeddings: Building blocks for the CLIP text encoder, supporting causal masking and intermediate output extraction.
  • CLIPTextModel_ and CLIPTextModel: Complete CLIP text model with embeddings, encoder stack, final layer norm, text projection, and pooled output extraction.
  • SDClipModel: Wraps CLIP into the SD interface with configurable layer extraction (last, pooled, hidden) and textual inversion support.
  • SDXLClipG: Specialized wrapper for the CLIP-G model used in SDXL and SD3.
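
Two of the ideas named above, causal masking and pooled output extraction, can be sketched in plain Python. This is an illustration under stated assumptions rather than the module's code: the helper names are hypothetical, the mask is additive, and the pooled position is taken to be the first occurrence of the largest token id (the end-of-text token in the standard CLIP vocabulary).

```python
import math


def causal_mask(seq_len):
    """Additive attention mask: 0.0 where position i may attend to
    position j (j <= i), -inf for future positions."""
    return [
        [0.0 if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]


def pooled_index(token_ids):
    """Index of the first occurrence of the largest token id, assumed
    here to be the end-of-text token whose hidden state is pooled."""
    return max(range(len(token_ids)), key=lambda i: token_ids[i])


mask = causal_mask(3)
# Placeholder ids: start token, two prompt tokens, end token, end-token padding.
ids = [49406, 320, 1125, 49407, 49407]
idx = pooled_index(ids)
```

With end-token padding, several positions hold the largest id, so picking the first one selects the real end of the prompt rather than a padding slot.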

T5 Models:

  • T5Attention: Self-attention with relative position bias using bucketed distances.
  • T5Block, T5Stack, T5: The T5 encoder stack with gated dense feed-forward layers (T5DenseGatedActDense), custom layer norm, and embedding.
  • T5XXLModel: Wraps T5-XXL into the SDClipModel interface.
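
The "bucketed distances" used by the relative position bias can be sketched as a scalar function. This follows the bucketing scheme commonly used in T5 implementations; the function name and the defaults num_buckets=32 and max_distance=128 are assumptions for illustration and may differ from the configuration this module loads.

```python
import math


def relative_position_bucket(relative_position, bidirectional=True,
                             num_buckets=32, max_distance=128):
    """Map a signed offset (key position minus query position) to a bucket id.

    Small offsets get one bucket each; larger offsets share buckets
    that grow logarithmically up to max_distance.
    """
    bucket = 0
    if bidirectional:
        # Half of the buckets are reserved for positive offsets.
        num_buckets //= 2
        if relative_position > 0:
            bucket += num_buckets
        relative_position = abs(relative_position)
    else:
        # Only non-positive offsets (past positions) are distinguished.
        relative_position = max(-relative_position, 0)

    max_exact = num_buckets // 2
    if relative_position < max_exact:
        return bucket + relative_position

    log_ratio = (math.log(relative_position / max_exact)
                 / math.log(max_distance / max_exact))
    large = max_exact + int(log_ratio * (num_buckets - max_exact))
    return bucket + min(large, num_buckets - 1)


buckets = [relative_position_bucket(d) for d in (-200, -8, -1, 0, 1, 8, 200)]
print(buckets)  # [15, 8, 1, 0, 17, 24, 31]
```

Each bucket id indexes a learned per-head bias that is added to the attention logits, so distant tokens share parameters while nearby offsets stay distinct.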

Tokenizers:

  • SDTokenizer, SDXLClipGTokenizer, T5XXLTokenizer, SD3Tokenizer: Tokenizer wrappers that handle tokenization with weight parsing, start/end tokens, and padding.
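
The start/end token and padding handling can be sketched without any tokenizer library. The helper below is hypothetical: it assumes token ids have already been produced, uses the standard CLIP start and end ids (49406 and 49407), assigns a weight of 1.0 to every token, and exposes a pad_with_end switch to show the difference between padding with the end token and padding with 0.

```python
def pad_with_weights(token_ids, max_length=77, start_token=49406,
                     end_token=49407, pad_with_end=True):
    """Wrap ids in start/end tokens, pad to max_length, and pair each id
    with a weight (assumed to be 1.0 for every token in this sketch)."""
    pad_token = end_token if pad_with_end else 0
    body = list(token_ids)[: max_length - 2]  # leave room for start and end
    seq = [start_token] + body + [end_token]
    seq += [pad_token] * (max_length - len(seq))
    return [(tok, 1.0) for tok in seq]


# 320 and 1125 are placeholder ids, not the output of a real tokenizer.
padded_l = pad_with_weights([320, 1125])
padded_g = pad_with_weights([320, 1125], pad_with_end=False)
```

The result has the same shape as the (token, weight) pairs consumed in the usage example further down, where only the token ids are passed to the encoder.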

Usage

Use these models as the text encoding backbone for Stable Diffusion 3. They are instantiated by the SD3Cond conditioning module to encode text prompts into the cross-attention and vector conditioning tensors required by the MM-DiT denoiser.
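
How three encoder outputs might be merged into the two conditioning tensors can be sketched as follows. This page does not spell out the merge, so the function name, the concatenation order, and the dimensions (768 for CLIP-L, 1280 for CLIP-G, 4096 for T5-XXL, 77 tokens each) are assumptions based on the commonly described SD3 layout, with random tensors standing in for real encoder outputs.

```python
import torch
import torch.nn.functional as F


def combine_sd3_conditioning(l_out, g_out, t5_out, l_pooled, g_pooled):
    """Hypothetical merge of encoder outputs into SD3 conditioning."""
    # Join the two CLIP hidden states along the feature axis: 768 + 1280 = 2048.
    lg_out = torch.cat([l_out, g_out], dim=-1)
    # Zero-pad the features so they match the T5 width.
    lg_out = F.pad(lg_out, (0, t5_out.shape[-1] - lg_out.shape[-1]))
    # Stack the CLIP and T5 sequences along the token axis.
    crossattn = torch.cat([lg_out, t5_out], dim=-2)
    # The vector conditioning joins the two pooled CLIP outputs.
    vector = torch.cat([l_pooled, g_pooled], dim=-1)
    return crossattn, vector


n = 1
crossattn, vector = combine_sd3_conditioning(
    torch.randn(n, 77, 768), torch.randn(n, 77, 1280), torch.randn(n, 77, 4096),
    torch.randn(n, 768), torch.randn(n, 1280),
)
print(crossattn.shape, vector.shape)
```

Under these assumptions the cross-attention tensor has shape (1, 154, 4096) and the vector conditioning has shape (1, 2048).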

Code Reference

Source Location

Signature

class SDClipModel(torch.nn.Module, ClipTokenWeightEncoder):
    def __init__(self, device="cpu", max_length=77, layer="last",
                 layer_idx=None, textmodel_json_config=None, dtype=None,
                 model_class=CLIPTextModel, special_tokens=None,
                 layer_norm_hidden_state=True, return_projected_pooled=True):
    def forward(self, tokens):

class T5(torch.nn.Module):
    def __init__(self, config_dict, dtype, device):
    def forward(self, *args, **kwargs):

class SD3Tokenizer:
    def __init__(self):
    def tokenize_with_weights(self, text: str):

Import

from modules.models.sd3.other_impls import SDClipModel, SDXLClipG, T5XXLModel, SD3Tokenizer

I/O Contract

Inputs

Name | Type | Required | Description
tokens | list[list[int]] | Yes | Batch of token ID sequences for text encoding
text | str | Yes | Raw text string for tokenization (tokenizer input)

Outputs

Name | Type | Description
z | torch.Tensor | Text encoder hidden states (N, seq_len, hidden_dim)
pooled_output | torch.Tensor | Pooled text representation for vector conditioning

Usage Examples

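import torch

from modules.models.sd3.other_impls import SDClipModel, SD3Tokenizer

# CLIP-L text encoder configuration
clip_config = {
    "hidden_act": "quick_gelu",
    "hidden_size": 768,
    "intermediate_size": 3072,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
}

# Create a CLIP-L text encoder that returns the penultimate hidden layer.
# No checkpoint is loaded here, so outputs are not meaningful until weights are loaded.
# float16 on CPU may be slow or unsupported for some ops; torch.float32 is a safer CPU choice.
clip_l = SDClipModel(
    layer="hidden", layer_idx=-2,
    device="cpu", dtype=torch.float16,
    textmodel_json_config=clip_config,
    layer_norm_hidden_state=False,
    return_projected_pooled=False,
)

# Tokenize text; the result is keyed by encoder name ("l" is used below)
tokenizer = SD3Tokenizer()
token_data = tokenizer.tokenize_with_weights("a photo of a cat")

# Encode tokens: each entry is a (token_id, weight) pair, so keep only the ids
token_ids = [[t[0] for t in token_data["l"][0]]]
z, pooled = clip_l(token_ids)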

Related Pages
