
Implementation:AUTOMATIC1111 Stable diffusion webui FrozenCLIPEmbedderWithCustomWords forward

From Leeroopedia


Knowledge Sources

  • Domains: Diffusion Models, Natural Language Processing, Text Encoding
  • Last Updated: 2026-02-08 00:00 GMT

Overview

A concrete tool for encoding text prompts into conditioning tensors via the CLIP text transformer, with support for unlimited prompt length, custom attention weighting, and textual inversion embeddings. Provided by the AUTOMATIC1111 stable-diffusion-webui repository.

Description

FrozenCLIPEmbedderWithCustomWords is a PyTorch module that wraps the original FrozenCLIPEmbedder to add three critical capabilities not present in the base CLIP encoder:

  1. Unlimited prompt length -- Prompts exceeding 77 tokens are split into multiple 75-token chunks (plus BOS/EOS tokens), each processed independently through the transformer, then concatenated along the token dimension.
  2. Attention weighting (emphasis) -- Per-token weight multipliers from parse_prompt_attention are applied to the transformer output embeddings using the configured emphasis mode.
  3. Textual inversion embeddings -- Custom embedding vectors can be injected at specific token positions, replacing standard CLIP tokens with learned concept vectors.
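The attention-weighting syntax from capability 2 can be sketched with a simplified parser. The real parse_prompt_attention in modules/prompt_parser.py also handles nested parentheses, escapes, bracket de-emphasis, and the BREAK keyword; this toy version (a hypothetical helper, not the webui code) only handles flat "(text:weight)" spans:

```python
import re

# Minimal sketch: split a prompt into (text, weight) pairs, assuming only
# flat "(text:weight)" emphasis spans. Unweighted text gets weight 1.0.
def parse_weights(prompt):
    pattern = re.compile(r"\(([^():]+):([\d.]+)\)|([^()]+)")
    result = []
    for weighted_text, weight, plain_text in pattern.findall(prompt):
        if weighted_text:
            result.append([weighted_text, float(weight)])
        elif plain_text:
            result.append([plain_text, 1.0])
    return result

print(parse_weights("a (red:1.5) ball"))
# [['a ', 1.0], ['red', 1.5], [' ball', 1.0]]
```

These per-token multipliers are what process_tokens() later applies to the transformer's output embeddings.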

The class hierarchy is:

  • TextConditionalModel -- base class with forward(), process_texts(), and process_tokens()
  • FrozenCLIPEmbedderWithCustomWordsBase -- adds wrapper and hijack support, overrides forward() to support legacy emphasis
  • FrozenCLIPEmbedderWithCustomWords -- SD1.x/SD2.x implementation with CLIP tokenizer, transformer encoding, and comma padding

The forward() method (defined in TextConditionalModel at lines 199-251) performs:

  1. Tokenize all input texts into batched chunks via process_texts()
  2. For each chunk position across the batch, extract tokens and multipliers
  3. Call process_tokens() which runs the CLIP transformer and applies emphasis weighting
  4. Concatenate all chunk results via torch.hstack()
  5. Optionally return pooled output for SDXL
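The chunking in steps 1-4 can be illustrated with a minimal sketch. The token ids below follow CLIP's conventions for SD1 (BOS 49406, EOS 49407, padding with EOS); the helper is illustrative, not the actual process_texts() code, and the transformer call is stubbed out:

```python
# Sketch of how a long prompt is split into fixed 75-token chunks,
# each framed by BOS/EOS to give 77 ids per chunk (assumed CLIP ids;
# the real logic lives in process_texts/process_tokens).
BOS, EOS, PAD = 49406, 49407, 49407  # for SD1, PAD is the EOS token
CHUNK = 75

def split_into_chunks(token_ids):
    chunks = []
    for start in range(0, max(len(token_ids), 1), CHUNK):
        body = token_ids[start:start + CHUNK]
        body = body + [PAD] * (CHUNK - len(body))  # pad final chunk
        chunks.append([BOS] + body + [EOS])        # 77 ids per chunk
    return chunks

chunks = split_into_chunks(list(range(100)))  # 100 fake token ids
# 100 tokens -> 2 chunks of 77 ids each; after each chunk is run through
# the transformer, the results are concatenated with torch.hstack,
# giving T = 154 for this prompt
```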

Usage

This module is called during the conditioning phase of every generation request, both for the positive prompt and the negative prompt. It is accessed through the hijacked model's cond_stage_model attribute, invoked by get_learned_conditioning() in the prompt parser.

Code Reference

Source Location

  • Repository: stable-diffusion-webui
  • File: modules/sd_hijack_clip.py
  • Class definition: Lines 316-368 (FrozenCLIPEmbedderWithCustomWords)
  • Forward method: Lines 199-251 (in TextConditionalModel)
  • Process tokens: Lines 253-285 (in TextConditionalModel)

Signature

class FrozenCLIPEmbedderWithCustomWords(FrozenCLIPEmbedderWithCustomWordsBase):
    def __init__(self, wrapped, hijack):
        ...

    def tokenize(self, texts):
        """Tokenize texts using CLIP tokenizer without truncation."""
        ...

    def encode_with_transformers(self, tokens):
        """Pass tokens through CLIP transformer, respecting CLIP_stop_at_last_layers."""
        ...

# Forward method (inherited from TextConditionalModel):
def forward(self, texts):
    """
    Accepts an array of texts; Passes texts through transformers network to create
    a tensor with numerical representation of those texts.
    Returns a tensor with shape of (B, T, C), where B is length of the array;
    T is length, in tokens, of texts (including padding) - T will be a multiple
    of 77; and C is dimensionality of each token - for SD1 it's 768, for SD2
    it's 1024, and for SDXL it's 1280.
    """

Import

from modules.sd_hijack_clip import FrozenCLIPEmbedderWithCustomWords

I/O Contract

Inputs

  • texts (list[str], required) -- An array of text prompts to encode. Usually a single element, but multiple elements are used for prompt editing (e.g., "a [cat:dog:0.4]"). Each string may contain attention weighting syntax.

Outputs

  • return (torch.Tensor) -- A conditioning tensor of shape (B, T, C) where B is batch size, T is token count (multiple of 77), and C is the embedding dimensionality (768 for SD1, 1024 for SD2, 1280 for SDXL). For SDXL, returns a tuple of (tensor, pooled_tensor) where pooled_tensor has shape (B, 1280).
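As a quick sanity check on this shape contract, the T dimension can be derived from the prompt's token count: one 77-wide chunk per 75 tokens (or fraction thereof), with at least one chunk for empty prompts. The helper below is hypothetical, not part of the webui API:

```python
import math

# Hypothetical helper: compute the T dimension of the conditioning
# tensor from a prompt's token count (75 tokens per chunk, 77 ids
# per chunk after BOS/EOS are added).
def conditioning_length(num_tokens):
    chunks = max(1, math.ceil(num_tokens / 75))
    return chunks * 77

print(conditioning_length(10))   # 77  (one chunk)
print(conditioning_length(100))  # 154 (two chunks)
```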

Usage Examples

Basic Usage

# The forward method is typically called indirectly through the model's
# cond_stage_model during conditioning computation.

import modules.shared as shared

# Access the hijacked CLIP encoder
clip_model = shared.sd_model.cond_stage_model

# Encode a single prompt
conditioning = clip_model(["a beautiful landscape painting"])
# conditioning.shape: (1, 77, 768) for SD1.x

# Encode a long prompt (will be multi-chunk)
long_prompt = "a very detailed description " * 20  # exceeds 77 tokens
conditioning = clip_model([long_prompt])
# conditioning.shape: (1, 154, 768) for SD1.x (2 chunks)

# Encode multiple prompts (for prompt editing)
conditioning = clip_model(["a photo of a cat", "a photo of a dog"])
# conditioning.shape: (2, 77, 768) for SD1.x
