
Principle:AUTOMATIC1111 Stable diffusion webui Prompt encoding

From Leeroopedia


Knowledge Sources
Domains: Diffusion Models, Natural Language Processing, Text Encoding
Last Updated: 2026-02-08 00:00 GMT

Overview

Prompt encoding is the process of converting a text prompt string into a numerical conditioning tensor via a CLIP text encoder, which the diffusion model's cross-attention layers use to guide image generation.

Description

The text-to-image pipeline requires bridging the gap between natural language and the UNet's numerical conditioning interface. This is accomplished by a CLIP text encoder, a transformer neural network trained on image-text pairs that produces embedding vectors capturing the semantic meaning of text tokens.

The encoding process involves several stages:

  1. Tokenization -- The prompt string is split into subword tokens using CLIP's BPE (Byte Pair Encoding) tokenizer. Each token maps to an integer ID.
  2. Chunking -- CLIP's transformer has a fixed context window of 77 tokens (including the start and end tokens, leaving 75 usable tokens). Prompts exceeding this limit are split into multiple 77-token chunks, each processed independently, with the resulting embeddings concatenated afterwards.
  3. Transformer Forward Pass -- Token IDs are passed through the CLIP text transformer, which produces a sequence of embedding vectors. For SD1.x, each token yields a 768-dimensional vector; for SD2.x, a 1024-dimensional vector; for SDXL, a 2048-dimensional vector formed by concatenating the outputs of its dual encoders.
  4. Emphasis/Attention Weighting -- After the transformer produces raw embeddings, per-token weight multipliers (from prompt attention syntax) are applied. The implementation supports multiple emphasis modes that differ in how weights modify the embedding vectors.
  5. Textual Inversion Embedding Injection -- Custom trained embedding vectors can replace specific tokens, enabling learned concepts that the original model was not trained on.
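The tokenization and chunking stages above can be sketched in plain Python. This is a minimal illustration, not the webui's actual code: it assumes the BPE tokenizer has already produced integer token IDs, and it uses CLIP's real start/end token IDs (49406/49407), padding with the end token.

```python
# Sketch of the chunking stage, assuming token IDs already exist.
BOS, EOS, PAD = 49406, 49407, 49407  # CLIP special-token IDs; pad with EOS
CHUNK_SIZE = 75  # usable tokens per 77-token context window

def chunk_token_ids(ids):
    """Split token IDs into 77-token chunks: [BOS] + up to 75 tokens + EOS + padding."""
    chunks = []
    for start in range(0, max(len(ids), 1), CHUNK_SIZE):
        body = ids[start:start + CHUNK_SIZE]
        padding = [PAD] * (CHUNK_SIZE - len(body))
        chunks.append([BOS] + body + [EOS] + padding)
    return chunks

chunks = chunk_token_ids(list(range(100)))  # 100 tokens -> 2 chunks of length 77
```

Each chunk is exactly 77 IDs long, so a 100-token prompt yields two chunks: the first full, the second mostly padding.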

Usage

Prompt encoding is invoked for every generation request, both for the positive prompt (to produce the conditional embedding) and the negative prompt (to produce the unconditional embedding). These two embeddings are required for classifier-free guidance during sampling.
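How the two embeddings feed classifier-free guidance can be shown with a small NumPy sketch. The function below is an illustration of the standard CFG formula, not the webui's sampler code; the array shapes and the guidance scale of 7.5 are placeholder assumptions.

```python
import numpy as np

# Sketch of classifier-free guidance, assuming the sampler has already
# obtained two UNet noise predictions: one conditioned on the positive-prompt
# embedding, one on the negative-prompt (unconditional) embedding.
def cfg_combine(eps_cond, eps_uncond, guidance_scale=7.5):
    """Steer the denoising direction away from the unconditional prediction."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_c = np.ones((4, 64, 64))     # stand-in for the conditional prediction
eps_u = np.zeros((4, 64, 64))    # stand-in for the unconditional prediction
eps = cfg_combine(eps_c, eps_u)  # -> array filled with 7.5
```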

Theoretical Basis

CLIP Text Encoder Architecture

The CLIP text encoder is a causal transformer that processes token sequences:

tokens = [BOS, t1, t2, ..., tn, EOS, PAD, PAD, ...]  (length 77)
embeddings = Transformer(token_embeddings + positional_embeddings)
output shape: (77, C) where C = 768 (SD1) / 1024 (SD2) / 2048 (SDXL, dual encoders concatenated)

For generation, the full sequence of hidden states is used (not just the pooled [EOS] vector), because the UNet's cross-attention layers attend to every position.

Multi-Chunk Encoding

When a prompt exceeds 75 usable tokens, it is split into chunks:

Chunk 1: [BOS, t1, ..., t75, EOS]   -> (77, C)
Chunk 2: [BOS, t76, ..., t150, EOS] -> (77, C)
...
Final: concat along token dimension  -> (77*N, C)

Each chunk is independently processed through the transformer, then the resulting tensors are concatenated along the token dimension. The UNet's cross-attention layers can attend across all chunks.
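The chunk concatenation described above amounts to stacking the per-chunk hidden states along the token axis. A minimal NumPy sketch, with random tensors standing in for real transformer outputs:

```python
import numpy as np

# Sketch of multi-chunk conditioning, assuming each 77-token chunk has
# already been encoded to a (77, C) hidden-state tensor.
C = 768  # SD1.x embedding width
chunk_embeddings = [np.random.randn(77, C) for _ in range(2)]  # two chunks

# Concatenate along the token dimension -> (77 * N, C), here (154, 768).
cond = np.concatenate(chunk_embeddings, axis=0)
```

Cross-attention treats the concatenated result as one long token sequence, which is why attention can span chunk boundaries.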

Emphasis Application

After obtaining raw embeddings z and per-token multipliers m, emphasis is applied. The default method works as:

z_weighted[i] = z[i] * m[i]

This scales each token's embedding vector proportionally, amplifying or attenuating its contribution to cross-attention.
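The per-token scaling is a row-wise multiply broadcast over the channel axis. A sketch of the formula above (the multiplier value 1.1 is an illustrative assumption, as produced by emphasis syntax like "(word:1.1)"):

```python
import numpy as np

# Sketch of the default emphasis mode: each token's embedding row is
# scaled by its attention multiplier.
z = np.ones((77, 768))       # stand-in raw embeddings from the transformer
m = np.ones(77)              # per-token multipliers, default 1.0
m[5] = 1.1                   # hypothetical "(word:1.1)" emphasis on token 5

z_weighted = z * m[:, None]  # broadcast the multiplier over the channel axis
```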

SDXL Dual Encoder

SDXL uses two text encoders simultaneously:

  • CLIP ViT-L/14 -- produces 768-dimensional embeddings (same as SD1)
  • OpenCLIP ViT-bigG/14 -- produces 1280-dimensional embeddings

The outputs are concatenated to form a 2048-dimensional conditioning vector, plus a separate 1280-dimensional pooled vector used for global conditioning.
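The dual-encoder combination is a channel-axis concatenation of the two per-token hidden-state tensors. A NumPy sketch with random stand-ins for the encoder outputs:

```python
import numpy as np

# Sketch of the SDXL dual-encoder combination, assuming both encoders have
# already produced per-token hidden states for the same 77-token chunk.
clip_l = np.random.randn(77, 768)    # CLIP ViT-L/14 hidden states
clip_g = np.random.randn(77, 1280)   # OpenCLIP ViT-bigG/14 hidden states

# Concatenate along the channel axis -> (77, 2048) conditioning tensor.
cond = np.concatenate([clip_l, clip_g], axis=-1)

pooled = np.random.randn(1280)       # separate pooled bigG vector (global conditioning)
```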
