Principle: AUTOMATIC1111 Stable Diffusion WebUI Prompt Composition
| Knowledge Sources | |
|---|---|
| Domains | Diffusion Models, Natural Language Processing, Prompt Engineering |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Prompt composition is the technique of controlling the relative emphasis placed on individual tokens or phrases within a text prompt by assigning numeric weights that modify the corresponding token embeddings before they condition a diffusion model.
Description
In text-to-image diffusion models, the user's prompt is converted into a sequence of token embeddings by a text encoder (typically CLIP). Each token embedding occupies a position in the conditioning tensor that guides the denoising process. By default, every token receives equal weight (1.0). Prompt attention weighting allows users to increase or decrease the influence of specific tokens by wrapping them in parenthesized syntax.
The core idea is straightforward: after the text encoder produces an embedding vector for each token, each vector is scaled by a per-token multiplier. A multiplier greater than 1.0 amplifies that token's contribution to the cross-attention mechanism in the UNet, making the corresponding concept more prominent in the generated image. A multiplier less than 1.0 attenuates it.
The AUTOMATIC1111 WebUI supports several syntactic forms:
- `(text)` -- multiplies the weight of the enclosed tokens by 1.1
- `(text:1.5)` -- sets the weight of the enclosed tokens to exactly 1.5
- `[text]` -- multiplies the weight of the enclosed tokens by 1/1.1 (approximately 0.909)
- Nested parentheses -- multiplicative stacking, e.g., `((text))` yields 1.1 * 1.1 = 1.21

Escape sequences (`\(`, `\[`, `\\`) allow literal bracket characters to appear in a prompt.
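The syntax rules above can be sketched as a small stack-based parser. This is a simplified re-creation, not the WebUI's actual `parse_prompt_attention` (which lives in `prompt_parser.py` and handles more edge cases such as `BREAK` and alternating words); it illustrates how the bracket forms map to per-fragment weights.

```python
import re

def parse_prompt_attention(text):
    """Parse (text), (text:w), and [text] syntax into [fragment, weight] pairs.

    A simplified sketch; the real WebUI parser handles additional syntax.
    """
    res = []            # output: [fragment, weight] pairs
    round_starts = []   # indices into res where a '(' group began
    square_starts = []  # indices into res where a '[' group began

    def multiply_range(start, mult):
        for i in range(start, len(res)):
            res[i][1] *= mult

    # Tokenize into escapes, brackets, explicit ":weight)" closers, and text
    for m in re.finditer(r"\\.|\(|\[|:\s*([\d.]+)\s*\)|\)|\]|[^\\()\[\]:]+|:", text):
        tok, weight = m.group(0), m.group(1)
        if tok.startswith("\\"):
            res.append([tok[1:], 1.0])                          # escaped literal
        elif tok == "(":
            round_starts.append(len(res))
        elif tok == "[":
            square_starts.append(len(res))
        elif weight is not None and round_starts:
            multiply_range(round_starts.pop(), float(weight))   # (text:1.5)
        elif tok == ")" and round_starts:
            multiply_range(round_starts.pop(), 1.1)             # (text)
        elif tok == "]" and square_starts:
            multiply_range(square_starts.pop(), 1 / 1.1)        # [text]
        else:
            res.append([tok, 1.0])
    # Treat unclosed brackets as if closed at the end of the prompt
    for start in round_starts:
        multiply_range(start, 1.1)
    for start in square_starts:
        multiply_range(start, 1 / 1.1)
    return res

print(parse_prompt_attention("a (cat:1.3) on a bench"))
# [['a ', 1.0], ['cat', 1.3], [' on a bench', 1.0]]
```

Note how nesting composes naturally: each closing bracket multiplies everything accumulated since its matching opener, so `((text))` ends up at 1.1 * 1.1 = 1.21 without any special-casing.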
Usage
Prompt composition with attention weighting is used whenever a user wants fine-grained control over which concepts dominate the generated image. Common scenarios include:
- Emphasizing a subject over its background: `a (cat:1.3) sitting on a park bench`
- De-emphasizing unwanted but contextually necessary tokens: `a portrait of a woman with [glasses]`
- Balancing multi-subject compositions: `(astronaut:1.2) and (dinosaur:0.8) on the moon`
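For quick inspection of how a multi-subject prompt distributes emphasis, the explicit `(text:weight)` spans can be pulled out with a regex. This `explicit_weights` helper is hypothetical (not part of the WebUI API) and only covers the explicit-weight form, ignoring plain parentheses and square brackets.

```python
import re

# Hypothetical helper: extract explicit (text:weight) spans from a prompt.
# Only handles the explicit form, not (text) / [text] multipliers.
def explicit_weights(prompt):
    return {m.group(1): float(m.group(2))
            for m in re.finditer(r"\(([^():]+):([\d.]+)\)", prompt)}

print(explicit_weights("(astronaut:1.2) and (dinosaur:0.8) on the moon"))
# {'astronaut': 1.2, 'dinosaur': 0.8}
```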
Theoretical Basis
The mathematical foundation rests on weighted cross-attention. In a standard transformer cross-attention layer:
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
where Q comes from the UNet spatial features and K, V come from the text encoder output. When we scale a token embedding e_i by weight w_i, we effectively produce:
e_i' = w_i * e_i
This modified embedding shifts the key and value vectors for that token, causing the cross-attention to allocate more (or less) spatial attention to the concept associated with that token.
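The effect can be seen in a toy cross-attention computation. This is a minimal sketch with illustrative dimensions, not the real CLIP/UNet sizes, and it treats the embeddings directly as keys (omitting the learned key/value projections): scaling one token's embedding raises its dot-product score with the query, and the softmax reallocates attention mass toward it.

```python
import numpy as np

# Toy cross-attention: softmax(Q K^T / sqrt(d_k)) over four token "keys".
def attention(q, k):
    scores = q @ k.T / np.sqrt(k.shape[-1])
    e = np.exp(scores - scores.max())   # numerically stable softmax
    return e / e.sum()

k = np.eye(4)                  # four orthogonal token embeddings
q = np.ones((1, 4))            # a query equally aligned with every token

print(attention(q, k).round(3))         # equal attention: [[0.25 0.25 0.25 0.25]]

k_weighted = k.copy()
k_weighted[2] *= 1.5           # emphasize token 2: e_i' = w_i * e_i
print(attention(q, k_weighted).round(3))
# token 2 now receives ~0.30 of the attention mass; the others drop to ~0.23
```

Because the weight scales the pre-softmax score, the shift in attention is smooth rather than proportional: a 1.5x embedding weight yields roughly a 1.3x gain in attention mass in this toy setup.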
For nested parentheses, weights are multiplicative:
(((text))) => weight = 1.1 * 1.1 * 1.1 = 1.331
((text:1.5)) => weight = 1.5 * 1.1 = 1.65
Square brackets apply the inverse multiplier:
[text] => weight = 1 / 1.1 ~= 0.909
[[text]] => weight = (1 / 1.1)^2 ~= 0.826
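The bracket arithmetic above reduces to a one-line formula: start from the explicit weight (or 1.0), then multiply by 1.1 per enclosing `(` level and by 1/1.1 per enclosing `[` level. The `nested_weight` helper below is purely illustrative.

```python
# Illustrative helper: compute the effective weight for a token nested
# inside `rounds` parenthesis levels and `squares` bracket levels,
# optionally starting from an explicit (text:weight) value.
def nested_weight(rounds=0, squares=0, explicit=None):
    base = explicit if explicit is not None else 1.0
    return base * (1.1 ** rounds) * ((1 / 1.1) ** squares)

print(round(nested_weight(rounds=3), 3))                # (((text)))   -> 1.331
print(round(nested_weight(rounds=1, explicit=1.5), 2))  # ((text:1.5)) -> 1.65
print(round(nested_weight(squares=2), 3))               # [[text]]     -> 0.826
```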
The BREAK keyword forces a chunk boundary, splitting the prompt into separate 77-token segments processed independently by CLIP.
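A rough sketch of BREAK-style chunking, assuming whitespace splitting as a stand-in for real CLIP tokenization: the prompt is split at each `BREAK`, and every resulting chunk is padded out to the fixed context length and conditioned independently. (In the actual WebUI, each 77-token window also reserves slots for CLIP's BOS/EOS special tokens, leaving 75 content positions; that detail is glossed over here.)

```python
# Sketch of BREAK chunking: split at BREAK keywords, pad each chunk to
# CLIP's fixed context length. Whitespace "tokens" stand in for a real
# tokenizer; pad_token is a placeholder, not CLIP's actual padding id.
CHUNK_LEN = 77

def chunk_prompt(prompt, pad_token="<pad>"):
    chunks = []
    for part in prompt.split("BREAK"):
        tokens = part.split()                     # stand-in for tokenization
        for i in range(0, max(len(tokens), 1), CHUNK_LEN):
            window = tokens[i:i + CHUNK_LEN]
            window += [pad_token] * (CHUNK_LEN - len(window))
            chunks.append(window)                 # each encoded independently
    return chunks

chunks = chunk_prompt("a castle on a hill BREAK oil painting, dramatic light")
print(len(chunks), len(chunks[0]))  # 2 chunks of 77 entries each
```

Because each chunk passes through CLIP separately, concepts placed on opposite sides of a BREAK cannot attend to one another inside the text encoder, which is why BREAK is often used to keep style descriptors from bleeding into subject tokens.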