Principle: AUTOMATIC1111 Stable Diffusion WebUI Prompt Composition
| Knowledge Sources | |
|---|---|
| Domains | Diffusion Models, Natural Language Processing, Prompt Engineering |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Prompt composition is the technique of controlling the relative emphasis placed on individual tokens or phrases within a text prompt by assigning numeric weights that modify the corresponding token embeddings before they condition a diffusion model.
Description
In text-to-image diffusion models, the user's prompt is converted into a sequence of token embeddings by a text encoder (typically CLIP). Each token embedding occupies a position in the conditioning tensor that guides the denoising process. By default, every token receives equal weight (1.0). Prompt attention weighting allows users to increase or decrease the influence of specific tokens by wrapping them in parenthesized syntax.
The core idea is straightforward: after the text encoder produces an embedding vector for each token, each vector is scaled by a per-token multiplier. A multiplier greater than 1.0 amplifies that token's contribution to the cross-attention mechanism in the UNet, making the corresponding concept more prominent in the generated image. A multiplier less than 1.0 attenuates it.
The AUTOMATIC1111 WebUI supports several syntactic forms:
- `(text)` -- multiplies the weight of the enclosed tokens by 1.1
- `(text:1.5)` -- sets the weight of the enclosed tokens to exactly 1.5
- `[text]` -- multiplies the weight of the enclosed tokens by 1/1.1 (approximately 0.909)
- Nested parentheses -- multiplicative stacking, e.g., `((text))` yields 1.1 * 1.1 = 1.21

Escape sequences (`\(`, `\[`, `\\`) allow literal bracket characters to appear in a prompt.
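The syntax rules above can be sketched as a small stack-based parser. This is a simplified re-creation, not the WebUI's actual `parse_prompt_attention` (which lives in `prompt_parser.py` and handles more edge cases such as `BREAK` and alternating words); it illustrates how the bracket forms map to per-fragment weights.

```python
import re

def parse_prompt_attention(text):
    """Parse (text), (text:w), and [text] syntax into [fragment, weight] pairs.

    A simplified sketch; the real WebUI parser handles additional syntax.
    """
    res = []            # output: [fragment, weight] pairs
    round_starts = []   # indices into res where a '(' group began
    square_starts = []  # indices into res where a '[' group began

    def multiply_range(start, mult):
        for i in range(start, len(res)):
            res[i][1] *= mult

    # Tokenize into escapes, brackets, explicit ":weight)" closers, and text
    for m in re.finditer(r"\\.|\(|\[|:\s*([\d.]+)\s*\)|\)|\]|[^\\()\[\]:]+|:", text):
        tok, weight = m.group(0), m.group(1)
        if tok.startswith("\\"):
            res.append([tok[1:], 1.0])                          # escaped literal
        elif tok == "(":
            round_starts.append(len(res))
        elif tok == "[":
            square_starts.append(len(res))
        elif weight is not None and round_starts:
            multiply_range(round_starts.pop(), float(weight))   # (text:1.5)
        elif tok == ")" and round_starts:
            multiply_range(round_starts.pop(), 1.1)             # (text)
        elif tok == "]" and square_starts:
            multiply_range(square_starts.pop(), 1 / 1.1)        # [text]
        else:
            res.append([tok, 1.0])
    # Treat unclosed brackets as if closed at the end of the prompt
    for start in round_starts:
        multiply_range(start, 1.1)
    for start in square_starts:
        multiply_range(start, 1 / 1.1)
    return res

print(parse_prompt_attention("a (cat:1.3) on a bench"))
# [['a ', 1.0], ['cat', 1.3], [' on a bench', 1.0]]
```

Note how nesting composes naturally: each closing bracket multiplies everything accumulated since its matching opener, so `((text))` ends up at 1.1 * 1.1 = 1.21 without any special-casing.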
Usage
Prompt composition with attention weighting is used whenever a user wants fine-grained control over which concepts dominate the generated image. Common scenarios include:
- Emphasizing a subject over its background: `a (cat:1.3) sitting on a park bench`
- De-emphasizing unwanted but contextually necessary tokens: `a portrait of a woman with [glasses]`
- Balancing multi-subject compositions: `(astronaut:1.2) and (dinosaur:0.8) on the moon`
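For quick inspection of how a multi-subject prompt distributes emphasis, the explicit `(text:weight)` spans can be pulled out with a regex. This `explicit_weights` helper is hypothetical (not part of the WebUI API) and only covers the explicit-weight form, ignoring plain parentheses and square brackets.

```python
import re

# Hypothetical helper: extract explicit (text:weight) spans from a prompt.
# Only handles the explicit form, not (text) / [text] multipliers.
def explicit_weights(prompt):
    return {m.group(1): float(m.group(2))
            for m in re.finditer(r"\(([^():]+):([\d.]+)\)", prompt)}

print(explicit_weights("(astronaut:1.2) and (dinosaur:0.8) on the moon"))
# {'astronaut': 1.2, 'dinosaur': 0.8}
```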
Theoretical Basis
The mathematical foundation rests on weighted cross-attention. In a standard transformer cross-attention layer:
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
where Q comes from the UNet spatial features and K, V come from the text encoder output. When we scale a token embedding e_i by weight w_i, we effectively produce:
e_i' = w_i * e_i
This modified embedding shifts the key and value vectors for that token, causing the cross-attention to allocate more (or less) spatial attention to the concept associated with that token.
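The effect can be seen in a toy cross-attention computation. This is a minimal sketch with illustrative dimensions, not the real CLIP/UNet sizes, and it treats the embeddings directly as keys (omitting the learned key/value projections): scaling one token's embedding raises its dot-product score with the query, and the softmax reallocates attention mass toward it.

```python
import numpy as np

# Toy cross-attention: softmax(Q K^T / sqrt(d_k)) over four token "keys".
def attention(q, k):
    scores = q @ k.T / np.sqrt(k.shape[-1])
    e = np.exp(scores - scores.max())   # numerically stable softmax
    return e / e.sum()

k = np.eye(4)                  # four orthogonal token embeddings
q = np.ones((1, 4))            # a query equally aligned with every token

print(attention(q, k).round(3))         # equal attention: [[0.25 0.25 0.25 0.25]]

k_weighted = k.copy()
k_weighted[2] *= 1.5           # emphasize token 2: e_i' = w_i * e_i
print(attention(q, k_weighted).round(3))
# token 2 now receives ~0.30 of the attention mass; the others drop to ~0.23
```

Because the weight scales the pre-softmax score, the shift in attention is smooth rather than proportional: a 1.5x embedding weight yields roughly a 1.3x gain in attention mass in this toy setup.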
For nested parentheses, weights are multiplicative:
(((text))) => weight = 1.1 * 1.1 * 1.1 = 1.331
((text:1.5)) => weight = 1.5 * 1.1 = 1.65
Square brackets apply the inverse multiplier:
[text] => weight = 1 / 1.1 ~= 0.909
[[text]] => weight = (1 / 1.1)^2 ~= 0.826
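The bracket arithmetic above reduces to a one-line formula: start from the explicit weight (or 1.0), then multiply by 1.1 per enclosing `(` level and by 1/1.1 per enclosing `[` level. The `nested_weight` helper below is purely illustrative.

```python
# Illustrative helper: compute the effective weight for a token nested
# inside `rounds` parenthesis levels and `squares` bracket levels,
# optionally starting from an explicit (text:weight) value.
def nested_weight(rounds=0, squares=0, explicit=None):
    base = explicit if explicit is not None else 1.0
    return base * (1.1 ** rounds) * ((1 / 1.1) ** squares)

print(round(nested_weight(rounds=3), 3))                # (((text)))   -> 1.331
print(round(nested_weight(rounds=1, explicit=1.5), 2))  # ((text:1.5)) -> 1.65
print(round(nested_weight(squares=2), 3))               # [[text]]     -> 0.826
```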
The BREAK keyword forces a chunk boundary, splitting the prompt into separate 77-token segments processed independently by CLIP.
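A rough sketch of BREAK-style chunking, assuming whitespace splitting as a stand-in for real CLIP tokenization: the prompt is split at each `BREAK`, and every resulting chunk is padded out to the fixed context length and conditioned independently. (In the actual WebUI, each 77-token window also reserves slots for CLIP's BOS/EOS special tokens, leaving 75 content positions; that detail is glossed over here.)

```python
# Sketch of BREAK chunking: split at BREAK keywords, pad each chunk to
# CLIP's fixed context length. Whitespace "tokens" stand in for a real
# tokenizer; pad_token is a placeholder, not CLIP's actual padding id.
CHUNK_LEN = 77

def chunk_prompt(prompt, pad_token="<pad>"):
    chunks = []
    for part in prompt.split("BREAK"):
        tokens = part.split()                     # stand-in for tokenization
        for i in range(0, max(len(tokens), 1), CHUNK_LEN):
            window = tokens[i:i + CHUNK_LEN]
            window += [pad_token] * (CHUNK_LEN - len(window))
            chunks.append(window)                 # each encoded independently
    return chunks

chunks = chunk_prompt("a castle on a hill BREAK oil painting, dramatic light")
print(len(chunks), len(chunks[0]))  # 2 chunks of 77 entries each
```

Because each chunk passes through CLIP separately, concepts placed on opposite sides of a BREAK cannot attend to one another inside the text encoder, which is why BREAK is often used to keep style descriptors from bleeding into subject tokens.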