# Implementation: Huggingface Diffusers SDXL Encode Prompt
| Knowledge Sources | |
|---|---|
| Domains | Diffusion_Models, Text_Encoding, CLIP, Classifier_Free_Guidance |
| Last Updated | 2026-02-13 21:00 GMT |
## Overview
A concrete tool from the Diffusers library for encoding text prompts into CLIP embeddings that condition the Stable Diffusion XL denoising process.
## Description
StableDiffusionXLPipeline.encode_prompt encodes text prompts through SDXL's dual text encoder architecture. It tokenizes the input prompt using both tokenizers (tokenizer for CLIP ViT-L and tokenizer_2 for OpenCLIP ViT-bigG), passes the token IDs through their respective text encoders, extracts hidden states from the penultimate layer (or an earlier layer if clip_skip is set), and concatenates the outputs along the hidden dimension. The method also extracts the pooled output from the second text encoder.
When classifier-free guidance is enabled (do_classifier_free_guidance=True), the method additionally encodes the negative prompt (or zeros if force_zeros_for_empty_prompt is configured) to produce unconditional embeddings. If LoRA layers are loaded on the text encoders, the method adjusts the LoRA scale before encoding. The method supports pre-computed embeddings via the prompt_embeds and related parameters, allowing users to bypass the encoding step for optimization or manual embedding manipulation.
The second prompt parameter (prompt_2) allows sending a different prompt to the second text encoder, which can be useful for multi-aspect prompt control (for example, describing the subject in one prompt and the style in the other).
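Under classifier-free guidance, the conditional and unconditional embeddings are typically stacked into a single batch before the denoising loop so the UNet processes both in one forward pass. A minimal sketch with dummy tensors standing in for the tensors encode_prompt returns (shapes only; the stacking order shown, unconditional first, matches the common Diffusers convention):

```python
import torch

# Dummy stand-ins for the four tensors encode_prompt returns (shapes only).
batch, seq_len = 1, 77
prompt_embeds = torch.randn(batch, seq_len, 2048)          # 768 (ViT-L) + 1280 (ViT-bigG)
negative_prompt_embeds = torch.randn(batch, seq_len, 2048)
pooled = torch.randn(batch, 1280)
negative_pooled = torch.randn(batch, 1280)

# Stack unconditional first, then conditional, along the batch dimension.
cfg_embeds = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)
cfg_pooled = torch.cat([negative_pooled, pooled], dim=0)

print(tuple(cfg_embeds.shape))  # (2, 77, 2048)
print(tuple(cfg_pooled.shape))  # (2, 1280)
```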
## Usage
This method is called internally by StableDiffusionXLPipeline.__call__ during the standard inference flow. Call it directly when you need to pre-compute embeddings for reuse across multiple generations, implement prompt interpolation, or manually manipulate the embedding tensors before passing them to the pipeline.
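Prompt interpolation, mentioned above, can be done by linearly blending two sets of pre-computed embeddings. A minimal sketch on dummy tensors; with a real pipeline, the two embedding tensors would come from two separate encode_prompt calls (the helper name lerp_embeddings is illustrative, not part of the Diffusers API):

```python
import torch

def lerp_embeddings(a: torch.Tensor, b: torch.Tensor, t: float) -> torch.Tensor:
    """Linearly interpolate between two embedding tensors (t in [0, 1])."""
    return torch.lerp(a, b, t)

# Dummy embeddings standing in for two encode_prompt results.
embeds_a = torch.zeros(1, 77, 2048)
embeds_b = torch.ones(1, 77, 2048)

# A quarter of the way from prompt A toward prompt B.
blended = lerp_embeddings(embeds_a, embeds_b, 0.25)
print(blended[0, 0, 0].item())  # 0.25
```

Sweeping t from 0 to 1 across several generations produces a smooth transition between the two prompts; the pooled embeddings can be interpolated the same way.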
## Code Reference

### Source Location

- Repository: diffusers
- File: src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py
- Lines: 243-444
### Signature

```python
def encode_prompt(
    self,
    prompt: str,
    prompt_2: str | None = None,
    device: torch.device | None = None,
    num_images_per_prompt: int = 1,
    do_classifier_free_guidance: bool = True,
    negative_prompt: str | None = None,
    negative_prompt_2: str | None = None,
    prompt_embeds: torch.Tensor | None = None,
    negative_prompt_embeds: torch.Tensor | None = None,
    pooled_prompt_embeds: torch.Tensor | None = None,
    negative_pooled_prompt_embeds: torch.Tensor | None = None,
    lora_scale: float | None = None,
    clip_skip: int | None = None,
):
```
### Import

```python
from diffusers import StableDiffusionXLPipeline

# encode_prompt is an instance method on the SDXL pipeline
```
## I/O Contract

### Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| prompt | str or list[str] | Yes* | The prompt or prompts to encode. Required unless prompt_embeds is provided. |
| prompt_2 | str or list[str] or None | No | Separate prompt for the second text encoder (tokenizer_2 / text_encoder_2). If None, defaults to prompt. |
| device | torch.device or None | No | Target device for the output tensors. Defaults to the pipeline's execution device. |
| num_images_per_prompt | int | No | Number of images to generate per prompt. The embeddings are repeated accordingly. Defaults to 1. |
| do_classifier_free_guidance | bool | No | Whether to compute unconditional (negative) embeddings for classifier-free guidance. Defaults to True. |
| negative_prompt | str or list[str] or None | No | The negative prompt for guidance. If None and force_zeros_for_empty_prompt is set, zero embeddings are used. |
| negative_prompt_2 | str or list[str] or None | No | Separate negative prompt for the second text encoder. Defaults to negative_prompt. |
| prompt_embeds | torch.Tensor or None | No | Pre-computed prompt embeddings. Bypasses text encoding when provided. |
| negative_prompt_embeds | torch.Tensor or None | No | Pre-computed negative prompt embeddings. |
| pooled_prompt_embeds | torch.Tensor or None | No | Pre-computed pooled prompt embeddings from the second text encoder. |
| negative_pooled_prompt_embeds | torch.Tensor or None | No | Pre-computed negative pooled prompt embeddings. |
| lora_scale | float or None | No | Scale factor applied to LoRA layers in the text encoders. Only effective when LoRA weights are loaded. |
| clip_skip | int or None | No | Number of CLIP layers to skip from the end. A value of 1 uses the pre-final layer output. Commonly used for anime-style models. |
### Outputs

| Name | Type | Description |
|---|---|---|
| prompt_embeds | torch.Tensor | Concatenated text encoder hidden states. Shape: [batch * num_images_per_prompt, seq_len, 2048] for SDXL (768 from ViT-L + 1280 from ViT-bigG). |
| negative_prompt_embeds | torch.Tensor | Unconditional embeddings for classifier-free guidance. Same shape as prompt_embeds. |
| pooled_prompt_embeds | torch.Tensor | Pooled output from the second text encoder. Shape: [batch * num_images_per_prompt, 1280]. |
| negative_pooled_prompt_embeds | torch.Tensor | Pooled negative embeddings. Same shape as pooled_prompt_embeds. |
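The output shapes follow directly from the batch size and num_images_per_prompt. A quick arithmetic check, pure Python with no model needed (the seq_len default of 77 is CLIP's standard token length):

```python
# Expected SDXL embedding shapes, derived from the batching rule above.
def expected_shapes(batch_size: int, num_images_per_prompt: int, seq_len: int = 77):
    n = batch_size * num_images_per_prompt
    return {
        "prompt_embeds": (n, seq_len, 768 + 1280),  # ViT-L + ViT-bigG hidden sizes
        "pooled_prompt_embeds": (n, 1280),          # pooled output of text_encoder_2
    }

shapes = expected_shapes(batch_size=1, num_images_per_prompt=2)
print(shapes["prompt_embeds"])         # (2, 77, 2048)
print(shapes["pooled_prompt_embeds"])  # (2, 1280)
```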
## Usage Examples

### Basic Usage

```python
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Pre-compute prompt embeddings for reuse
(
    prompt_embeds,
    negative_prompt_embeds,
    pooled_prompt_embeds,
    negative_pooled_prompt_embeds,
) = pipe.encode_prompt(
    prompt="A beautiful sunset over the ocean",
    prompt_2=None,  # will use the same prompt
    device="cuda",
    num_images_per_prompt=1,
    do_classifier_free_guidance=True,
    negative_prompt="blurry, low quality",
)

# Use pre-computed embeddings for multiple generations
for seed in range(5):
    image = pipe(
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_prompt_embeds,
        pooled_prompt_embeds=pooled_prompt_embeds,
        negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
        generator=torch.manual_seed(seed),
    ).images[0]
    image.save(f"sunset_{seed}.png")
```
### With Clip Skip

```python
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Skip 1 CLIP layer (common for anime models)
(
    prompt_embeds,
    negative_prompt_embeds,
    pooled_prompt_embeds,
    negative_pooled_prompt_embeds,
) = pipe.encode_prompt(
    prompt="1girl, cherry blossoms, detailed anime style",
    device="cuda",
    do_classifier_free_guidance=True,
    clip_skip=1,
)
```