# Implementation: Huggingface Diffusers SDXL Encode Prompt
| Knowledge Sources | |
|---|---|
| Domains | Diffusion_Models, Text_Encoding, CLIP, Classifier_Free_Guidance |
| Last Updated | 2026-02-13 21:00 GMT |
## Overview
A concrete tool from the Diffusers library for encoding text prompts into CLIP embeddings that condition the Stable Diffusion XL denoising process.
## Description
StableDiffusionXLPipeline.encode_prompt encodes text prompts through SDXL's dual text encoder architecture. It tokenizes the input prompt using both tokenizers (tokenizer for CLIP ViT-L and tokenizer_2 for OpenCLIP ViT-bigG), passes the token IDs through their respective text encoders, extracts hidden states from the penultimate layer (or an earlier layer if clip_skip is set), and concatenates the outputs along the hidden dimension. The method also extracts the pooled output from the second text encoder.
When classifier-free guidance is enabled (do_classifier_free_guidance=True), the method additionally encodes the negative prompt (or zeros if force_zeros_for_empty_prompt is configured) to produce unconditional embeddings. If LoRA layers are loaded on the text encoders, the method adjusts the LoRA scale before encoding. The method supports pre-computed embeddings via the prompt_embeds and related parameters, allowing users to bypass the encoding step for optimization or manual embedding manipulation.
The second prompt parameter (prompt_2) allows sending a different prompt to the second text encoder, which can be useful for multi-aspect prompt control (for example, describing the subject in one prompt and the style in the other).
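Under classifier-free guidance, the conditional and unconditional embeddings are typically stacked into a single batch before the denoising loop so the UNet processes both in one forward pass. A minimal sketch with dummy tensors standing in for the tensors encode_prompt returns (shapes only; the stacking order shown, unconditional first, matches the common Diffusers convention):

```python
import torch

# Dummy stand-ins for the four tensors encode_prompt returns (shapes only).
batch, seq_len = 1, 77
prompt_embeds = torch.randn(batch, seq_len, 2048)          # 768 (ViT-L) + 1280 (ViT-bigG)
negative_prompt_embeds = torch.randn(batch, seq_len, 2048)
pooled = torch.randn(batch, 1280)
negative_pooled = torch.randn(batch, 1280)

# Stack unconditional first, then conditional, along the batch dimension.
cfg_embeds = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)
cfg_pooled = torch.cat([negative_pooled, pooled], dim=0)

print(tuple(cfg_embeds.shape))  # (2, 77, 2048)
print(tuple(cfg_pooled.shape))  # (2, 1280)
```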
## Usage
This method is called internally by StableDiffusionXLPipeline.__call__ during the standard inference flow. Call it directly when you need to pre-compute embeddings for reuse across multiple generations, implement prompt interpolation, or manually manipulate the embedding tensors before passing them to the pipeline.
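Prompt interpolation, mentioned above, can be done by linearly blending two sets of pre-computed embeddings. A minimal sketch on dummy tensors; with a real pipeline, the two embedding tensors would come from two separate encode_prompt calls (the helper name lerp_embeddings is illustrative, not part of the Diffusers API):

```python
import torch

def lerp_embeddings(a: torch.Tensor, b: torch.Tensor, t: float) -> torch.Tensor:
    """Linearly interpolate between two embedding tensors (t in [0, 1])."""
    return torch.lerp(a, b, t)

# Dummy embeddings standing in for two encode_prompt results.
embeds_a = torch.zeros(1, 77, 2048)
embeds_b = torch.ones(1, 77, 2048)

# A quarter of the way from prompt A toward prompt B.
blended = lerp_embeddings(embeds_a, embeds_b, 0.25)
print(blended[0, 0, 0].item())  # 0.25
```

Sweeping t from 0 to 1 across several generations produces a smooth transition between the two prompts; the pooled embeddings can be interpolated the same way.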
## Code Reference

### Source Location

- Repository: diffusers
- File: src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py
- Lines: 243-444
### Signature

```python
def encode_prompt(
    self,
    prompt: str,
    prompt_2: str | None = None,
    device: torch.device | None = None,
    num_images_per_prompt: int = 1,
    do_classifier_free_guidance: bool = True,
    negative_prompt: str | None = None,
    negative_prompt_2: str | None = None,
    prompt_embeds: torch.Tensor | None = None,
    negative_prompt_embeds: torch.Tensor | None = None,
    pooled_prompt_embeds: torch.Tensor | None = None,
    negative_pooled_prompt_embeds: torch.Tensor | None = None,
    lora_scale: float | None = None,
    clip_skip: int | None = None,
):
```
### Import

```python
from diffusers import StableDiffusionXLPipeline

# encode_prompt is an instance method on the SDXL pipeline
```
## I/O Contract

### Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| prompt | str or list[str] | Yes* | The prompt or prompts to encode. Required unless prompt_embeds is provided. |
| prompt_2 | str or list[str] or None | No | Separate prompt for the second text encoder (tokenizer_2 / text_encoder_2). If None, defaults to prompt. |
| device | torch.device or None | No | Target device for the output tensors. Defaults to the pipeline's execution device. |
| num_images_per_prompt | int | No | Number of images to generate per prompt. The embeddings are repeated accordingly. Defaults to 1. |
| do_classifier_free_guidance | bool | No | Whether to compute unconditional (negative) embeddings for classifier-free guidance. Defaults to True. |
| negative_prompt | str or list[str] or None | No | The negative prompt for guidance. If None and force_zeros_for_empty_prompt is set, zero embeddings are used. |
| negative_prompt_2 | str or list[str] or None | No | Separate negative prompt for the second text encoder. Defaults to negative_prompt. |
| prompt_embeds | torch.Tensor or None | No | Pre-computed prompt embeddings. Bypasses text encoding when provided. |
| negative_prompt_embeds | torch.Tensor or None | No | Pre-computed negative prompt embeddings. |
| pooled_prompt_embeds | torch.Tensor or None | No | Pre-computed pooled prompt embeddings from the second text encoder. |
| negative_pooled_prompt_embeds | torch.Tensor or None | No | Pre-computed negative pooled prompt embeddings. |
| lora_scale | float or None | No | Scale factor applied to LoRA layers in the text encoders. Only effective when LoRA weights are loaded. |
| clip_skip | int or None | No | Number of CLIP layers to skip from the end. A value of 1 uses the pre-final layer output. Commonly used for anime-style models. |
### Outputs

| Name | Type | Description |
|---|---|---|
| prompt_embeds | torch.Tensor | Concatenated text encoder hidden states. Shape: [batch * num_images_per_prompt, seq_len, 2048] for SDXL (768 from ViT-L + 1280 from ViT-bigG). |
| negative_prompt_embeds | torch.Tensor | Unconditional embeddings for classifier-free guidance. Same shape as prompt_embeds. |
| pooled_prompt_embeds | torch.Tensor | Pooled output from the second text encoder. Shape: [batch * num_images_per_prompt, 1280]. |
| negative_pooled_prompt_embeds | torch.Tensor | Pooled negative embeddings. Same shape as pooled_prompt_embeds. |
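The output shapes follow directly from the batch size and num_images_per_prompt. A quick arithmetic check, pure Python with no model needed (the seq_len default of 77 is CLIP's standard token length):

```python
# Expected SDXL embedding shapes, derived from the batching rule above.
def expected_shapes(batch_size: int, num_images_per_prompt: int, seq_len: int = 77):
    n = batch_size * num_images_per_prompt
    return {
        "prompt_embeds": (n, seq_len, 768 + 1280),  # ViT-L + ViT-bigG hidden sizes
        "pooled_prompt_embeds": (n, 1280),          # pooled output of text_encoder_2
    }

shapes = expected_shapes(batch_size=1, num_images_per_prompt=2)
print(shapes["prompt_embeds"])         # (2, 77, 2048)
print(shapes["pooled_prompt_embeds"])  # (2, 1280)
```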
## Usage Examples

### Basic Usage

```python
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Pre-compute prompt embeddings for reuse
(
    prompt_embeds,
    negative_prompt_embeds,
    pooled_prompt_embeds,
    negative_pooled_prompt_embeds,
) = pipe.encode_prompt(
    prompt="A beautiful sunset over the ocean",
    prompt_2=None,  # will use the same prompt
    device="cuda",
    num_images_per_prompt=1,
    do_classifier_free_guidance=True,
    negative_prompt="blurry, low quality",
)

# Use pre-computed embeddings for multiple generations
for seed in range(5):
    image = pipe(
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_prompt_embeds,
        pooled_prompt_embeds=pooled_prompt_embeds,
        negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
        generator=torch.manual_seed(seed),
    ).images[0]
    image.save(f"sunset_{seed}.png")
```
### With Clip Skip

```python
from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Skip 1 CLIP layer (common for anime models)
(
    prompt_embeds,
    negative_prompt_embeds,
    pooled_prompt_embeds,
    negative_pooled_prompt_embeds,
) = pipe.encode_prompt(
    prompt="1girl, cherry blossoms, detailed anime style",
    device="cuda",
    do_classifier_free_guidance=True,
    clip_skip=1,
)
```