
Implementation:Huggingface Diffusers SDXL Encode Prompt

From Leeroopedia
Knowledge Sources
Domains Diffusion_Models, Text_Encoding, CLIP, Classifier_Free_Guidance
Last Updated 2026-02-13 21:00 GMT

Overview

A concrete tool, provided by the Diffusers library, for encoding text prompts into CLIP embeddings that condition the Stable Diffusion XL denoising process.

Description

StableDiffusionXLPipeline.encode_prompt encodes text prompts through SDXL's dual text encoder architecture. It tokenizes the input prompt using both tokenizers (tokenizer for CLIP ViT-L and tokenizer_2 for OpenCLIP ViT-bigG), passes the token IDs through their respective text encoders, extracts hidden states from the penultimate layer (or an earlier layer if clip_skip is set), and concatenates the outputs along the hidden dimension. The method also extracts the pooled output from the second text encoder.
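The concatenation step can be illustrated shape-only with dummy tensors standing in for the two encoders' penultimate hidden states (the 77-token sequence length and the 768/1280 hidden sizes are the standard SDXL values; no model is loaded here):

```python
import torch

# Dummy penultimate-layer hidden states standing in for the two text encoders
# (batch=2, seq_len=77): CLIP ViT-L outputs 768-dim, OpenCLIP ViT-bigG 1280-dim.
hidden_vit_l = torch.randn(2, 77, 768)
hidden_vit_bigg = torch.randn(2, 77, 1280)

# encode_prompt concatenates the per-encoder hidden states along the last
# (hidden) dimension, yielding the 2048-dim SDXL prompt embeddings.
prompt_embeds = torch.cat([hidden_vit_l, hidden_vit_bigg], dim=-1)
print(prompt_embeds.shape)  # torch.Size([2, 77, 2048])

# The pooled embedding comes from the second encoder only (1280-dim).
pooled_prompt_embeds = torch.randn(2, 1280)
```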

When classifier-free guidance is enabled (do_classifier_free_guidance=True), the method additionally encodes the negative prompt (or zeros if force_zeros_for_empty_prompt is configured) to produce unconditional embeddings. If LoRA layers are loaded on the text encoders, the method adjusts the LoRA scale before encoding. The method supports pre-computed embeddings via the prompt_embeds and related parameters, allowing users to bypass the encoding step for optimization or manual embedding manipulation.
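When no negative prompt is supplied and the pipeline was loaded with force_zeros_for_empty_prompt=True (the default for the official SDXL checkpoints), the unconditional branch reduces to zero tensors of the conditional shapes. A minimal sketch with dummy tensors, not a model call:

```python
import torch

# Dummy conditional embeddings in SDXL shapes (batch=1, seq_len=77).
prompt_embeds = torch.randn(1, 77, 2048)
pooled_prompt_embeds = torch.randn(1, 1280)

# With force_zeros_for_empty_prompt and negative_prompt=None, encode_prompt
# zeroes out the unconditional branch instead of encoding an empty string.
negative_prompt_embeds = torch.zeros_like(prompt_embeds)
negative_pooled_prompt_embeds = torch.zeros_like(pooled_prompt_embeds)
```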

The second prompt parameter (prompt_2) allows sending a different prompt to the second text encoder, which can be useful for multi-aspect prompt control.

Usage

This method is called internally by StableDiffusionXLPipeline.__call__ during the standard inference flow. Call it directly when you need to pre-compute embeddings for reuse across multiple generations, implement prompt interpolation, or manually manipulate the embedding tensors before passing them to the pipeline.
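Prompt interpolation, mentioned above, amounts to blending two prompts' embedding tensors before the pipeline call. A shape-only sketch with dummy tensors standing in for two encode_prompt results (a real script would feed each blended pair to the pipeline):

```python
import torch

# Dummy embeddings standing in for encode_prompt outputs of two prompts.
embeds_a, pooled_a = torch.randn(1, 77, 2048), torch.randn(1, 1280)
embeds_b, pooled_b = torch.randn(1, 77, 2048), torch.randn(1, 1280)

# Linearly interpolate both the sequence embeddings and the pooled vector.
for t in (0.0, 0.5, 1.0):
    embeds_t = torch.lerp(embeds_a, embeds_b, t)
    pooled_t = torch.lerp(pooled_a, pooled_b, t)
    # pipe(prompt_embeds=embeds_t, pooled_prompt_embeds=pooled_t, ...) here
```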

Code Reference

Source Location

  • Repository: diffusers
  • File: src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py
  • Lines: 243-444

Signature

def encode_prompt(
    self,
    prompt: str,
    prompt_2: str | None = None,
    device: torch.device | None = None,
    num_images_per_prompt: int = 1,
    do_classifier_free_guidance: bool = True,
    negative_prompt: str | None = None,
    negative_prompt_2: str | None = None,
    prompt_embeds: torch.Tensor | None = None,
    negative_prompt_embeds: torch.Tensor | None = None,
    pooled_prompt_embeds: torch.Tensor | None = None,
    negative_pooled_prompt_embeds: torch.Tensor | None = None,
    lora_scale: float | None = None,
    clip_skip: int | None = None,
):

Import

from diffusers import StableDiffusionXLPipeline
# encode_prompt is an instance method on the SDXL pipeline

I/O Contract

Inputs

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| prompt | str or list[str] | Yes* | The prompt or prompts to encode. Required unless prompt_embeds is provided. |
| prompt_2 | str or list[str] or None | No | Separate prompt for the second text encoder (tokenizer_2 / text_encoder_2). If None, defaults to prompt. |
| device | torch.device or None | No | Target device for the output tensors. Defaults to the pipeline's execution device. |
| num_images_per_prompt | int | No | Number of images to generate per prompt. The embeddings are repeated accordingly. Defaults to 1. |
| do_classifier_free_guidance | bool | No | Whether to compute unconditional (negative) embeddings for classifier-free guidance. Defaults to True. |
| negative_prompt | str or list[str] or None | No | The negative prompt for guidance. If None and force_zeros_for_empty_prompt is set, zero embeddings are used. |
| negative_prompt_2 | str or list[str] or None | No | Separate negative prompt for the second text encoder. Defaults to negative_prompt. |
| prompt_embeds | torch.Tensor or None | No | Pre-computed prompt embeddings. Bypasses text encoding when provided. |
| negative_prompt_embeds | torch.Tensor or None | No | Pre-computed negative prompt embeddings. |
| pooled_prompt_embeds | torch.Tensor or None | No | Pre-computed pooled prompt embeddings from the second text encoder. |
| negative_pooled_prompt_embeds | torch.Tensor or None | No | Pre-computed negative pooled prompt embeddings. |
| lora_scale | float or None | No | Scale factor applied to LoRA layers in the text encoders. Only effective when LoRA weights are loaded. |
| clip_skip | int or None | No | Number of CLIP layers to skip from the end. A value of 1 uses the pre-final layer output. Commonly used for anime-style models. |

Outputs

| Name | Type | Description |
| --- | --- | --- |
| prompt_embeds | torch.Tensor | Concatenated text encoder hidden states. Shape: [batch * num_images_per_prompt, seq_len, 2048] for SDXL (768 from ViT-L + 1280 from ViT-bigG). |
| negative_prompt_embeds | torch.Tensor | Unconditional embeddings for classifier-free guidance. Same shape as prompt_embeds. |
| pooled_prompt_embeds | torch.Tensor | Pooled output from the second text encoder. Shape: [batch * num_images_per_prompt, 1280]. |
| negative_pooled_prompt_embeds | torch.Tensor | Pooled negative embeddings. Same shape as pooled_prompt_embeds. |
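During denoising, the pipeline stacks the unconditional and conditional embeddings along the batch dimension so both guidance branches run in a single UNet forward pass (negative first, the Diffusers convention). A sketch with dummy tensors:

```python
import torch

prompt_embeds = torch.randn(1, 77, 2048)           # conditional branch
negative_prompt_embeds = torch.zeros(1, 77, 2048)  # unconditional branch

# Classifier-free guidance: batch the two branches together.
cfg_embeds = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)
print(cfg_embeds.shape)  # torch.Size([2, 77, 2048])
```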

Usage Examples

Basic Usage

from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Pre-compute prompt embeddings for reuse
(
    prompt_embeds,
    negative_prompt_embeds,
    pooled_prompt_embeds,
    negative_pooled_prompt_embeds,
) = pipe.encode_prompt(
    prompt="A beautiful sunset over the ocean",
    prompt_2=None,  # will use the same prompt
    device="cuda",
    num_images_per_prompt=1,
    do_classifier_free_guidance=True,
    negative_prompt="blurry, low quality",
)

# Use pre-computed embeddings for multiple generations
for seed in range(5):
    image = pipe(
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_prompt_embeds,
        pooled_prompt_embeds=pooled_prompt_embeds,
        negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
        generator=torch.manual_seed(seed),
    ).images[0]
    image.save(f"sunset_{seed}.png")

With CLIP Skip

from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Skip 1 CLIP layer (common for anime models)
(
    prompt_embeds,
    negative_prompt_embeds,
    pooled_prompt_embeds,
    negative_pooled_prompt_embeds,
) = pipe.encode_prompt(
    prompt="1girl, cherry blossoms, detailed anime style",
    device="cuda",
    do_classifier_free_guidance=True,
    clip_skip=1,
)
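The clip_skip indexing can be sketched with a plain list standing in for the encoder's hidden_states tuple. In the Diffusers source, clip_skip=None selects hidden_states[-2] and clip_skip=k selects hidden_states[-(k + 2)], so clip_skip=1 reaches one layer earlier than the default; the helper below mirrors that logic (it is an illustration, not part of the library API):

```python
# A list stands in for text_encoder(...).hidden_states (embeddings + 12 layers).
hidden_states = [f"layer_{i}" for i in range(13)]

def select_hidden_state(hidden_states, clip_skip=None):
    """Mirror of the layer-selection logic inside encode_prompt."""
    if clip_skip is None:
        return hidden_states[-2]            # penultimate layer (default)
    return hidden_states[-(clip_skip + 2)]  # skip clip_skip extra layers

print(select_hidden_state(hidden_states))               # layer_11
print(select_hidden_state(hidden_states, clip_skip=1))  # layer_10
```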

Related Pages

Implements Principle

Requires Environment
