Implementation: Hugging Face Diffusers SDXL Pipeline Call
| Knowledge Sources | |
|---|---|
| Domains | Diffusion_Models, Denoising, Latent_Diffusion, Classifier_Free_Guidance |
| Last Updated | 2026-02-13 21:00 GMT |
Overview
Concrete tool for executing the full text-to-image generation pipeline including prompt encoding, denoising, and latent decoding provided by the Diffusers library.
Description
StableDiffusionXLPipeline.__call__ is the main entry point for generating images with SDXL. When a pipeline instance is called (e.g., pipe("a photo of a cat")), this method orchestrates the entire generation workflow:
- Input validation: Checks prompt types, dimensions, and parameter consistency.
- Prompt encoding: Calls encode_prompt with both text encoders to produce conditional and unconditional embeddings.
- Timestep preparation: Configures the scheduler with the requested number of inference steps.
- Latent initialization: Creates random Gaussian noise latents (or uses provided ones) at the correct shape for the UNet.
- Added conditioning: Computes SDXL-specific time IDs encoding original size, crop coordinates, and target size.
- Denoising loop: Iterates over timesteps, running the UNet with classifier-free guidance and the scheduler step function.
- VAE decoding: Unscales the denoised latents and decodes them through the VAE. Handles VAE upcasting to float32 when needed.
- Post-processing: Applies optional watermarking and converts the raw tensor to the requested output format via VaeImageProcessor.postprocess.
- Cleanup: Calls maybe_free_model_hooks to offload models if CPU offloading is active.
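The heart of the denoising loop above is the classifier-free guidance combination of the UNet's two predictions. A minimal sketch in plain PyTorch (the names `noise_pred_uncond` and `noise_pred_text` are illustrative, not the pipeline's internal variables):

```python
import torch

def cfg_step(noise_pred_uncond: torch.Tensor,
             noise_pred_text: torch.Tensor,
             guidance_scale: float) -> torch.Tensor:
    """Classifier-free guidance: push the prediction away from the
    unconditional branch toward the text-conditioned branch."""
    return noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

# Illustrative call with dummy predictions
uncond = torch.zeros(1, 4, 8, 8)
text = torch.ones(1, 4, 8, 8)
guided = cfg_step(uncond, text, guidance_scale=5.0)
```

With guidance_scale=1.0 this reduces to the text-conditioned prediction alone, which is why values above 1.0 are what actually enable guidance.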
The method supports numerous advanced features including custom timestep schedules, IP-Adapter image conditioning, denoising_end for refiner pipeline handoff, guidance rescale for zero-terminal-SNR correction, and step-end callbacks for intermediate inspection.
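The guidance-rescale feature mentioned above corrects over-saturation under zero-terminal-SNR schedules by matching the guided prediction's variance to the text-conditioned prediction. A sketch of the idea, mirroring (but not reproducing verbatim) the logic of the library's rescale helper:

```python
import torch

def rescale_guided_noise(noise_cfg: torch.Tensor,
                         noise_pred_text: torch.Tensor,
                         guidance_rescale: float) -> torch.Tensor:
    """Rescale the guided prediction so its per-sample std matches the
    text-conditioned prediction, then blend by guidance_rescale."""
    dims = list(range(1, noise_pred_text.ndim))
    std_text = noise_pred_text.std(dim=dims, keepdim=True)
    std_cfg = noise_cfg.std(dim=dims, keepdim=True)
    rescaled = noise_cfg * (std_text / std_cfg)
    return guidance_rescale * rescaled + (1.0 - guidance_rescale) * noise_cfg

x = torch.randn(2, 4, 8, 8)
y = 3.0 * x  # over-amplified guided prediction
out = rescale_guided_noise(y, x, guidance_rescale=1.0)
```

With guidance_rescale=0.0 (the default) the input passes through unchanged, which is why the feature is opt-in.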
Usage
Call this method (via pipe(...)) to generate images from text prompts. This is the standard inference API for SDXL text-to-image generation. All parameters have sensible defaults, so minimal usage only requires a prompt string.
Code Reference
Source Location
- Repository: diffusers
- File: src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py
- Lines: 976-1301
Signature
@torch.no_grad()
def __call__(
self,
    prompt: str | list[str] | None = None,
prompt_2: str | list[str] | None = None,
height: int | None = None,
width: int | None = None,
num_inference_steps: int = 50,
    timesteps: list[int] | None = None,
    sigmas: list[float] | None = None,
denoising_end: float | None = None,
guidance_scale: float = 5.0,
negative_prompt: str | list[str] | None = None,
negative_prompt_2: str | list[str] | None = None,
num_images_per_prompt: int | None = 1,
eta: float = 0.0,
generator: torch.Generator | list[torch.Generator] | None = None,
latents: torch.Tensor | None = None,
prompt_embeds: torch.Tensor | None = None,
negative_prompt_embeds: torch.Tensor | None = None,
pooled_prompt_embeds: torch.Tensor | None = None,
negative_pooled_prompt_embeds: torch.Tensor | None = None,
ip_adapter_image: PipelineImageInput | None = None,
ip_adapter_image_embeds: list[torch.Tensor] | None = None,
output_type: str | None = "pil",
return_dict: bool = True,
cross_attention_kwargs: dict[str, Any] | None = None,
guidance_rescale: float = 0.0,
original_size: tuple[int, int] | None = None,
crops_coords_top_left: tuple[int, int] = (0, 0),
target_size: tuple[int, int] | None = None,
negative_original_size: tuple[int, int] | None = None,
negative_crops_coords_top_left: tuple[int, int] = (0, 0),
negative_target_size: tuple[int, int] | None = None,
clip_skip: int | None = None,
callback_on_step_end: Callable | PipelineCallback | MultiPipelineCallbacks | None = None,
callback_on_step_end_tensor_inputs: list[str] = ["latents"],
**kwargs,
) -> StableDiffusionXLPipelineOutput | tuple:
Import
from diffusers import StableDiffusionXLPipeline
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| prompt | str or list[str] | Yes* | The text prompt(s) for image generation. Required unless prompt_embeds is provided. |
| prompt_2 | str or list[str] | No | Separate prompt for the second text encoder. Defaults to prompt. |
| height | int | No | Height of the generated image in pixels. Defaults to unet.config.sample_size * vae_scale_factor (1024 for SDXL). |
| width | int | No | Width of the generated image in pixels. Defaults to unet.config.sample_size * vae_scale_factor (1024 for SDXL). |
| num_inference_steps | int | No | Number of denoising steps. More steps generally yield higher quality at the expense of speed. Defaults to 50. |
| guidance_scale | float | No | Classifier-free guidance scale. Higher values increase prompt adherence. Defaults to 5.0. Values above 1.0 enable guidance. |
| negative_prompt | str or list[str] | No | Prompt(s) describing what to avoid in the generated image. Used for classifier-free guidance. |
| generator | torch.Generator or list[torch.Generator] | No | PyTorch random number generator(s) for reproducible generation. |
| num_images_per_prompt | int | No | Number of images to generate per prompt. Defaults to 1. |
| output_type | str | No | Output format: "pil", "np", "pt", or "latent". Defaults to "pil". |
| return_dict | bool | No | Whether to return a StableDiffusionXLPipelineOutput or a plain tuple. Defaults to True. |
| denoising_end | float | No | Fraction (0.0-1.0) of the denoising process to complete. Used for base+refiner pipeline setups. |
| guidance_rescale | float | No | Guidance rescale factor for zero-terminal-SNR correction. Defaults to 0.0 (disabled). |
| original_size | tuple[int, int] | No | SDXL micro-conditioning: original image size. Defaults to (height, width). |
| crops_coords_top_left | tuple[int, int] | No | SDXL micro-conditioning: crop coordinates. Defaults to (0, 0). |
| target_size | tuple[int, int] | No | SDXL micro-conditioning: target size. Defaults to (height, width). |
| callback_on_step_end | Callable or PipelineCallback | No | Function called at the end of each denoising step for inspection or modification. |
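The three micro-conditioning tuples (original_size, crops_coords_top_left, target_size) are concatenated into SDXL's "added time IDs" before being embedded. A minimal sketch of that packing (the real pipeline additionally embeds and projects the values, and appends text-encoder pooled embeddings):

```python
def pack_time_ids(original_size, crops_coords_top_left, target_size):
    """Concatenate the SDXL micro-conditioning tuples into the flat
    six-element list that is embedded as added time IDs."""
    return list(original_size) + list(crops_coords_top_left) + list(target_size)

# Default conditioning for a 1024x1024 generation with no cropping
ids = pack_time_ids((1024, 1024), (0, 0), (1024, 1024))
```

Setting original_size smaller than target_size tells the model the training image was upscaled, which tends to produce softer results; matching them is the usual choice.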
Outputs
| Name | Type | Description |
|---|---|---|
| images | list[PIL.Image.Image] or np.ndarray or torch.Tensor | The generated images in the format specified by output_type. Wrapped in StableDiffusionXLPipelineOutput if return_dict=True. |
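For the tensor and array output types, the decoded VAE output is denormalized from roughly [-1, 1] to [0, 1] before conversion. A sketch of that step (the actual VaeImageProcessor.postprocess also handles channel ordering and PIL conversion):

```python
import torch

def denormalize(images: torch.Tensor) -> torch.Tensor:
    """Map decoded VAE output from [-1, 1] to [0, 1], clamping overshoot."""
    return (images / 2 + 0.5).clamp(0, 1)

x = torch.tensor([-1.0, 0.0, 1.0, 2.0])
y = denormalize(x)
```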
Usage Examples
Basic Usage
from diffusers import StableDiffusionXLPipeline
import torch
pipe = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
variant="fp16",
use_safetensors=True,
).to("cuda")
# Simple text-to-image generation
result = pipe(
prompt="An astronaut riding a horse on the moon, photorealistic",
num_inference_steps=30,
guidance_scale=7.5,
generator=torch.manual_seed(42),
)
result.images[0].save("astronaut.png")
With Negative Prompt and Custom Size
from diffusers import StableDiffusionXLPipeline
import torch
pipe = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
).to("cuda")
image = pipe(
prompt="A professional photo of a golden retriever in a garden",
negative_prompt="blurry, low quality, distorted, watermark",
height=1024,
width=1024,
num_inference_steps=40,
guidance_scale=7.0,
generator=torch.manual_seed(123),
).images[0]
image.save("golden_retriever.png")
With Step Callback for Progress Monitoring
from diffusers import StableDiffusionXLPipeline
import torch
pipe = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
).to("cuda")
def on_step_end(pipeline, step, timestep, callback_kwargs):
print(f"Step {step}, timestep {timestep}")
return callback_kwargs
image = pipe(
prompt="A cyberpunk cityscape at night with neon lights",
num_inference_steps=30,
callback_on_step_end=on_step_end,
).images[0]
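Beyond printing progress, the same hook can capture intermediate latents for inspection: callback_kwargs exposes the tensors named in callback_on_step_end_tensor_inputs (the default is "latents"). A sketch, shown here with a dummy invocation in place of a real pipeline run:

```python
import torch

captured = {}

def capture_latents(pipeline, step, timestep, callback_kwargs):
    """Store a CPU copy of the current latents at each denoising step."""
    captured[step] = callback_kwargs["latents"].detach().cpu().clone()
    return callback_kwargs

# Illustrative call with a stand-in pipeline and a dummy latent tensor;
# in real use, pass capture_latents as callback_on_step_end to pipe(...)
_ = capture_latents(None, 0, 999, {"latents": torch.randn(1, 4, 128, 128)})
```

Because the callback must return callback_kwargs, it can also modify the latents in place to steer generation mid-run.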