
Implementation:Huggingface Diffusers SDXL Pipeline Call

From Leeroopedia
Domains: Diffusion_Models, Denoising, Latent_Diffusion, Classifier_Free_Guidance
Last Updated: 2026-02-13 21:00 GMT

Overview

A concrete tool, provided by the Diffusers library, for executing the full text-to-image generation pipeline: prompt encoding, denoising, and latent decoding.

Description

StableDiffusionXLPipeline.__call__ is the main entry point for generating images with SDXL. When a pipeline instance is called (e.g., pipe("a photo of a cat")), this method orchestrates the entire generation workflow:

  1. Input validation: Checks prompt types, dimensions, and parameter consistency.
  2. Prompt encoding: Calls encode_prompt with both text encoders to produce conditional and unconditional embeddings.
  3. Timestep preparation: Configures the scheduler with the requested number of inference steps.
  4. Latent initialization: Creates random Gaussian noise latents (or uses provided ones) at the correct shape for the UNet.
  5. Added conditioning: Computes SDXL-specific time IDs encoding original size, crop coordinates, and target size.
  6. Denoising loop: Iterates over timesteps, running the UNet with classifier-free guidance and the scheduler step function.
  7. VAE decoding: Unscales the denoised latents and decodes them through the VAE. Handles VAE upcasting to float32 when needed.
  8. Post-processing: Applies optional watermarking and converts the raw tensor to the requested output format via VaeImageProcessor.postprocess.
  9. Cleanup: Calls maybe_free_model_hooks to offload models if CPU offloading is active.

The method supports numerous advanced features including custom timestep schedules, IP-Adapter image conditioning, denoising_end for refiner pipeline handoff, guidance rescale for zero-terminal-SNR correction, and step-end callbacks for intermediate inspection.
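The arithmetic behind step 6 (classifier-free guidance) and the guidance_rescale correction can be sketched in a few lines. This is an illustrative reimplementation, not the pipeline's actual code; the function and variable names are mine, and the real implementation computes the standard deviations per sample rather than over the whole tensor.

```python
import numpy as np

def cfg_combine(noise_uncond, noise_text, guidance_scale, guidance_rescale=0.0):
    """Sketch of the classifier-free guidance step (illustrative only).

    Each denoising step runs the UNet on both the unconditional and the
    conditional embeddings, then extrapolates away from the
    unconditional prediction.
    """
    noise_cfg = noise_uncond + guidance_scale * (noise_text - noise_uncond)
    if guidance_rescale > 0.0:
        # Zero-terminal-SNR correction: pull the combined prediction's
        # standard deviation back toward that of the conditional branch,
        # blended by guidance_rescale.
        std_text = noise_text.std()
        std_cfg = noise_cfg.std()
        rescaled = noise_cfg * (std_text / std_cfg)
        noise_cfg = guidance_rescale * rescaled + (1.0 - guidance_rescale) * noise_cfg
    return noise_cfg

# With guidance_scale=1.0 the result reduces to the conditional prediction.
uncond = np.array([0.1, -0.2, 0.3])
text = np.array([0.4, 0.1, -0.5])
out = cfg_combine(uncond, text, guidance_scale=1.0)
```

Note how guidance_scale=1.0 makes guidance a no-op, which is why the pipeline only treats values above 1.0 as enabling classifier-free guidance.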

Usage

Call this method (via pipe(...)) to generate images from text prompts. It is the standard inference API for SDXL text-to-image generation. All parameters have sensible defaults, so minimal usage requires only a prompt string.

Code Reference

Source Location

  • Repository: diffusers
  • File: src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py
  • Lines: 976-1301

Signature

@torch.no_grad()
def __call__(
    self,
    prompt: str | list[str] = None,
    prompt_2: str | list[str] | None = None,
    height: int | None = None,
    width: int | None = None,
    num_inference_steps: int = 50,
    timesteps: list[int] = None,
    sigmas: list[float] = None,
    denoising_end: float | None = None,
    guidance_scale: float = 5.0,
    negative_prompt: str | list[str] | None = None,
    negative_prompt_2: str | list[str] | None = None,
    num_images_per_prompt: int | None = 1,
    eta: float = 0.0,
    generator: torch.Generator | list[torch.Generator] | None = None,
    latents: torch.Tensor | None = None,
    prompt_embeds: torch.Tensor | None = None,
    negative_prompt_embeds: torch.Tensor | None = None,
    pooled_prompt_embeds: torch.Tensor | None = None,
    negative_pooled_prompt_embeds: torch.Tensor | None = None,
    ip_adapter_image: PipelineImageInput | None = None,
    ip_adapter_image_embeds: list[torch.Tensor] | None = None,
    output_type: str | None = "pil",
    return_dict: bool = True,
    cross_attention_kwargs: dict[str, Any] | None = None,
    guidance_rescale: float = 0.0,
    original_size: tuple[int, int] | None = None,
    crops_coords_top_left: tuple[int, int] = (0, 0),
    target_size: tuple[int, int] | None = None,
    negative_original_size: tuple[int, int] | None = None,
    negative_crops_coords_top_left: tuple[int, int] = (0, 0),
    negative_target_size: tuple[int, int] | None = None,
    clip_skip: int | None = None,
    callback_on_step_end: Callable | PipelineCallback | MultiPipelineCallbacks | None = None,
    callback_on_step_end_tensor_inputs: list[str] = ["latents"],
    **kwargs,
) -> StableDiffusionXLPipelineOutput | tuple:

Import

from diffusers import StableDiffusionXLPipeline

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| prompt | str or list[str] | Yes* | The text prompt(s) for image generation. Required unless prompt_embeds is provided. |
| prompt_2 | str or list[str] | No | Separate prompt for the second text encoder. Defaults to prompt. |
| height | int | No | Height of the generated image in pixels. Defaults to unet.config.sample_size * vae_scale_factor (1024 for SDXL). |
| width | int | No | Width of the generated image in pixels. Defaults to unet.config.sample_size * vae_scale_factor (1024 for SDXL). |
| num_inference_steps | int | No | Number of denoising steps. More steps generally yield higher quality at the expense of speed. Defaults to 50. |
| guidance_scale | float | No | Classifier-free guidance scale. Higher values increase prompt adherence; values above 1.0 enable guidance. Defaults to 5.0. |
| negative_prompt | str or list[str] | No | Prompt(s) describing what to avoid in the generated image. Used for classifier-free guidance. |
| generator | torch.Generator or list[torch.Generator] | No | PyTorch random number generator(s) for reproducible generation. |
| num_images_per_prompt | int | No | Number of images to generate per prompt. Defaults to 1. |
| output_type | str | No | Output format: "pil", "np", "pt", or "latent". Defaults to "pil". |
| return_dict | bool | No | Whether to return a StableDiffusionXLPipelineOutput or a plain tuple. Defaults to True. |
| denoising_end | float | No | Fraction (0.0-1.0) of the denoising process to complete. Used for base+refiner pipeline setups. |
| guidance_rescale | float | No | Guidance rescale factor for zero-terminal-SNR correction. Defaults to 0.0 (disabled). |
| original_size | tuple[int, int] | No | SDXL micro-conditioning: original image size. Defaults to (height, width). |
| crops_coords_top_left | tuple[int, int] | No | SDXL micro-conditioning: crop coordinates. Defaults to (0, 0). |
| target_size | tuple[int, int] | No | SDXL micro-conditioning: target size. Defaults to (height, width). |
| callback_on_step_end | Callable or PipelineCallback | No | Function called at the end of each denoising step for inspection or modification. |
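The default height/width and the shape of the initial noise latents follow from simple arithmetic. A minimal sketch, assuming SDXL base's usual configuration (unet.config.sample_size = 128, a VAE scale factor of 8, and 4 latent channels); in practice these values are read from the loaded model's config, not hard-coded.

```python
# Assumed SDXL base config values (read from the model config in practice).
SAMPLE_SIZE = 128        # unet.config.sample_size
VAE_SCALE_FACTOR = 8     # 2 ** (len(vae.config.block_out_channels) - 1)
LATENT_CHANNELS = 4      # unet.config.in_channels

def default_hw():
    # Default image size when height/width are not passed:
    # sample_size * vae_scale_factor on each side.
    side = SAMPLE_SIZE * VAE_SCALE_FACTOR
    return side, side

def latent_shape(batch_size, num_images_per_prompt, height, width):
    # Shape of the initial Gaussian noise latents fed to the UNet:
    # one latent per requested image, spatially downscaled by the VAE factor.
    return (batch_size * num_images_per_prompt,
            LATENT_CHANNELS,
            height // VAE_SCALE_FACTOR,
            width // VAE_SCALE_FACTOR)

print(default_hw())                       # (1024, 1024)
print(latent_shape(1, 2, 1024, 1024))     # (2, 4, 128, 128)
```

This is also why passing your own latents requires the downscaled shape, not the pixel-space one.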

Outputs

| Name | Type | Description |
|---|---|---|
| images | list[PIL.Image.Image], np.ndarray, or torch.Tensor | The generated images in the format specified by output_type. Wrapped in StableDiffusionXLPipelineOutput if return_dict=True. |

Usage Examples

Basic Usage

from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
).to("cuda")

# Simple text-to-image generation
result = pipe(
    prompt="An astronaut riding a horse on the moon, photorealistic",
    num_inference_steps=30,
    guidance_scale=7.5,
    generator=torch.manual_seed(42),
)
result.images[0].save("astronaut.png")

With Negative Prompt and Custom Size

from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    prompt="A professional photo of a golden retriever in a garden",
    negative_prompt="blurry, low quality, distorted, watermark",
    height=1024,
    width=1024,
    num_inference_steps=40,
    guidance_scale=7.0,
    generator=torch.manual_seed(123),
).images[0]
image.save("golden_retriever.png")

With Step Callback for Progress Monitoring

from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

def on_step_end(pipeline, step, timestep, callback_kwargs):
    print(f"Step {step}, timestep {timestep}")
    return callback_kwargs

image = pipe(
    prompt="A cyberpunk cityscape at night with neon lights",
    num_inference_steps=30,
    callback_on_step_end=on_step_end,
).images[0]
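A step-end callback can also collect intermediate latents for later inspection, since "latents" is in callback_on_step_end_tensor_inputs by default and arrives via callback_kwargs. A sketch of such a collector; the loop at the bottom is a stand-in for the real pipeline call so the snippet runs without downloading a model.

```python
class LatentCollector:
    """Step-end callback that records intermediate latents.

    The pipeline passes the tensors named in
    callback_on_step_end_tensor_inputs (default: ["latents"]) through
    callback_kwargs; the callback must return the (possibly modified) dict.
    """
    def __init__(self):
        self.steps = []
        self.latents = []

    def __call__(self, pipeline, step_index, timestep, callback_kwargs):
        self.steps.append(step_index)
        self.latents.append(callback_kwargs["latents"])
        return callback_kwargs

# Stand-in for pipe(prompt, callback_on_step_end=collector): the real
# pipeline invokes the callback once per denoising step.
collector = LatentCollector()
for step, t in enumerate([999, 500, 1]):
    collector(None, step, t, {"latents": f"latents@{t}"})
```

With a real pipeline, you would pass callback_on_step_end=collector and afterwards decode collector.latents through the VAE to visualize how the image forms.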

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
