Implementation: Hugging Face Diffusers SDXL Pipeline Call
| Knowledge Sources | |
|---|---|
| Domains | Diffusion_Models, Denoising, Latent_Diffusion, Classifier_Free_Guidance |
| Last Updated | 2026-02-13 21:00 GMT |
Overview
Concrete tool for executing the full text-to-image generation pipeline including prompt encoding, denoising, and latent decoding provided by the Diffusers library.
Description
StableDiffusionXLPipeline.__call__ is the main entry point for generating images with SDXL. When a pipeline instance is called (e.g., pipe("a photo of a cat")), this method orchestrates the entire generation workflow:
- Input validation: Checks prompt types, dimensions, and parameter consistency.
- Prompt encoding: Calls encode_prompt with both text encoders to produce conditional and unconditional embeddings.
- Timestep preparation: Configures the scheduler with the requested number of inference steps.
- Latent initialization: Creates random Gaussian noise latents (or uses provided ones) at the correct shape for the UNet.
- Added conditioning: Computes SDXL-specific time IDs encoding original size, crop coordinates, and target size.
- Denoising loop: Iterates over timesteps, running the UNet with classifier-free guidance and the scheduler step function.
- VAE decoding: Unscales the denoised latents and decodes them through the VAE. Handles VAE upcasting to float32 when needed.
- Post-processing: Applies optional watermarking and converts the raw tensor to the requested output format via VaeImageProcessor.postprocess.
- Cleanup: Calls maybe_free_model_hooks to offload models if CPU offloading is active.
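The heart of the denoising loop above is the classifier-free guidance combination of the UNet's two predictions. A minimal sketch in plain PyTorch (the names `noise_pred_uncond` and `noise_pred_text` are illustrative, not the pipeline's internal variables):

```python
import torch

def cfg_step(noise_pred_uncond: torch.Tensor,
             noise_pred_text: torch.Tensor,
             guidance_scale: float) -> torch.Tensor:
    """Classifier-free guidance: push the prediction away from the
    unconditional branch toward the text-conditioned branch."""
    return noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

# Illustrative call with dummy predictions
uncond = torch.zeros(1, 4, 8, 8)
text = torch.ones(1, 4, 8, 8)
guided = cfg_step(uncond, text, guidance_scale=5.0)
```

With guidance_scale=1.0 this reduces to the text-conditioned prediction alone, which is why values above 1.0 are what actually enable guidance.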
The method supports numerous advanced features including custom timestep schedules, IP-Adapter image conditioning, denoising_end for refiner pipeline handoff, guidance rescale for zero-terminal-SNR correction, and step-end callbacks for intermediate inspection.
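The guidance-rescale feature mentioned above corrects over-saturation under zero-terminal-SNR schedules by matching the guided prediction's variance to the text-conditioned prediction. A sketch of the idea, mirroring (but not reproducing verbatim) the logic of the library's rescale helper:

```python
import torch

def rescale_guided_noise(noise_cfg: torch.Tensor,
                         noise_pred_text: torch.Tensor,
                         guidance_rescale: float) -> torch.Tensor:
    """Rescale the guided prediction so its per-sample std matches the
    text-conditioned prediction, then blend by guidance_rescale."""
    dims = list(range(1, noise_pred_text.ndim))
    std_text = noise_pred_text.std(dim=dims, keepdim=True)
    std_cfg = noise_cfg.std(dim=dims, keepdim=True)
    rescaled = noise_cfg * (std_text / std_cfg)
    return guidance_rescale * rescaled + (1.0 - guidance_rescale) * noise_cfg

x = torch.randn(2, 4, 8, 8)
y = 3.0 * x  # over-amplified guided prediction
out = rescale_guided_noise(y, x, guidance_rescale=1.0)
```

With guidance_rescale=0.0 (the default) the input passes through unchanged, which is why the feature is opt-in.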
Usage
Call this method (via pipe(...)) to generate images from text prompts. This is the standard inference API for SDXL text-to-image generation. All parameters have sensible defaults, so minimal usage only requires a prompt string.
Code Reference
Source Location
- Repository: diffusers
- File: src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py
- Lines: 976-1301
Signature
@torch.no_grad()
def __call__(
self,
    prompt: str | list[str] | None = None,
prompt_2: str | list[str] | None = None,
height: int | None = None,
width: int | None = None,
num_inference_steps: int = 50,
    timesteps: list[int] | None = None,
    sigmas: list[float] | None = None,
denoising_end: float | None = None,
guidance_scale: float = 5.0,
negative_prompt: str | list[str] | None = None,
negative_prompt_2: str | list[str] | None = None,
num_images_per_prompt: int | None = 1,
eta: float = 0.0,
generator: torch.Generator | list[torch.Generator] | None = None,
latents: torch.Tensor | None = None,
prompt_embeds: torch.Tensor | None = None,
negative_prompt_embeds: torch.Tensor | None = None,
pooled_prompt_embeds: torch.Tensor | None = None,
negative_pooled_prompt_embeds: torch.Tensor | None = None,
ip_adapter_image: PipelineImageInput | None = None,
ip_adapter_image_embeds: list[torch.Tensor] | None = None,
output_type: str | None = "pil",
return_dict: bool = True,
cross_attention_kwargs: dict[str, Any] | None = None,
guidance_rescale: float = 0.0,
original_size: tuple[int, int] | None = None,
crops_coords_top_left: tuple[int, int] = (0, 0),
target_size: tuple[int, int] | None = None,
negative_original_size: tuple[int, int] | None = None,
negative_crops_coords_top_left: tuple[int, int] = (0, 0),
negative_target_size: tuple[int, int] | None = None,
clip_skip: int | None = None,
callback_on_step_end: Callable | PipelineCallback | MultiPipelineCallbacks | None = None,
callback_on_step_end_tensor_inputs: list[str] = ["latents"],
**kwargs,
) -> StableDiffusionXLPipelineOutput | tuple:
Import
from diffusers import StableDiffusionXLPipeline
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| prompt | str or list[str] | Yes* | The text prompt(s) for image generation. Required unless prompt_embeds is provided. |
| prompt_2 | str or list[str] | No | Separate prompt for the second text encoder. Defaults to prompt. |
| height | int | No | Height of the generated image in pixels. Defaults to unet.config.sample_size * vae_scale_factor (1024 for SDXL). |
| width | int | No | Width of the generated image in pixels. Defaults to unet.config.sample_size * vae_scale_factor (1024 for SDXL). |
| num_inference_steps | int | No | Number of denoising steps. More steps generally yield higher quality at the expense of speed. Defaults to 50. |
| guidance_scale | float | No | Classifier-free guidance scale. Higher values increase prompt adherence. Defaults to 5.0. Values above 1.0 enable guidance. |
| negative_prompt | str or list[str] | No | Prompt(s) describing what to avoid in the generated image. Used for classifier-free guidance. |
| generator | torch.Generator or list[torch.Generator] | No | PyTorch random number generator(s) for reproducible generation. |
| num_images_per_prompt | int | No | Number of images to generate per prompt. Defaults to 1. |
| output_type | str | No | Output format: "pil", "np", "pt", or "latent". Defaults to "pil". |
| return_dict | bool | No | Whether to return a StableDiffusionXLPipelineOutput or a plain tuple. Defaults to True. |
| denoising_end | float | No | Fraction (0.0-1.0) of the denoising process to complete. Used for base+refiner pipeline setups. |
| guidance_rescale | float | No | Guidance rescale factor for zero-terminal-SNR correction. Defaults to 0.0 (disabled). |
| original_size | tuple[int, int] | No | SDXL micro-conditioning: original image size. Defaults to (height, width). |
| crops_coords_top_left | tuple[int, int] | No | SDXL micro-conditioning: crop coordinates. Defaults to (0, 0). |
| target_size | tuple[int, int] | No | SDXL micro-conditioning: target size. Defaults to (height, width). |
| callback_on_step_end | Callable or PipelineCallback | No | Function called at the end of each denoising step for inspection or modification. |
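The three micro-conditioning tuples (original_size, crops_coords_top_left, target_size) are concatenated into SDXL's "added time IDs" before being embedded. A minimal sketch of that packing (the real pipeline additionally embeds and projects the values, and appends text-encoder pooled embeddings):

```python
def pack_time_ids(original_size, crops_coords_top_left, target_size):
    """Concatenate the SDXL micro-conditioning tuples into the flat
    six-element list that is embedded as added time IDs."""
    return list(original_size) + list(crops_coords_top_left) + list(target_size)

# Default conditioning for a 1024x1024 generation with no cropping
ids = pack_time_ids((1024, 1024), (0, 0), (1024, 1024))
```

Setting original_size smaller than target_size tells the model the training image was upscaled, which tends to produce softer results; matching them is the usual choice.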
Outputs
| Name | Type | Description |
|---|---|---|
| images | list[PIL.Image.Image] or np.ndarray or torch.Tensor | The generated images in the format specified by output_type. Wrapped in StableDiffusionXLPipelineOutput if return_dict=True. |
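For the tensor and array output types, the decoded VAE output is denormalized from roughly [-1, 1] to [0, 1] before conversion. A sketch of that step (the actual VaeImageProcessor.postprocess also handles channel ordering and PIL conversion):

```python
import torch

def denormalize(images: torch.Tensor) -> torch.Tensor:
    """Map decoded VAE output from [-1, 1] to [0, 1], clamping overshoot."""
    return (images / 2 + 0.5).clamp(0, 1)

x = torch.tensor([-1.0, 0.0, 1.0, 2.0])
y = denormalize(x)
```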
Usage Examples
Basic Usage
from diffusers import StableDiffusionXLPipeline
import torch
pipe = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
variant="fp16",
use_safetensors=True,
).to("cuda")
# Simple text-to-image generation
result = pipe(
prompt="An astronaut riding a horse on the moon, photorealistic",
num_inference_steps=30,
guidance_scale=7.5,
generator=torch.manual_seed(42),
)
result.images[0].save("astronaut.png")
With Negative Prompt and Custom Size
from diffusers import StableDiffusionXLPipeline
import torch
pipe = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
).to("cuda")
image = pipe(
prompt="A professional photo of a golden retriever in a garden",
negative_prompt="blurry, low quality, distorted, watermark",
height=1024,
width=1024,
num_inference_steps=40,
guidance_scale=7.0,
generator=torch.manual_seed(123),
).images[0]
image.save("golden_retriever.png")
With Step Callback for Progress Monitoring
from diffusers import StableDiffusionXLPipeline
import torch
pipe = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
).to("cuda")
def on_step_end(pipeline, step, timestep, callback_kwargs):
print(f"Step {step}, timestep {timestep}")
return callback_kwargs
image = pipe(
prompt="A cyberpunk cityscape at night with neon lights",
num_inference_steps=30,
callback_on_step_end=on_step_end,
).images[0]
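Beyond printing progress, the same hook can capture intermediate latents for inspection: callback_kwargs exposes the tensors named in callback_on_step_end_tensor_inputs (the default is "latents"). A sketch, shown here with a dummy invocation in place of a real pipeline run:

```python
import torch

captured = {}

def capture_latents(pipeline, step, timestep, callback_kwargs):
    """Store a CPU copy of the current latents at each denoising step."""
    captured[step] = callback_kwargs["latents"].detach().cpu().clone()
    return callback_kwargs

# Illustrative call with a stand-in pipeline and a dummy latent tensor;
# in real use, pass capture_latents as callback_on_step_end to pipe(...)
_ = capture_latents(None, 0, 999, {"latents": torch.randn(1, 4, 128, 128)})
```

Because the callback must return callback_kwargs, it can also modify the latents in place to steer generation mid-run.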