Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Implementation:Huggingface Diffusers ControlNet Pipeline Call

From Leeroopedia
Property Value
Implementation Name StableDiffusionControlNetPipeline.__call__
Type API Doc
Workflow ControlNet_Guided_Generation
Related Principle Huggingface_Diffusers_Conditioning_Scale_Control
Source File src/diffusers/pipelines/controlnet/pipeline_controlnet.py
Lines L909-L1336
Status Active
Implements Principle:Huggingface_Diffusers_Conditioning_Scale_Control

API Signature

@torch.no_grad()
def __call__(
    self,
    prompt: str | list[str] = None,
    image: PipelineImageInput = None,
    height: int | None = None,
    width: int | None = None,
    num_inference_steps: int = 50,
    timesteps: list[int] = None,
    sigmas: list[float] = None,
    guidance_scale: float = 7.5,
    negative_prompt: str | list[str] | None = None,
    num_images_per_prompt: int | None = 1,
    eta: float = 0.0,
    generator: torch.Generator | list[torch.Generator] | None = None,
    latents: torch.Tensor | None = None,
    prompt_embeds: torch.Tensor | None = None,
    negative_prompt_embeds: torch.Tensor | None = None,
    ip_adapter_image: PipelineImageInput | None = None,
    ip_adapter_image_embeds: list[torch.Tensor] | None = None,
    output_type: str | None = "pil",
    return_dict: bool = True,
    cross_attention_kwargs: dict[str, Any] | None = None,
    controlnet_conditioning_scale: float | list[float] = 1.0,
    guess_mode: bool = False,
    control_guidance_start: float | list[float] = 0.0,
    control_guidance_end: float | list[float] = 1.0,
    clip_skip: int | None = None,
    callback_on_step_end: Callable | PipelineCallback | MultiPipelineCallbacks | None = None,
    callback_on_step_end_tensor_inputs: list[str] = ["latents"],
    **kwargs,
) -> StableDiffusionPipelineOutput | tuple:

Class: StableDiffusionControlNetPipeline

Import:

from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

Key Parameters

Parameter Type Default Description
prompt str | list[str] None Text prompt(s) guiding generation. Required unless prompt_embeds is provided.
image PipelineImageInput None ControlNet conditioning image(s). For MultiControlNet, pass a list of images.
height int | None Auto Output image height. Defaults to unet.config.sample_size * vae_scale_factor.
width int | None Auto Output image width. Defaults to unet.config.sample_size * vae_scale_factor.
num_inference_steps int 50 Number of denoising steps.
guidance_scale float 7.5 Classifier-free guidance scale. Higher values increase text adherence.
controlnet_conditioning_scale float | list[float] 1.0 ControlNet output multiplier. Controls spatial conditioning strength. Pass a list to set a per-ControlNet scale with MultiControlNet.
guess_mode bool False When enabled, the ControlNet infers the image content from the conditioning image alone, without relying on the text prompt. A guidance_scale between 3.0 and 5.0 is recommended.
control_guidance_start float | list[float] 0.0 Fraction of total steps at which the ControlNet starts applying. Pass a list for per-ControlNet values with MultiControlNet.
control_guidance_end float | list[float] 1.0 Fraction of total steps at which the ControlNet stops applying. Pass a list for per-ControlNet values with MultiControlNet.
negative_prompt list[str] | None None Negative prompt for classifier-free guidance.
num_images_per_prompt int 1 Number of images per prompt.
clip_skip int | None None Number of final CLIP layers to skip when computing prompt embeddings.

Return Value

Type Description
StableDiffusionPipelineOutput Contains .images (list of PIL Images or numpy arrays) and .nsfw_content_detected.
tuple When return_dict=False, returns (images, nsfw_content_detected).

Execution Flow

The __call__ method executes the following stages:

1. Input Validation and Setup

# Align control guidance formats for single/multi ControlNet
if not isinstance(control_guidance_start, list) and isinstance(control_guidance_end, list):
    control_guidance_start = len(control_guidance_end) * [control_guidance_start]
# ... similar alignment for other combinations

# Validate all inputs
self.check_inputs(prompt, image, ..., controlnet_conditioning_scale,
                  control_guidance_start, control_guidance_end, ...)

2. Prompt Encoding

Text prompts are encoded via CLIP and optionally concatenated with negative prompt embeddings for CFG.

3. Control Image Preparation

# Single ControlNet
if isinstance(controlnet, ControlNetModel):
    image = self.prepare_image(
        image=image, width=width, height=height,
        batch_size=batch_size * num_images_per_prompt,
        num_images_per_prompt=num_images_per_prompt,
        device=device, dtype=controlnet.dtype,
        do_classifier_free_guidance=self.do_classifier_free_guidance,
        guess_mode=guess_mode,
    )

# MultiControlNet -- prepare each image separately
elif isinstance(controlnet, MultiControlNetModel):
    images = []
    for image_ in image:
        image_ = self.prepare_image(image=image_, ...)
        images.append(image_)
    image = images

4. Timestep and Latent Preparation

Timesteps are retrieved from the scheduler. Random latent noise is generated or user-provided latents are scaled.

5. ControlNet Keep Mask Computation

controlnet_keep = []
for i in range(len(timesteps)):
    keeps = [
        1.0 - float(i / len(timesteps) < s or (i + 1) / len(timesteps) > e)
        for s, e in zip(control_guidance_start, control_guidance_end)
    ]
    controlnet_keep.append(keeps[0] if isinstance(controlnet, ControlNetModel) else keeps)

6. Denoising Loop

for i, t in enumerate(timesteps):
    # Expand latents for CFG
    latent_model_input = torch.cat([latents] * 2) if self.do_classifier_free_guidance else latents
    latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

    # Guess mode: run ControlNet only on conditional batch
    if guess_mode and self.do_classifier_free_guidance:
        control_model_input = latents
        control_model_input = self.scheduler.scale_model_input(control_model_input, t)
        controlnet_prompt_embeds = prompt_embeds.chunk(2)[1]
    else:
        control_model_input = latent_model_input
        controlnet_prompt_embeds = prompt_embeds

    # Compute effective scale with keep mask
    if isinstance(controlnet_keep[i], list):
        cond_scale = [c * s for c, s in zip(controlnet_conditioning_scale, controlnet_keep[i])]
    else:
        controlnet_cond_scale = controlnet_conditioning_scale
        if isinstance(controlnet_cond_scale, list):
            controlnet_cond_scale = controlnet_cond_scale[0]
        cond_scale = controlnet_cond_scale * controlnet_keep[i]

    # ControlNet forward pass
    down_block_res_samples, mid_block_res_sample = self.controlnet(
        control_model_input, t,
        encoder_hidden_states=controlnet_prompt_embeds,
        controlnet_cond=image,
        conditioning_scale=cond_scale,
        guess_mode=guess_mode,
        return_dict=False,
    )

    # Guess mode: pad unconditional batch with zeros
    if guess_mode and self.do_classifier_free_guidance:
        down_block_res_samples = [torch.cat([torch.zeros_like(d), d]) for d in down_block_res_samples]
        mid_block_res_sample = torch.cat([torch.zeros_like(mid_block_res_sample), mid_block_res_sample])

    # UNet forward with ControlNet residuals
    noise_pred = self.unet(
        latent_model_input, t,
        encoder_hidden_states=prompt_embeds,
        down_block_additional_residuals=down_block_res_samples,
        mid_block_additional_residual=mid_block_res_sample,
        return_dict=False,
    )[0]

    # Classifier-free guidance
    if self.do_classifier_free_guidance:
        noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
        noise_pred = noise_pred_uncond + self.guidance_scale * (noise_pred_text - noise_pred_uncond)

    # Scheduler step
    latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs, return_dict=False)[0]

Source: src/diffusers/pipelines/controlnet/pipeline_controlnet.py, lines 1250-1336.

Usage Examples

Basic Generation with Conditioning Scale

import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet, torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()

# Strong conditioning
result = pipe(
    "a beautiful landscape",
    image=canny_image,
    controlnet_conditioning_scale=1.0,
    num_inference_steps=30,
).images[0]

# Soft conditioning -- more creative freedom
result_soft = pipe(
    "a beautiful landscape",
    image=canny_image,
    controlnet_conditioning_scale=0.5,
    num_inference_steps=30,
).images[0]

Temporal Scheduling

# ControlNet active only for the first half of denoising
result = pipe(
    "a futuristic city",
    image=canny_image,
    controlnet_conditioning_scale=1.0,
    control_guidance_start=0.0,
    control_guidance_end=0.5,
    num_inference_steps=30,
).images[0]

Guess Mode

# ControlNet recognizes content without relying on the prompt
result = pipe(
    "",  # Empty or minimal prompt
    image=canny_image,
    guess_mode=True,
    guidance_scale=3.5,
    num_inference_steps=30,
).images[0]

MultiControlNet with Independent Scales

from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

controlnet_canny = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
controlnet_depth = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=[controlnet_canny, controlnet_depth],
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()

result = pipe(
    "a detailed scene",
    image=[canny_image, depth_image],
    controlnet_conditioning_scale=[0.8, 0.4],
    control_guidance_start=[0.0, 0.0],
    control_guidance_end=[0.5, 1.0],
    num_inference_steps=30,
).images[0]

Notes

  • When guidance_scale <= 1 and unet.config.time_cond_proj_dim is None, classifier-free guidance is disabled, and the control image is not duplicated along the batch dimension.
  • The pipeline supports callback_on_step_end for intercepting intermediate results, including the control image tensor via callback_on_step_end_tensor_inputs=["latents", "image"].
  • For MultiControlNetModel, a single float controlnet_conditioning_scale is automatically broadcast to all ControlNets.

Related Pages

Requires Environment

Uses Heuristic

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment