# Implementation:Huggingface Diffusers ControlNet Pipeline Call
| Property | Value |
|---|---|
| Implementation Name | StableDiffusionControlNetPipeline.__call__ |
| Type | API Doc |
| Workflow | ControlNet_Guided_Generation |
| Related Principle | Huggingface_Diffusers_Conditioning_Scale_Control |
| Source File | src/diffusers/pipelines/controlnet/pipeline_controlnet.py |
| Lines | L909-L1336 |
| Status | Active |
| Implements | Principle:Huggingface_Diffusers_Conditioning_Scale_Control |
## API Signature

```python
@torch.no_grad()
def __call__(
    self,
    prompt: str | list[str] = None,
    image: PipelineImageInput = None,
    height: int | None = None,
    width: int | None = None,
    num_inference_steps: int = 50,
    timesteps: list[int] = None,
    sigmas: list[float] = None,
    guidance_scale: float = 7.5,
    negative_prompt: str | list[str] | None = None,
    num_images_per_prompt: int | None = 1,
    eta: float = 0.0,
    generator: torch.Generator | list[torch.Generator] | None = None,
    latents: torch.Tensor | None = None,
    prompt_embeds: torch.Tensor | None = None,
    negative_prompt_embeds: torch.Tensor | None = None,
    ip_adapter_image: PipelineImageInput | None = None,
    ip_adapter_image_embeds: list[torch.Tensor] | None = None,
    output_type: str | None = "pil",
    return_dict: bool = True,
    cross_attention_kwargs: dict[str, Any] | None = None,
    controlnet_conditioning_scale: float | list[float] = 1.0,
    guess_mode: bool = False,
    control_guidance_start: float | list[float] = 0.0,
    control_guidance_end: float | list[float] = 1.0,
    clip_skip: int | None = None,
    callback_on_step_end: Callable | PipelineCallback | MultiPipelineCallbacks | None = None,
    callback_on_step_end_tensor_inputs: list[str] = ["latents"],
    **kwargs,
) -> StableDiffusionPipelineOutput | tuple:
```
Class: `StableDiffusionControlNetPipeline`

Import:

```python
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
```
## Key Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| prompt | str \| list[str] | None | Text prompt(s) guiding generation. Required unless prompt_embeds is provided. |
| image | PipelineImageInput | None | ControlNet conditioning image(s). For MultiControlNet, pass a list of images. |
| height | int \| None | Auto | Output image height. Defaults to unet.config.sample_size * vae_scale_factor. |
| width | int \| None | Auto | Output image width. Defaults to unet.config.sample_size * vae_scale_factor. |
| num_inference_steps | int | 50 | Number of denoising steps. |
| guidance_scale | float | 7.5 | Classifier-free guidance scale. Higher values increase text adherence. |
| controlnet_conditioning_scale | float \| list[float] | 1.0 | ControlNet output multiplier. Controls spatial conditioning strength. List for MultiControlNet. |
| guess_mode | bool | False | ControlNet recognizes content without text prompts. Recommended guidance_scale 3.0-5.0. |
| control_guidance_start | float \| list[float] | 0.0 | Fraction of steps at which ControlNet starts. List for MultiControlNet. |
| control_guidance_end | float \| list[float] | 1.0 | Fraction of steps at which ControlNet stops. List for MultiControlNet. |
| negative_prompt | str \| list[str] \| None | None | Negative prompt for classifier-free guidance. |
| num_images_per_prompt | int | 1 | Number of images per prompt. |
| clip_skip | int \| None | None | Number of CLIP layers to skip for prompt encoding. |
## Return Value

| Type | Description |
|---|---|
| StableDiffusionPipelineOutput | Contains .images (list of PIL Images or numpy arrays) and .nsfw_content_detected. |
| tuple | When return_dict=False, returns (images, nsfw_content_detected). |
## Execution Flow

The `__call__` method executes the following stages:

1. Input Validation and Setup

```python
# Align control guidance formats for single/multi ControlNet
if not isinstance(control_guidance_start, list) and isinstance(control_guidance_end, list):
    control_guidance_start = len(control_guidance_end) * [control_guidance_start]
# ... similar alignment for other combinations

# Validate all inputs
self.check_inputs(prompt, image, ..., controlnet_conditioning_scale,
                  control_guidance_start, control_guidance_end, ...)
```
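The alignment in step 1 can be sketched as a standalone helper. This is an illustrative reconstruction (the function name `align_guidance` is invented here), not the pipeline's exact code:

```python
def align_guidance(start, end):
    # Replicate a scalar start/end so both sides end up as
    # equal-length lists, one entry per ControlNet.
    if not isinstance(start, list) and isinstance(end, list):
        start = len(end) * [start]
    elif isinstance(start, list) and not isinstance(end, list):
        end = len(start) * [end]
    elif not isinstance(start, list) and not isinstance(end, list):
        start, end = [start], [end]
    return start, end

print(align_guidance(0.0, [0.5, 1.0]))  # ([0.0, 0.0], [0.5, 1.0])
```

After this step, `control_guidance_start` and `control_guidance_end` always zip together cleanly in the keep-mask computation below.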
2. Prompt Encoding
Text prompts are encoded via CLIP and optionally concatenated with negative prompt embeddings for CFG.
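For classifier-free guidance, the negative embeddings are concatenated ahead of the positive ones; this ordering is why the denoising loop can recover the conditional half with `chunk(2)[1]` and split predictions into `(uncond, text)`. A list-based sketch of that layout, with strings standing in for embedding tensors:

```python
# Illustrative stand-ins for embedding tensors.
negative_embeds = ["neg_0"]
positive_embeds = ["pos_0"]

# CFG batch: unconditional half first, conditional half second.
prompt_embeds = negative_embeds + positive_embeds

# Equivalent of prompt_embeds.chunk(2)[1] -- the conditional half.
conditional_half = prompt_embeds[len(prompt_embeds) // 2:]
print(conditional_half)  # ['pos_0']
```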
3. Control Image Preparation

```python
# Single ControlNet
if isinstance(controlnet, ControlNetModel):
    image = self.prepare_image(
        image=image, width=width, height=height,
        batch_size=batch_size * num_images_per_prompt,
        num_images_per_prompt=num_images_per_prompt,
        device=device, dtype=controlnet.dtype,
        do_classifier_free_guidance=self.do_classifier_free_guidance,
        guess_mode=guess_mode,
    )
# MultiControlNet -- prepare each image separately
elif isinstance(controlnet, MultiControlNetModel):
    images = []
    for image_ in image:
        image_ = self.prepare_image(image=image_, ...)
        images.append(image_)
    image = images
```
4. Timestep and Latent Preparation
Timesteps are retrieved from the scheduler. Random latent noise is generated or user-provided latents are scaled.
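The scaling of fresh latents amounts to multiplying Gaussian noise by the scheduler's `init_noise_sigma` so the first denoising step sees the variance the scheduler expects. As scalar arithmetic (a hedged sketch with made-up numbers and an invented helper name):

```python
def scale_initial_latents(noise, init_noise_sigma):
    # Fresh Gaussian noise is scaled by the scheduler's initial sigma.
    return [x * init_noise_sigma for x in noise]

print(scale_initial_latents([1.0, -0.5], 2.0))  # [2.0, -1.0]
```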
5. ControlNet Keep Mask Computation

```python
controlnet_keep = []
for i in range(len(timesteps)):
    keeps = [
        1.0 - float(i / len(timesteps) < s or (i + 1) / len(timesteps) > e)
        for s, e in zip(control_guidance_start, control_guidance_end)
    ]
    controlnet_keep.append(keeps[0] if isinstance(controlnet, ControlNetModel) else keeps)
```
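Extracting the mask formula into a helper makes the temporal window concrete: with 10 steps and `control_guidance_end=0.5`, the ControlNet is active for exactly the first five steps. The helper name is invented for illustration; the formula is the same one used above (single-ControlNet case):

```python
def controlnet_keep_mask(num_steps, start, end):
    # 1.0 while the current step fraction lies inside [start, end],
    # 0.0 otherwise.
    return [
        1.0 - float(i / num_steps < start or (i + 1) / num_steps > end)
        for i in range(num_steps)
    ]

print(controlnet_keep_mask(10, 0.0, 0.5))
# [1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

The mask multiplies `controlnet_conditioning_scale` in the loop, so a zero entry disables the ControlNet entirely for that step.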
6. Denoising Loop

```python
for i, t in enumerate(timesteps):
    # Expand latents for CFG
    latent_model_input = torch.cat([latents] * 2) if self.do_classifier_free_guidance else latents
    latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

    # Guess mode: run ControlNet only on conditional batch
    if guess_mode and self.do_classifier_free_guidance:
        control_model_input = latents
        control_model_input = self.scheduler.scale_model_input(control_model_input, t)
        controlnet_prompt_embeds = prompt_embeds.chunk(2)[1]
    else:
        control_model_input = latent_model_input
        controlnet_prompt_embeds = prompt_embeds

    # Compute effective scale with keep mask
    if isinstance(controlnet_keep[i], list):
        cond_scale = [c * s for c, s in zip(controlnet_conditioning_scale, controlnet_keep[i])]
    else:
        controlnet_cond_scale = controlnet_conditioning_scale
        if isinstance(controlnet_cond_scale, list):
            controlnet_cond_scale = controlnet_cond_scale[0]
        cond_scale = controlnet_cond_scale * controlnet_keep[i]

    # ControlNet forward pass
    down_block_res_samples, mid_block_res_sample = self.controlnet(
        control_model_input, t,
        encoder_hidden_states=controlnet_prompt_embeds,
        controlnet_cond=image,
        conditioning_scale=cond_scale,
        guess_mode=guess_mode,
        return_dict=False,
    )

    # Guess mode: pad unconditional batch with zeros
    if guess_mode and self.do_classifier_free_guidance:
        down_block_res_samples = [torch.cat([torch.zeros_like(d), d]) for d in down_block_res_samples]
        mid_block_res_sample = torch.cat([torch.zeros_like(mid_block_res_sample), mid_block_res_sample])

    # UNet forward with ControlNet residuals
    noise_pred = self.unet(
        latent_model_input, t,
        encoder_hidden_states=prompt_embeds,
        down_block_additional_residuals=down_block_res_samples,
        mid_block_additional_residual=mid_block_res_sample,
        return_dict=False,
    )[0]

    # Classifier-free guidance
    if self.do_classifier_free_guidance:
        noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
        noise_pred = noise_pred_uncond + self.guidance_scale * (noise_pred_text - noise_pred_uncond)

    # Scheduler step
    latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs, return_dict=False)[0]
```
Source: src/diffusers/pipelines/controlnet/pipeline_controlnet.py, lines 1250-1336.
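The classifier-free guidance combination at the end of the loop is elementwise arithmetic on the two noise predictions. Per element, with made-up scalar values (a hedged sketch; the helper name is invented):

```python
def cfg_combine(uncond, text, guidance_scale):
    # noise_pred = uncond + scale * (text - uncond):
    # the unconditional prediction, pushed toward the text-conditional
    # one by the guidance scale.
    return uncond + guidance_scale * (text - uncond)

print(cfg_combine(1.0, 3.0, 7.5))  # 1.0 + 7.5 * (3.0 - 1.0) = 16.0
```

At `guidance_scale=1.0` the result equals the conditional prediction; larger values extrapolate past it, which is why higher scales increase text adherence.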
## Usage Examples

### Basic Generation with Conditioning Scale

```python
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet, torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()

# Strong conditioning
result = pipe(
    "a beautiful landscape",
    image=canny_image,
    controlnet_conditioning_scale=1.0,
    num_inference_steps=30,
).images[0]

# Soft conditioning -- more creative freedom
result_soft = pipe(
    "a beautiful landscape",
    image=canny_image,
    controlnet_conditioning_scale=0.5,
    num_inference_steps=30,
).images[0]
```
### Temporal Scheduling

```python
# ControlNet active only for the first half of denoising
result = pipe(
    "a futuristic city",
    image=canny_image,
    controlnet_conditioning_scale=1.0,
    control_guidance_start=0.0,
    control_guidance_end=0.5,
    num_inference_steps=30,
).images[0]
```
### Guess Mode

```python
# ControlNet recognizes content without relying on the prompt
result = pipe(
    "",  # Empty or minimal prompt
    image=canny_image,
    guess_mode=True,
    guidance_scale=3.5,
    num_inference_steps=30,
).images[0]
```
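In guess mode the ControlNet residuals exist only for the conditional half of the batch, so the loop zero-pads the unconditional half before the UNet call (the `torch.cat([torch.zeros_like(d), d])` step above). With lists standing in for tensors, and an invented helper name, the padding looks like:

```python
def pad_unconditional(cond_residual):
    # Prepend zeros for the unconditional batch half, mirroring
    # torch.cat([torch.zeros_like(d), d]) in the denoising loop.
    return [0.0] * len(cond_residual) + cond_residual

print(pad_unconditional([0.7, 0.3]))  # [0.0, 0.0, 0.7, 0.3]
```

The effect is that CFG's unconditional branch sees no spatial conditioning at all, which is what lets guess mode work with an empty prompt.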
### MultiControlNet with Independent Scales

```python
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

controlnet_canny = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
controlnet_depth = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=[controlnet_canny, controlnet_depth],
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()

result = pipe(
    "a detailed scene",
    image=[canny_image, depth_image],
    controlnet_conditioning_scale=[0.8, 0.4],
    control_guidance_start=[0.0, 0.0],
    control_guidance_end=[0.5, 1.0],
    num_inference_steps=30,
).images[0]
```
## Notes

- When `guidance_scale <= 1` and `unet.config.time_cond_proj_dim is None`, classifier-free guidance is disabled, and the control image is not duplicated along the batch dimension.
- The pipeline supports `callback_on_step_end` for intercepting intermediate results, including the control image tensor via `callback_on_step_end_tensor_inputs=["latents", "image"]`.
- For `MultiControlNetModel`, a single float `controlnet_conditioning_scale` is automatically broadcast to all ControlNets.
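The broadcast of a single float to multiple ControlNets can be sketched as follows (an illustrative helper with an invented name, not the library's code):

```python
def broadcast_scale(scale, num_controlnets):
    # A single float becomes one identical scale per ControlNet;
    # a list is used as-is.
    if isinstance(scale, float):
        return [scale] * num_controlnets
    return list(scale)

print(broadcast_scale(1.0, 2))        # [1.0, 1.0]
print(broadcast_scale([0.8, 0.4], 2)) # [0.8, 0.4]
```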
## Related Pages
- Huggingface_Diffusers_Conditioning_Scale_Control -- Principle: theory of conditioning strength and timing control
- Huggingface_Diffusers_Prepare_Control_Image -- How the control image is prepared before entering the loop
- Huggingface_Diffusers_ControlNetModel_Forward -- The ControlNet forward pass called within the loop
- Huggingface_Diffusers_ControlNet_Img2Img_Pipeline -- Variant pipeline for img2img with ControlNet