Implementation:Huggingface Diffusers ControlNet Img2Img Pipeline
| Property | Value |
|---|---|
| Implementation Name | ControlNet Img2Img and Inpaint Pipelines |
| Type | API Doc |
| Workflow | ControlNet_Guided_Generation |
| Related Principle | Huggingface_Diffusers_ControlNet_Output_Refinement |
| Source File (Img2Img) | src/diffusers/pipelines/controlnet/pipeline_controlnet_img2img.py |
| Source File (Inpaint) | src/diffusers/pipelines/controlnet/pipeline_controlnet_inpaint.py |
| Status | Active |
| Implements | Principle:Huggingface_Diffusers_ControlNet_Output_Refinement |
API Signatures
StableDiffusionControlNetImg2ImgPipeline.__call__

    @torch.no_grad()
    def __call__(
        self,
        prompt: str | list[str] = None,
        image: PipelineImageInput = None,
        control_image: PipelineImageInput = None,
        height: int | None = None,
        width: int | None = None,
        strength: float = 0.8,
        num_inference_steps: int = 50,
        guidance_scale: float = 7.5,
        negative_prompt: str | list[str] | None = None,
        num_images_per_prompt: int | None = 1,
        eta: float = 0.0,
        generator: torch.Generator | list[torch.Generator] | None = None,
        latents: torch.Tensor | None = None,
        prompt_embeds: torch.Tensor | None = None,
        negative_prompt_embeds: torch.Tensor | None = None,
        ip_adapter_image: PipelineImageInput | None = None,
        ip_adapter_image_embeds: list[torch.Tensor] | None = None,
        output_type: str | None = "pil",
        return_dict: bool = True,
        cross_attention_kwargs: dict[str, Any] | None = None,
        controlnet_conditioning_scale: float | list[float] = 0.8,
        guess_mode: bool = False,
        control_guidance_start: float | list[float] = 0.0,
        control_guidance_end: float | list[float] = 1.0,
        clip_skip: int | None = None,
        callback_on_step_end: Callable | PipelineCallback | MultiPipelineCallbacks | None = None,
        callback_on_step_end_tensor_inputs: list[str] = ["latents"],
        **kwargs,
    ) -> StableDiffusionPipelineOutput | tuple:

StableDiffusionControlNetInpaintPipeline.__call__

    @torch.no_grad()
    def __call__(
        self,
        prompt: str | list[str] = None,
        image: PipelineImageInput = None,
        mask_image: PipelineImageInput = None,
        control_image: PipelineImageInput = None,
        height: int | None = None,
        width: int | None = None,
        padding_mask_crop: int | None = None,
        strength: float = 1.0,
        num_inference_steps: int = 50,
        guidance_scale: float = 7.5,
        negative_prompt: str | list[str] | None = None,
        num_images_per_prompt: int | None = 1,
        eta: float = 0.0,
        generator: torch.Generator | list[torch.Generator] | None = None,
        latents: torch.Tensor | None = None,
        prompt_embeds: torch.Tensor | None = None,
        negative_prompt_embeds: torch.Tensor | None = None,
        ip_adapter_image: PipelineImageInput | None = None,
        ip_adapter_image_embeds: list[torch.Tensor] | None = None,
        output_type: str | None = "pil",
        return_dict: bool = True,
        cross_attention_kwargs: dict[str, Any] | None = None,
        controlnet_conditioning_scale: float | list[float] = 0.5,
        guess_mode: bool = False,
        control_guidance_start: float | list[float] = 0.0,
        control_guidance_end: float | list[float] = 1.0,
        clip_skip: int | None = None,
        callback_on_step_end: Callable | PipelineCallback | MultiPipelineCallbacks | None = None,
        callback_on_step_end_tensor_inputs: list[str] = ["latents"],
        **kwargs,
    ) -> StableDiffusionPipelineOutput | tuple:

Import:

    from diffusers import (
        ControlNetModel,
        StableDiffusionControlNetImg2ImgPipeline,
        StableDiffusionControlNetInpaintPipeline,
    )

Key Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| prompt | str or list[str] | None | Text prompt(s) guiding generation. |
| image | PipelineImageInput | None | Reference/source image for img2img or inpainting. |
| control_image | PipelineImageInput | None | ControlNet conditioning image (edges, depth, pose, etc.). |
| strength | float | 0.8 (img2img) / 1.0 (inpaint) | Denoising strength between 0 and 1. Higher values add more noise to the reference image and allow more transformation. |
| controlnet_conditioning_scale | float or list[float] | 0.8 (img2img) / 0.5 (inpaint) | Multiplier applied to the ControlNet outputs before they are added to the UNet residuals. Pass a list to set one scale per ControlNet. |
| guidance_scale | float | 7.5 | Classifier-free guidance scale. |
| guess_mode | bool | False | If True, the ControlNet tries to recognize the content of the control image even without a text prompt. |
| control_guidance_start | float or list[float] | 0.0 | Fraction of the total steps at which the ControlNet starts applying. |
| control_guidance_end | float or list[float] | 1.0 | Fraction of the total steps at which the ControlNet stops applying. |
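
The control_guidance_start and control_guidance_end values can be read as a per-step on/off window for the ControlNet. The following is a minimal sketch of that idea in plain Python, assuming a single ControlNet; the helper name controlnet_keep_schedule is hypothetical and is not part of the Diffusers API.

```python
def controlnet_keep_schedule(num_steps, start=0.0, end=1.0):
    """Return a 0.0/1.0 multiplier per denoising step for one ControlNet.

    Hypothetical helper for illustration only: a step is switched off when
    it begins before `start` or finishes after `end` (both are fractions of
    the total number of steps).
    """
    keep = []
    for i in range(num_steps):
        skipped = (i / num_steps) < start or ((i + 1) / num_steps) > end
        keep.append(0.0 if skipped else 1.0)
    return keep


# With 10 steps and a window of [0.2, 0.8], the first two and last two
# steps run without ControlNet guidance.
schedule = controlnet_keep_schedule(num_steps=10, start=0.2, end=0.8)
print(schedule)
```

In this sketch the multiplier would scale controlnet_conditioning_scale at each step, so a value of 0.0 disables the ControlNet for that step.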
Inpaint-Specific Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| mask_image | PipelineImageInput | None | Binary mask where white pixels indicate regions to repaint and black pixels are preserved. |
| padding_mask_crop | int or None | None | Crop padding around the mask for higher-resolution inpainting of small regions. |
Return Value
| Type | Description |
|---|---|
| StableDiffusionPipelineOutput | Contains .images (list of PIL Images or numpy arrays) and .nsfw_content_detected. |
| tuple | When return_dict=False: (images, nsfw_content_detected). |
Execution Flow: Img2Img Pipeline
The img2img pipeline differs from text-to-image in these key steps:
1. Dual Image Processing

    # Step 4: Preprocess the reference image (for latent initialization)
    image = self.image_processor.preprocess(image, height=height, width=width).to(dtype=torch.float32)
    # Step 5: Prepare the ControlNet conditioning image (separate from reference)
    control_image = self.prepare_control_image(
        image=control_image,
        width=width,
        height=height,
        batch_size=batch_size * num_images_per_prompt,
        num_images_per_prompt=num_images_per_prompt,
        device=device,
        dtype=controlnet.dtype,
        do_classifier_free_guidance=self.do_classifier_free_guidance,
        guess_mode=guess_mode,
    )

Source: src/diffusers/pipelines/controlnet/pipeline_controlnet_img2img.py, lines 1146-1181.
2. Truncated Timestep Schedule

    def get_timesteps(self, num_inference_steps, strength, device):
        # Number of steps that will actually be run, derived from strength
        init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
        # Index of the first step to run; earlier (noisier) steps are skipped
        t_start = max(num_inference_steps - init_timestep, 0)
        timesteps = self.scheduler.timesteps[t_start * self.scheduler.order :]
        return timesteps, num_inference_steps - t_start

Source: src/diffusers/pipelines/controlnet/pipeline_controlnet_img2img.py, lines 801-810.
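
To make the effect of strength concrete, the arithmetic in get_timesteps can be reproduced without a scheduler. This is a small sketch, assuming a first-order scheduler (scheduler.order == 1); effective_steps is a hypothetical helper name used only for this illustration.

```python
def effective_steps(num_inference_steps, strength):
    """Return (skipped_steps, executed_steps) for a given strength.

    Mirrors the arithmetic of get_timesteps; the scheduler slicing itself
    is left out (assumption: scheduler.order == 1).
    """
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    return t_start, num_inference_steps - t_start


for strength in (0.3, 0.8, 1.0):
    skipped, executed = effective_steps(50, strength)
    print(f"strength={strength}: skip {skipped} steps, run {executed} steps")
```

With num_inference_steps=50, a strength of 0.8 runs 40 denoising steps and skips the first 10, while a strength of 1.0 runs all 50.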
3. Latent Initialization from Reference Image

    def prepare_latents(self, image, timestep, batch_size, num_images_per_prompt, dtype, device, generator=None):
        image = image.to(device=device, dtype=dtype)
        batch_size = batch_size * num_images_per_prompt
        if image.shape[1] == 4:
            init_latents = image  # Already in latent space
        else:
            # Encode to latent space via VAE
            init_latents = retrieve_latents(self.vae.encode(image), generator=generator)
            init_latents = self.vae.config.scaling_factor * init_latents
        # Add noise at the starting timestep
        noise = randn_tensor(init_latents.shape, generator=generator, device=device, dtype=dtype)
        init_latents = self.scheduler.add_noise(init_latents, noise, timestep)
        return init_latents

Source: src/diffusers/pipelines/controlnet/pipeline_controlnet_img2img.py, lines 813-876.
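
The add_noise call above depends on the scheduler in use. For DDPM-style schedulers it is commonly the forward-diffusion mix of the clean latent and Gaussian noise. The sketch below shows that formula on scalars, as an assumption about the scheduler rather than a description of every scheduler; the function add_noise_scalar and the parameter alpha_bar_t (the cumulative product of alphas at the starting timestep) are illustrative names.

```python
import math


def add_noise_scalar(x0, noise, alpha_bar_t):
    """Forward-diffusion mix for a single value (illustrative only).

    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise
    alpha_bar_t close to 1 keeps the clean latent; close to 0 yields noise.
    """
    return math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * noise


# A low-noise timestep keeps most of the encoded reference image,
# a high-noise timestep is dominated by the random noise.
print(add_noise_scalar(x0=1.0, noise=0.0, alpha_bar_t=0.25))
print(add_noise_scalar(x0=2.0, noise=5.0, alpha_bar_t=1.0))
print(add_noise_scalar(x0=2.0, noise=5.0, alpha_bar_t=0.0))
```

This is why a lower strength, which starts at a later and less noisy timestep, preserves more of the reference image.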
Execution Flow: Inpaint Pipeline
1. Mask Processing

    def prepare_mask_latents(
        self, mask, masked_image, batch_size, height, width, dtype, device, generator, do_classifier_free_guidance
    ):
        # Resize mask to latent dimensions
        mask = torch.nn.functional.interpolate(
            mask, size=(height // self.vae_scale_factor, width // self.vae_scale_factor)
        )
        mask = mask.to(device=device, dtype=dtype)
        # Encode masked image to latent space
        masked_image = masked_image.to(device=device, dtype=dtype)
        if masked_image.shape[1] == 4:
            masked_image_latents = masked_image
        else:
            masked_image_latents = self._encode_vae_image(masked_image, generator=generator)
        # Duplicate for CFG
        mask = torch.cat([mask] * 2) if do_classifier_free_guidance else mask
        masked_image_latents = (
            torch.cat([masked_image_latents] * 2) if do_classifier_free_guidance else masked_image_latents
        )
        return mask, masked_image_latents

Source: src/diffusers/pipelines/controlnet/pipeline_controlnet_inpaint.py, lines 907-950.
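
The mask resize step shrinks the pixel-space mask to the latent grid. As a rough illustration, the sketch below downsamples a small mask with nearest-neighbor sampling in plain Python, assuming interpolate is used in its default nearest mode and that the image size is an exact multiple of the scale factor. The helper downsample_mask_nearest is hypothetical; the pipeline itself operates on 4D tensors.

```python
def downsample_mask_nearest(mask, factor):
    """Nearest-neighbor downsampling of a 2D mask by an integer factor.

    Illustrative stand-in for resizing a mask to latent resolution:
    each output cell takes the value at the top-left of its source block.
    """
    out_h = len(mask) // factor
    out_w = len(mask[0]) // factor
    return [[mask[r * factor][c * factor] for c in range(out_w)] for r in range(out_h)]


# 4x4 pixel mask (1 = repaint) reduced to a 2x2 "latent" mask.
mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
]
latent_mask = downsample_mask_nearest(mask, factor=2)
print(latent_mask)
```

With a vae_scale_factor of 8, as is typical for Stable Diffusion 1.5, a 512x512 mask becomes a 64x64 latent mask in the same way.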
2. Denoising with Mask Application
During each denoising step, the mask is applied to blend the original latents (in unmasked regions) with the newly generated latents (in masked regions). The ControlNet conditioning guides the structure of the generated content within the masked area.
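
The blending described above can be sketched as a per-element mix, shown here on flat Python lists rather than latent tensors. This is a simplified illustration: blend_latents is a hypothetical helper, and details such as re-noising the original latents to the current timestep are omitted.

```python
def blend_latents(original, generated, mask):
    """Keep original values where mask is 0 and generated values where mask is 1."""
    return [(1.0 - m) * o + m * g for o, g, m in zip(original, generated, mask)]


original = [0.1, 0.2, 0.3, 0.4]   # latents of the source image
generated = [0.9, 0.8, 0.7, 0.6]  # latents produced by the denoising step
mask = [0.0, 0.0, 1.0, 1.0]       # 1.0 marks the region to repaint

blended = blend_latents(original, generated, mask)
print(blended)
```

Only the last two positions, where the mask is 1.0, take the newly generated values; the rest stay identical to the original.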
Usage Examples
Image-to-Image with Canny ControlNet

    import cv2
    import numpy as np
    import torch
    from PIL import Image
    from diffusers import StableDiffusionControlNetImg2ImgPipeline, ControlNetModel, UniPCMultistepScheduler
    from diffusers.utils import load_image
    # Load source image
    source_image = load_image(
        "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
    )
    # Generate Canny edge map as control image
    np_image = np.array(source_image)
    canny_edges = cv2.Canny(np_image, 100, 200)
    # Expand the single-channel edge map to three channels
    canny_edges = canny_edges[:, :, None]
    canny_edges = np.concatenate([canny_edges, canny_edges, canny_edges], axis=2)
    canny_image = Image.fromarray(canny_edges)
    # Load pipeline
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
        "stable-diffusion-v1-5/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    )
    pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
    pipe.enable_model_cpu_offload()
    # Generate with structural guidance
    result = pipe(
        "futuristic-looking woman",
        image=source_image,  # Reference image for appearance
        control_image=canny_image,  # ControlNet conditioning for structure
        strength=0.8,
        num_inference_steps=20,
        controlnet_conditioning_scale=0.8,
        generator=torch.manual_seed(0),
    ).images[0]

Inpainting with ControlNet

    import torch
    from diffusers import StableDiffusionControlNetInpaintPipeline, ControlNetModel
    from diffusers.utils import load_image
    # Load images (placeholder URLs; replace with real image locations)
    source_image = load_image("https://example.com/original.png")
    mask_image = load_image("https://example.com/mask.png")  # White = repaint
    # The control image must match the condition type the chosen ControlNet was trained on
    control_image = load_image("https://example.com/edges.png")  # ControlNet condition
    # Load pipeline
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/control_v11p_sd15_inpaint", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
        "stable-diffusion-v1-5/stable-diffusion-inpainting",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    )
    pipe.enable_model_cpu_offload()
    # Inpaint with structural guidance
    result = pipe(
        "a wooden bench in a park",
        image=source_image,
        mask_image=mask_image,
        control_image=control_image,
        strength=1.0,
        num_inference_steps=30,
        controlnet_conditioning_scale=0.5,
    ).images[0]

Adjusting Strength for Different Effects

    # Assumes pipe_img2img is a loaded StableDiffusionControlNetImg2ImgPipeline,
    # photo is the reference image, and canny_of_photo is its Canny edge map
    # (built like pipe, source_image, and canny_image in the img2img example).
    # Subtle style transfer -- preserve most of the original image
    subtle_result = pipe_img2img(
        "oil painting style",
        image=photo,
        control_image=canny_of_photo,
        strength=0.3,
        controlnet_conditioning_scale=0.8,
        num_inference_steps=30,
    ).images[0]
    # Strong transformation -- major style change with structural preservation
    strong_result = pipe_img2img(
        "cyberpunk neon city",
        image=photo,
        control_image=canny_of_photo,
        strength=0.9,
        controlnet_conditioning_scale=1.0,
        num_inference_steps=30,
    ).images[0]

Pipeline Component Comparison
| Component | Text-to-Image | Img2Img | Inpaint |
|---|---|---|---|
| Pipeline Class | StableDiffusionControlNetPipeline | StableDiffusionControlNetImg2ImgPipeline | StableDiffusionControlNetInpaintPipeline |
| Input image param | N/A | image (reference) | image + mask_image |
| Control image param | image | control_image | control_image |
| Default strength | N/A | 0.8 | 1.0 |
| Default conditioning scale | 1.0 | 0.8 | 0.5 |
| Latent initialization | Random noise | Noised encoding of input | Noised encoding of masked input |
| Callback tensors | latents, prompt_embeds, image | latents, prompt_embeds, control_image | latents, prompt_embeds, control_image, mask, masked_image_latents |
Notes
- In the img2img pipeline, the parameter naming differs from text-to-image: the ControlNet condition is control_image (not image), since image is used for the reference image.
- The inpainting pipeline uses strength=1.0 by default (full repainting of masked regions), while img2img uses strength=0.8.
- The prepare_control_image method in the img2img pipeline is functionally identical to prepare_image in the text-to-image pipeline.
- Both pipelines inherit from DiffusionPipeline, StableDiffusionMixin, TextualInversionLoaderMixin, StableDiffusionLoraLoaderMixin, IPAdapterMixin, and FromSingleFileMixin.
Related Pages
- Huggingface_Diffusers_ControlNet_Output_Refinement -- Principle: theory of ControlNet-guided refinement
- Huggingface_Diffusers_ControlNet_Pipeline_Call -- The base text-to-image ControlNet pipeline
- Huggingface_Diffusers_Prepare_Control_Image -- How control images are prepared in these pipelines
- Huggingface_Diffusers_Conditioning_Scale_Control -- How conditioning scale interacts with strength