Implementation:Huggingface Diffusers ControlNet Img2Img Pipeline

Property	Value
Implementation Name	ControlNet Img2Img and Inpaint Pipelines
Type	API Doc
Workflow	ControlNet_Guided_Generation
Related Principle	Huggingface_Diffusers_ControlNet_Output_Refinement
Source File (Img2Img)	`src/diffusers/pipelines/controlnet/pipeline_controlnet_img2img.py`
Source File (Inpaint)	`src/diffusers/pipelines/controlnet/pipeline_controlnet_inpaint.py`
Status	Active
Implements	Principle:Huggingface_Diffusers_ControlNet_Output_Refinement

API Signatures

StableDiffusionControlNetImg2ImgPipeline.__call__

@torch.no_grad()
def __call__(
    self,
    prompt: str | list[str] = None,
    image: PipelineImageInput = None,
    control_image: PipelineImageInput = None,
    height: int | None = None,
    width: int | None = None,
    strength: float = 0.8,
    num_inference_steps: int = 50,
    guidance_scale: float = 7.5,
    negative_prompt: str | list[str] | None = None,
    num_images_per_prompt: int | None = 1,
    eta: float = 0.0,
    generator: torch.Generator | list[torch.Generator] | None = None,
    latents: torch.Tensor | None = None,
    prompt_embeds: torch.Tensor | None = None,
    negative_prompt_embeds: torch.Tensor | None = None,
    ip_adapter_image: PipelineImageInput | None = None,
    ip_adapter_image_embeds: list[torch.Tensor] | None = None,
    output_type: str | None = "pil",
    return_dict: bool = True,
    cross_attention_kwargs: dict[str, Any] | None = None,
    controlnet_conditioning_scale: float | list[float] = 0.8,
    guess_mode: bool = False,
    control_guidance_start: float | list[float] = 0.0,
    control_guidance_end: float | list[float] = 1.0,
    clip_skip: int | None = None,
    callback_on_step_end: Callable | PipelineCallback | MultiPipelineCallbacks | None = None,
    callback_on_step_end_tensor_inputs: list[str] = ["latents"],
    **kwargs,
) -> StableDiffusionPipelineOutput | tuple:

StableDiffusionControlNetInpaintPipeline.__call__

@torch.no_grad()
def __call__(
    self,
    prompt: str | list[str] = None,
    image: PipelineImageInput = None,
    mask_image: PipelineImageInput = None,
    control_image: PipelineImageInput = None,
    height: int | None = None,
    width: int | None = None,
    padding_mask_crop: int | None = None,
    strength: float = 1.0,
    num_inference_steps: int = 50,
    guidance_scale: float = 7.5,
    negative_prompt: str | list[str] | None = None,
    num_images_per_prompt: int | None = 1,
    eta: float = 0.0,
    generator: torch.Generator | list[torch.Generator] | None = None,
    latents: torch.Tensor | None = None,
    prompt_embeds: torch.Tensor | None = None,
    negative_prompt_embeds: torch.Tensor | None = None,
    ip_adapter_image: PipelineImageInput | None = None,
    ip_adapter_image_embeds: list[torch.Tensor] | None = None,
    output_type: str | None = "pil",
    return_dict: bool = True,
    cross_attention_kwargs: dict[str, Any] | None = None,
    controlnet_conditioning_scale: float | list[float] = 0.5,
    guess_mode: bool = False,
    control_guidance_start: float | list[float] = 0.0,
    control_guidance_end: float | list[float] = 1.0,
    clip_skip: int | None = None,
    callback_on_step_end: Callable | PipelineCallback | MultiPipelineCallbacks | None = None,
    callback_on_step_end_tensor_inputs: list[str] = ["latents"],
    **kwargs,
) -> StableDiffusionPipelineOutput | tuple:

Import:

from diffusers import (
    ControlNetModel,
    StableDiffusionControlNetImg2ImgPipeline,
    StableDiffusionControlNetInpaintPipeline,
)

Key Parameters

Parameters Shared Across Both Pipelines

Parameter	Type	Default	Description
`prompt`	`str | list[str]`	`None`	Text prompt(s) guiding generation.
`image`	`PipelineImageInput`	`None`	Reference/source image for img2img or inpainting.
`control_image`	`PipelineImageInput`	`None`	ControlNet conditioning image (edges, depth, pose, etc.).
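`strength`	`float`	`0.8` (img2img) / `1.0` (inpaint)	Denoising strength between 0 and 1. Higher values add more noise to the reference image and allow more transformation.
`controlnet_conditioning_scale`	`float | list[float]`	`0.8` (img2img) / `0.5` (inpaint)	Multiplier applied to the ControlNet outputs; pass a list to set one value per ControlNet.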
`guidance_scale`	`float`	`7.5`	Classifier-free guidance scale.
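`guess_mode`	`bool`	`False`	When `True`, the ControlNet encoder tries to recognize the content of the control image even if the prompt is empty.
`control_guidance_start`	`float | list[float]`	`0.0`	Fraction of the total steps at which ControlNet starts being applied.
`control_guidance_end`	`float | list[float]`	`1.0`	Fraction of the total steps at which ControlNet stops being applied.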
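
The `control_guidance_start` / `control_guidance_end` window can be read as a per-step on/off weight for the ControlNet output. The following is a minimal sketch of that gating for a single ControlNet. The helper name `controlnet_keep_schedule` is hypothetical, and the pipeline's own code may differ in detail.

```python
def controlnet_keep_schedule(num_steps: int, start: float = 0.0, end: float = 1.0) -> list[float]:
    """Return 1.0 for steps where ControlNet is applied and 0.0 otherwise.

    Hypothetical helper for illustration: step i is treated as covering the
    fraction [i / num_steps, (i + 1) / num_steps] of the denoising schedule.
    """
    keep = []
    for i in range(num_steps):
        outside = (i / num_steps < start) or ((i + 1) / num_steps > end)
        keep.append(0.0 if outside else 1.0)
    return keep


# ControlNet active for the first half of a 10-step run only
schedule = controlnet_keep_schedule(10, start=0.0, end=0.5)
print(schedule)
```

Under this reading, the per-step weight multiplies `controlnet_conditioning_scale`, so a step outside the window receives no ControlNet signal.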

Inpaint-Specific Parameters

Parameter	Type	Default	Description
`mask_image`	`PipelineImageInput`	`None`	Binary mask where white pixels indicate regions to repaint.
`padding_mask_crop`	`int | None`	`None`	Crop padding around mask for higher-resolution inpainting of small regions.

Return Value

Type	Description
`StableDiffusionPipelineOutput`	Contains `.images` (list of PIL Images or numpy arrays) and `.nsfw_content_detected`.
`tuple`	When `return_dict=False`: `(images, nsfw_content_detected)`.

Execution Flow: Img2Img Pipeline

The img2img pipeline differs from text-to-image in these key steps:

1. Dual Image Processing

# Step 4: Preprocess the reference image (for latent initialization)
image = self.image_processor.preprocess(image, height=height, width=width).to(dtype=torch.float32)

# Step 5: Prepare the ControlNet conditioning image (separate from reference)
control_image = self.prepare_control_image(
    image=control_image,
    width=width, height=height,
    batch_size=batch_size * num_images_per_prompt,
    num_images_per_prompt=num_images_per_prompt,
    device=device, dtype=controlnet.dtype,
    do_classifier_free_guidance=self.do_classifier_free_guidance,
    guess_mode=guess_mode,
)

Source: src/diffusers/pipelines/controlnet/pipeline_controlnet_img2img.py, lines 1146-1181.

2. Truncated Timestep Schedule

def get_timesteps(self, num_inference_steps, strength, device):
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    timesteps = self.scheduler.timesteps[t_start * self.scheduler.order :]
    return timesteps, num_inference_steps - t_start

Source: src/diffusers/pipelines/controlnet/pipeline_controlnet_img2img.py, lines 801-810.
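
To make the effect of `strength` concrete, the arithmetic in `get_timesteps` can be reproduced without a scheduler. This is a sketch only: the helper name `denoising_steps` is hypothetical and it ignores the `scheduler.order` multiplier used when slicing the timestep array.

```python
def denoising_steps(num_inference_steps: int, strength: float) -> tuple[int, int]:
    """Mirror the arithmetic of get_timesteps without a scheduler.

    Returns (t_start, steps_actually_run).
    """
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    return t_start, num_inference_steps - t_start


for strength in (0.25, 0.5, 1.0):
    t_start, steps_run = denoising_steps(40, strength)
    print(f"strength={strength}: skip {t_start} steps, run {steps_run}")
```

With 40 requested steps, `strength=0.25` skips the first 30 and runs only the last 10, which is why low strength values stay close to the reference image.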

3. Latent Initialization from Reference Image

def prepare_latents(self, image, timestep, batch_size, num_images_per_prompt, dtype, device, generator=None):
    image = image.to(device=device, dtype=dtype)
    batch_size = batch_size * num_images_per_prompt

    if image.shape[1] == 4:
        init_latents = image  # Already in latent space
    else:
        # Encode to latent space via VAE
        init_latents = retrieve_latents(self.vae.encode(image), generator=generator)
        init_latents = self.vae.config.scaling_factor * init_latents

    # Add noise at the starting timestep
    noise = randn_tensor(init_latents.shape, generator=generator, device=device, dtype=dtype)
    init_latents = self.scheduler.add_noise(init_latents, noise, timestep)
    return init_latents

Source: src/diffusers/pipelines/controlnet/pipeline_controlnet_img2img.py, lines 813-876.
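
For DDPM-style schedulers, `add_noise` mixes the clean latents with Gaussian noise using weights derived from the cumulative product of `1 - beta_t`. The sketch below computes those weights in plain Python. The function names are hypothetical, and the beta schedule values are assumptions chosen to resemble a typical Stable Diffusion v1 scheduler config; other schedulers may parameterize noise differently.

```python
import math


def scaled_linear_alphas_cumprod(num_train_timesteps=1000, beta_start=0.00085, beta_end=0.012):
    """Cumulative product of (1 - beta_t) for a scaled-linear beta schedule.

    The default values are assumptions for illustration; check the config
    of the scheduler you actually use.
    """
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    alphas_cumprod, running = [], 1.0
    for t in range(num_train_timesteps):
        # Interpolate linearly in sqrt(beta) space, then square
        beta = (lo + (hi - lo) * t / (num_train_timesteps - 1)) ** 2
        running *= 1.0 - beta
        alphas_cumprod.append(running)
    return alphas_cumprod


def noise_weights(alphas_cumprod, timestep):
    """Return (signal_weight, noise_weight) such that
    noisy = signal_weight * clean_latents + noise_weight * noise."""
    a_bar = alphas_cumprod[timestep]
    return math.sqrt(a_bar), math.sqrt(1.0 - a_bar)


alphas_cumprod = scaled_linear_alphas_cumprod()
for t in (100, 500, 900):
    signal, noise = noise_weights(alphas_cumprod, t)
    print(f"t={t}: signal={signal:.3f}, noise={noise:.3f}")
```

A larger starting timestep (higher `strength`) gives a smaller signal weight, so less of the reference image survives in the initial latents.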

Execution Flow: Inpaint Pipeline

1. Mask Processing

def prepare_mask_latents(self, mask, masked_image, batch_size, height, width, dtype, device, generator,
                         do_classifier_free_guidance):
    # Resize mask to latent dimensions
    mask = torch.nn.functional.interpolate(
        mask, size=(height // self.vae_scale_factor, width // self.vae_scale_factor)
    )
    mask = mask.to(device=device, dtype=dtype)

    # Encode masked image to latent space
    masked_image = masked_image.to(device=device, dtype=dtype)
    if masked_image.shape[1] == 4:
        masked_image_latents = masked_image
    else:
        masked_image_latents = self._encode_vae_image(masked_image, generator=generator)

    # Duplicate for CFG
    mask = torch.cat([mask] * 2) if do_classifier_free_guidance else mask
    masked_image_latents = torch.cat([masked_image_latents] * 2) if do_classifier_free_guidance else masked_image_latents

    return mask, masked_image_latents

Source: src/diffusers/pipelines/controlnet/pipeline_controlnet_inpaint.py, lines 907-950.

2. Denoising with Mask Application

During each denoising step, the mask is applied to blend the original latents (in unmasked regions) with the newly generated latents (in masked regions). The ControlNet conditioning guides the structure of the generated content within the masked area.
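
The blending described above amounts to a per-element interpolation between the two sets of latents. Here is a minimal sketch using flat Python lists in place of latent tensors. The helper name `blend_latents` is hypothetical; in the real pipeline the original latents are typically re-noised to the current timestep before blending, and the step may be skipped for UNets that take the mask as extra input channels.

```python
def blend_latents(mask, original, generated):
    """Keep `original` where mask == 0 and `generated` where mask == 1.

    All arguments are flat lists of equal length standing in for latent
    tensors; intermediate mask values interpolate between the two.
    """
    return [(1.0 - m) * o + m * g for m, o, g in zip(mask, original, generated)]


mask = [0.0, 0.0, 1.0, 1.0]            # 1.0 marks the region to repaint
original = [0.2, 0.4, 0.6, 0.8]        # latents of the source image
generated = [-1.0, -1.0, -1.0, -1.0]   # latents proposed by the denoiser

blended = blend_latents(mask, original, generated)
print(blended)  # [0.2, 0.4, -1.0, -1.0]
```

Unmasked positions keep the source latents unchanged, while masked positions take the generated values.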

Usage Examples

Image-to-Image with Canny ControlNet

import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetImg2ImgPipeline, ControlNetModel, UniPCMultistepScheduler
from diffusers.utils import load_image

# Load source image
source_image = load_image(
    "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
)

# Generate Canny edge map as control image
np_image = np.array(source_image)
canny_edges = cv2.Canny(np_image, 100, 200)
canny_edges = canny_edges[:, :, None]
canny_edges = np.concatenate([canny_edges, canny_edges, canny_edges], axis=2)
canny_image = Image.fromarray(canny_edges)

# Load pipeline
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

# Generate with structural guidance
result = pipe(
    "futuristic-looking woman",
    image=source_image,          # Reference image for appearance
    control_image=canny_image,   # ControlNet conditioning for structure
    strength=0.8,
    num_inference_steps=20,
    controlnet_conditioning_scale=0.8,
    generator=torch.manual_seed(0),
).images[0]

Inpainting with ControlNet

import torch
from PIL import Image
from diffusers import StableDiffusionControlNetInpaintPipeline, ControlNetModel
from diffusers.utils import load_image

# Load images
source_image = load_image("https://example.com/original.png")
mask_image = load_image("https://example.com/mask.png")      # White = repaint
control_image = load_image("https://example.com/edges.png")   # ControlNet condition; its type must match the ControlNet checkpoint loaded below

# Load pipeline
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_inpaint", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-inpainting",
    controlnet=controlnet,
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()

# Inpaint with structural guidance
result = pipe(
    "a wooden bench in a park",
    image=source_image,
    mask_image=mask_image,
    control_image=control_image,
    strength=1.0,
    num_inference_steps=30,
    controlnet_conditioning_scale=0.5,
).images[0]

Adjusting Strength for Different Effects

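# Assumes `pipe_img2img` is a loaded StableDiffusionControlNetImg2ImgPipeline,
# `photo` is the source image, and `canny_of_photo` is its Canny edge map
# (see the Canny example above).

# Subtle style transfer -- preserve most of the original image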
subtle_result = pipe_img2img(
    "oil painting style",
    image=photo,
    control_image=canny_of_photo,
    strength=0.3,
    controlnet_conditioning_scale=0.8,
    num_inference_steps=30,
).images[0]

# Strong transformation -- major style change with structural preservation
strong_result = pipe_img2img(
    "cyberpunk neon city",
    image=photo,
    control_image=canny_of_photo,
    strength=0.9,
    controlnet_conditioning_scale=1.0,
    num_inference_steps=30,
).images[0]

Pipeline Component Comparison

Component	Text-to-Image	Img2Img	Inpaint
Pipeline Class	`StableDiffusionControlNetPipeline`	`StableDiffusionControlNetImg2ImgPipeline`	`StableDiffusionControlNetInpaintPipeline`
Input image param	N/A	`image` (reference)	`image` + `mask_image`
Control image param	`image`	`control_image`	`control_image`
Default strength	N/A	0.8	1.0
Default conditioning scale	1.0	0.8	0.5
Latent initialization	Random noise	Noised encoding of input	Noised encoding of masked input
Callback tensors	latents, prompt_embeds, image	latents, prompt_embeds, control_image	latents, prompt_embeds, control_image, mask, masked_image_latents

Notes

In the img2img pipeline, the parameter naming differs from text-to-image: the ControlNet condition is `control_image` (not `image`), since `image` is used for the reference image.
The inpainting pipeline uses `strength=1.0` by default (full repainting of masked regions), while img2img uses `strength=0.8`.
The `prepare_control_image` method in the img2img pipeline is functionally identical to `prepare_image` in the text-to-image pipeline.
Both pipelines inherit from `DiffusionPipeline`, `StableDiffusionMixin`, `TextualInversionLoaderMixin`, `StableDiffusionLoraLoaderMixin`, `IPAdapterMixin`, and `FromSingleFileMixin`.

Related Pages

Huggingface_Diffusers_ControlNet_Output_Refinement -- Principle: theory of ControlNet-guided refinement
Huggingface_Diffusers_ControlNet_Pipeline_Call -- The base text-to-image ControlNet pipeline
Huggingface_Diffusers_Prepare_Control_Image -- How control images are prepared in these pipelines
Huggingface_Diffusers_Conditioning_Scale_Control -- How conditioning scale interacts with strength
