Implementation:Huggingface Diffusers ControlNet Img2Img Pipeline

From Leeroopedia
| Property | Value |
|---|---|
| Implementation Name | ControlNet Img2Img and Inpaint Pipelines |
| Type | API Doc |
| Workflow | ControlNet_Guided_Generation |
| Related Principle | Huggingface_Diffusers_ControlNet_Output_Refinement |
| Source File (Img2Img) | src/diffusers/pipelines/controlnet/pipeline_controlnet_img2img.py |
| Source File (Inpaint) | src/diffusers/pipelines/controlnet/pipeline_controlnet_inpaint.py |
| Status | Active |
| Implements | Principle:Huggingface_Diffusers_ControlNet_Output_Refinement |

API Signatures

StableDiffusionControlNetImg2ImgPipeline.__call__

@torch.no_grad()
def __call__(
    self,
    prompt: str | list[str] = None,
    image: PipelineImageInput = None,
    control_image: PipelineImageInput = None,
    height: int | None = None,
    width: int | None = None,
    strength: float = 0.8,
    num_inference_steps: int = 50,
    guidance_scale: float = 7.5,
    negative_prompt: str | list[str] | None = None,
    num_images_per_prompt: int | None = 1,
    eta: float = 0.0,
    generator: torch.Generator | list[torch.Generator] | None = None,
    latents: torch.Tensor | None = None,
    prompt_embeds: torch.Tensor | None = None,
    negative_prompt_embeds: torch.Tensor | None = None,
    ip_adapter_image: PipelineImageInput | None = None,
    ip_adapter_image_embeds: list[torch.Tensor] | None = None,
    output_type: str | None = "pil",
    return_dict: bool = True,
    cross_attention_kwargs: dict[str, Any] | None = None,
    controlnet_conditioning_scale: float | list[float] = 0.8,
    guess_mode: bool = False,
    control_guidance_start: float | list[float] = 0.0,
    control_guidance_end: float | list[float] = 1.0,
    clip_skip: int | None = None,
    callback_on_step_end: Callable | PipelineCallback | MultiPipelineCallbacks | None = None,
    callback_on_step_end_tensor_inputs: list[str] = ["latents"],
    **kwargs,
) -> StableDiffusionPipelineOutput | tuple:

StableDiffusionControlNetInpaintPipeline.__call__

@torch.no_grad()
def __call__(
    self,
    prompt: str | list[str] = None,
    image: PipelineImageInput = None,
    mask_image: PipelineImageInput = None,
    control_image: PipelineImageInput = None,
    height: int | None = None,
    width: int | None = None,
    padding_mask_crop: int | None = None,
    strength: float = 1.0,
    num_inference_steps: int = 50,
    guidance_scale: float = 7.5,
    negative_prompt: str | list[str] | None = None,
    num_images_per_prompt: int | None = 1,
    eta: float = 0.0,
    generator: torch.Generator | list[torch.Generator] | None = None,
    latents: torch.Tensor | None = None,
    prompt_embeds: torch.Tensor | None = None,
    negative_prompt_embeds: torch.Tensor | None = None,
    ip_adapter_image: PipelineImageInput | None = None,
    ip_adapter_image_embeds: list[torch.Tensor] | None = None,
    output_type: str | None = "pil",
    return_dict: bool = True,
    cross_attention_kwargs: dict[str, Any] | None = None,
    controlnet_conditioning_scale: float | list[float] = 0.5,
    guess_mode: bool = False,
    control_guidance_start: float | list[float] = 0.0,
    control_guidance_end: float | list[float] = 1.0,
    clip_skip: int | None = None,
    callback_on_step_end: Callable | PipelineCallback | MultiPipelineCallbacks | None = None,
    callback_on_step_end_tensor_inputs: list[str] = ["latents"],
    **kwargs,
) -> StableDiffusionPipelineOutput | tuple:

Import:

from diffusers import (
    ControlNetModel,
    StableDiffusionControlNetImg2ImgPipeline,
    StableDiffusionControlNetInpaintPipeline,
)

Key Parameters

Parameters Shared Across Both Pipelines

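| Parameter | Type | Default | Description |
|---|---|---|---|
| prompt | str or list[str] | None | Text prompt(s) guiding generation. |
| image | PipelineImageInput | None | Reference/source image for img2img or inpainting. |
| control_image | PipelineImageInput | None | ControlNet conditioning image (edges, depth, pose, etc.). |
| strength | float | 0.8 (img2img) / 1.0 (inpaint) | Denoising strength in [0, 1]. Higher values add more noise to the reference image and allow more transformation; 1.0 effectively ignores the reference image. |
| controlnet_conditioning_scale | float or list[float] | 0.8 (img2img) / 0.5 (inpaint) | Multiplier applied to the ControlNet outputs before they are added to the UNet residuals. Pass a list when using multiple ControlNets. |
| guidance_scale | float | 7.5 | Classifier-free guidance scale. |
| guess_mode | bool | False | When True, the ControlNet encoder tries to recognize the content of the control image even if no prompt is given. |
| control_guidance_start | float or list[float] | 0.0 | Fraction of the total steps at which ControlNet starts applying. |
| control_guidance_end | float or list[float] | 1.0 | Fraction of the total steps at which ControlNet stops applying. |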
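
To make the control_guidance_start / control_guidance_end window concrete, the following is a minimal sketch of how a per-step on/off weight could be derived from the two fractions. The helper name control_keep_weights is hypothetical, and the comparison rule is an assumption modeled on how the diffusers ControlNet pipelines appear to gate steps; check the source file for the exact logic.

```python
def control_keep_weights(num_steps, start=0.0, end=1.0):
    """Return 1.0 for steps where ControlNet is applied, 0.0 otherwise.

    Assumption: a step i is skipped when it begins before `start` or
    finishes after `end`, both given as fractions of the total step count.
    """
    weights = []
    for i in range(num_steps):
        outside = (i / num_steps) < start or ((i + 1) / num_steps) > end
        weights.append(0.0 if outside else 1.0)
    return weights


if __name__ == "__main__":
    # ControlNet active only during the first half of a 10-step run
    print(control_keep_weights(10, start=0.0, end=0.5))
    # ControlNet skipped for the first 20% of steps
    print(control_keep_weights(10, start=0.2, end=1.0))
```

Under this reading, a weight of 0.0 at a given step means the ControlNet residuals are scaled away for that step, so the UNet denoises without structural guidance there.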

Inpaint-Specific Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| mask_image | PipelineImageInput | None | Binary mask where white pixels mark regions to repaint and black pixels mark regions to keep. |
| padding_mask_crop | int or None | None | Crop padding around mask for higher-resolution inpainting of small regions. |

Return Value

| Type | Description |
|---|---|
| StableDiffusionPipelineOutput | Contains .images (list of PIL Images or numpy arrays) and .nsfw_content_detected. |
| tuple | When return_dict=False: (images, nsfw_content_detected). |
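
Because the return shape depends on return_dict, calling code that must accept both shapes can normalize them first. The sketch below is illustrative only: FakePipelineOutput is a hypothetical stand-in that mirrors the two documented fields so the example runs without diffusers installed, and unpack_result is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class FakePipelineOutput:
    """Hypothetical stand-in mirroring the fields of StableDiffusionPipelineOutput."""
    images: List[Any]
    nsfw_content_detected: Optional[List[bool]]


def unpack_result(result):
    """Normalize either return shape to (images, nsfw_flags)."""
    if isinstance(result, tuple):
        # return_dict=False: a plain (images, nsfw_content_detected) tuple
        images, nsfw_flags = result
    else:
        # return_dict=True: an output object with named attributes
        images, nsfw_flags = result.images, result.nsfw_content_detected
    return images, nsfw_flags


if __name__ == "__main__":
    print(unpack_result((["img0"], [False])))
    print(unpack_result(FakePipelineOutput(images=["img0"], nsfw_content_detected=None)))
```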

Execution Flow: Img2Img Pipeline

The img2img pipeline differs from text-to-image in these key steps:

1. Dual Image Processing

# Step 4: Preprocess the reference image (for latent initialization)
image = self.image_processor.preprocess(image, height=height, width=width).to(dtype=torch.float32)

# Step 5: Prepare the ControlNet conditioning image (separate from reference)
control_image = self.prepare_control_image(
    image=control_image,
    width=width, height=height,
    batch_size=batch_size * num_images_per_prompt,
    num_images_per_prompt=num_images_per_prompt,
    device=device, dtype=controlnet.dtype,
    do_classifier_free_guidance=self.do_classifier_free_guidance,
    guess_mode=guess_mode,
)

Source: src/diffusers/pipelines/controlnet/pipeline_controlnet_img2img.py, lines 1146-1181.

2. Truncated Timestep Schedule

def get_timesteps(self, num_inference_steps, strength, device):
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    timesteps = self.scheduler.timesteps[t_start * self.scheduler.order :]
    return timesteps, num_inference_steps - t_start

Source: src/diffusers/pipelines/controlnet/pipeline_controlnet_img2img.py, lines 801-810.
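
A worked example of the arithmetic in get_timesteps may help show how strength shortens the run. This is a sketch under stated assumptions: the function name truncated_schedule is hypothetical, the scheduler is assumed to be first-order (order 1), and a countdown of step indices stands in for scheduler.timesteps.

```python
def truncated_schedule(num_inference_steps, strength):
    # Stand-in for scheduler.timesteps (assumption: first-order scheduler,
    # one entry per step). Values are step indices counting down, not the
    # real timestep values a scheduler would produce.
    all_timesteps = list(range(num_inference_steps - 1, -1, -1))

    # Same arithmetic as get_timesteps above
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    return all_timesteps[t_start:], num_inference_steps - t_start


if __name__ == "__main__":
    for steps, strength in [(20, 0.8), (30, 0.3), (30, 1.0)]:
        timesteps, actual_steps = truncated_schedule(steps, strength)
        print(f"steps={steps} strength={strength} -> runs {actual_steps} denoising steps")
```

With num_inference_steps=20 and strength=0.8, the first 4 entries of the schedule are skipped and 16 denoising steps run; with strength=1.0 the full schedule is used.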

3. Latent Initialization from Reference Image

def prepare_latents(self, image, timestep, batch_size, num_images_per_prompt, dtype, device, generator=None):
    image = image.to(device=device, dtype=dtype)
    batch_size = batch_size * num_images_per_prompt

    if image.shape[1] == 4:
        init_latents = image  # Already in latent space
    else:
        # Encode to latent space via VAE
        init_latents = retrieve_latents(self.vae.encode(image), generator=generator)
        init_latents = self.vae.config.scaling_factor * init_latents

    # Add noise at the starting timestep
    noise = randn_tensor(init_latents.shape, generator=generator, device=device, dtype=dtype)
    init_latents = self.scheduler.add_noise(init_latents, noise, timestep)
    return init_latents

Source: src/diffusers/pipelines/controlnet/pipeline_controlnet_img2img.py, lines 813-876.
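
The scheduler.add_noise call is where the reference image gets mixed with noise. As a rough illustration, the sketch below uses the DDPM-style forward process, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise. This is an assumption: the exact formula depends on the scheduler in use (sigma-based schedulers scale differently), and the function here is a plain-Python stand-in operating on flat lists, not the library method.

```python
import math


def add_noise(clean, noise, alpha_bar_t):
    """DDPM-style forward noising of a flat list of latent values.

    alpha_bar_t is the cumulative product of alphas at the chosen timestep:
    close to 1.0 early in the schedule (little noise), close to 0.0 late.
    """
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [signal_scale * c + noise_scale * n for c, n in zip(clean, noise)]


if __name__ == "__main__":
    latents = [1.0, -2.0]
    noise = [0.5, 0.5]
    print(add_noise(latents, noise, alpha_bar_t=1.0))  # reference fully preserved
    print(add_noise(latents, noise, alpha_bar_t=0.0))  # pure noise
```

A higher strength starts denoising from a later (noisier) timestep, which in this picture corresponds to a smaller alpha_bar_t and therefore less of the reference image surviving in the initial latents.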

Execution Flow: Inpaint Pipeline

1. Mask Processing

def prepare_mask_latents(self, mask, masked_image, batch_size, height, width, dtype, device, generator,
                         do_classifier_free_guidance):
    # Resize mask to latent dimensions
    mask = torch.nn.functional.interpolate(
        mask, size=(height // self.vae_scale_factor, width // self.vae_scale_factor)
    )
    mask = mask.to(device=device, dtype=dtype)

    # Encode masked image to latent space
    masked_image = masked_image.to(device=device, dtype=dtype)
    if masked_image.shape[1] == 4:
        masked_image_latents = masked_image
    else:
        masked_image_latents = self._encode_vae_image(masked_image, generator=generator)

    # Duplicate for CFG
    mask = torch.cat([mask] * 2) if do_classifier_free_guidance else mask
    masked_image_latents = torch.cat([masked_image_latents] * 2) if do_classifier_free_guidance else masked_image_latents

    return mask, masked_image_latents

Source: src/diffusers/pipelines/controlnet/pipeline_controlnet_inpaint.py, lines 907-950.
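
To illustrate the mask resize step, here is a pure-Python sketch of nearest-neighbor downsampling, which is the default mode of torch.nn.functional.interpolate. The function name downsample_mask_nearest and the scale factor of 8 are assumptions for illustration; the mask is a nested list of 0/1 values rather than a tensor.

```python
def downsample_mask_nearest(mask, vae_scale_factor=8):
    """Shrink a 2D 0/1 mask to latent resolution by nearest-neighbor sampling."""
    height, width = len(mask), len(mask[0])
    out_h, out_w = height // vae_scale_factor, width // vae_scale_factor
    # Each output cell copies the source pixel at floor(index * in_size / out_size)
    return [
        [mask[(y * height) // out_h][(x * width) // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


if __name__ == "__main__":
    # 16x16 mask whose right half is white (1 = repaint)
    half_mask = [[1 if x >= 8 else 0 for x in range(16)] for _ in range(16)]
    print(downsample_mask_nearest(half_mask))

    # A one-pixel-wide stroke can disappear entirely at latent resolution
    thin_mask = [[1 if x == 3 else 0 for x in range(16)] for _ in range(16)]
    print(downsample_mask_nearest(thin_mask))
```

The second case suggests why very small masked regions may benefit from padding_mask_crop: details narrower than the VAE scale factor can be lost when the mask is resized.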

2. Denoising with Mask Application

During each denoising step, the mask is applied to blend the original latents (in unmasked regions) with the newly generated latents (in masked regions). The ControlNet conditioning guides the structure of the generated content within the masked area.
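
The blending described above can be sketched as a per-element linear mix. This is a simplified illustration on flat lists with a hypothetical function name, blend_latents; the real pipeline operates on tensors and may also re-noise the original latents to the current timestep before blending, which this sketch omits.

```python
def blend_latents(original, generated, mask):
    """Keep original latents where mask is 0, use generated latents where mask is 1."""
    return [(1.0 - m) * o + m * g for o, g, m in zip(original, generated, mask)]


if __name__ == "__main__":
    original = [1.0, 2.0]
    generated = [10.0, 20.0]
    mask = [0.0, 1.0]  # first element preserved, second repainted
    print(blend_latents(original, generated, mask))
```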

Usage Examples

Image-to-Image with Canny ControlNet

import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetImg2ImgPipeline, ControlNetModel, UniPCMultistepScheduler
from diffusers.utils import load_image

# Load source image
source_image = load_image(
    "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
)

# Generate Canny edge map as control image
np_image = np.array(source_image)
canny_edges = cv2.Canny(np_image, 100, 200)
canny_edges = canny_edges[:, :, None]
canny_edges = np.concatenate([canny_edges, canny_edges, canny_edges], axis=2)
canny_image = Image.fromarray(canny_edges)

# Load pipeline
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

# Generate with structural guidance
result = pipe(
    "futuristic-looking woman",
    image=source_image,          # Reference image for appearance
    control_image=canny_image,   # ControlNet conditioning for structure
    strength=0.8,
    num_inference_steps=20,
    controlnet_conditioning_scale=0.8,
    generator=torch.manual_seed(0),
).images[0]

Inpainting with ControlNet

import torch
from PIL import Image
from diffusers import StableDiffusionControlNetInpaintPipeline, ControlNetModel
from diffusers.utils import load_image

# Load images (placeholder URLs; substitute your own files)
source_image = load_image("https://example.com/original.png")
mask_image = load_image("https://example.com/mask.png")      # White = repaint
# ControlNet condition; it must match what the chosen ControlNet was trained on
control_image = load_image("https://example.com/edges.png")

# Load pipeline
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_inpaint", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-inpainting",
    controlnet=controlnet,
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()

# Inpaint with structural guidance
result = pipe(
    "a wooden bench in a park",
    image=source_image,
    mask_image=mask_image,
    control_image=control_image,
    strength=1.0,
    num_inference_steps=30,
    controlnet_conditioning_scale=0.5,
).images[0]

Adjusting Strength for Different Effects

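# Assumes pipe_img2img is a loaded StableDiffusionControlNetImg2ImgPipeline,
# photo is the source PIL image, and canny_of_photo is its Canny edge map
# (prepared as in the image-to-image example above).

# Subtle style transfer -- preserve most of the original image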
subtle_result = pipe_img2img(
    "oil painting style",
    image=photo,
    control_image=canny_of_photo,
    strength=0.3,
    controlnet_conditioning_scale=0.8,
    num_inference_steps=30,
).images[0]

# Strong transformation -- major style change with structural preservation
strong_result = pipe_img2img(
    "cyberpunk neon city",
    image=photo,
    control_image=canny_of_photo,
    strength=0.9,
    controlnet_conditioning_scale=1.0,
    num_inference_steps=30,
).images[0]

Pipeline Component Comparison

| Component | Text-to-Image | Img2Img | Inpaint |
|---|---|---|---|
| Pipeline Class | StableDiffusionControlNetPipeline | StableDiffusionControlNetImg2ImgPipeline | StableDiffusionControlNetInpaintPipeline |
| Input image param | N/A | image (reference) | image + mask_image |
| Control image param | image | control_image | control_image |
| Default strength | N/A | 0.8 | 1.0 |
| Default conditioning scale | 1.0 | 0.8 | 0.5 |
| Latent initialization | Random noise | Noised encoding of input | Noised encoding of masked input |
| Callback tensors | latents, prompt_embeds, image | latents, prompt_embeds, control_image | latents, prompt_embeds, control_image, mask, masked_image_latents |

Notes

  • In the img2img pipeline, the parameter naming differs from text-to-image: the ControlNet condition is control_image (not image), since image is used for the reference image.
  • The inpainting pipeline uses strength=1.0 by default (full repainting of masked regions), while img2img uses strength=0.8.
  • The prepare_control_image method in the img2img pipeline is functionally identical to prepare_image in the text-to-image pipeline.
  • Both pipelines inherit from DiffusionPipeline, StableDiffusionMixin, TextualInversionLoaderMixin, StableDiffusionLoraLoaderMixin, IPAdapterMixin, and FromSingleFileMixin.
