Workflow:Huggingface Diffusers ControlNet Guided Generation

Knowledge Sources	Diffusers ControlNet Guide ControlNet
Domains	Diffusion_Models, Conditional_Generation, ControlNet
Last Updated	2026-02-13 21:00 GMT

Overview

End-to-end process for generating images guided by spatial conditioning signals (edges, depth maps, poses) using ControlNet alongside a diffusion pipeline.

Description

This workflow demonstrates how to use ControlNet models to add precise spatial control over image generation. ControlNet is a neural network architecture that takes a conditioning image (such as a Canny edge map, depth map, human pose skeleton, or segmentation mask) and produces residual features that guide the base diffusion model's denoising process. The workflow covers preparing conditioning images, loading ControlNet models alongside the base pipeline, configuring the conditioning strength and guidance schedule, and running inference with both text and spatial conditioning. Multiple ControlNets can be combined simultaneously for multi-modal control. The workflow also covers training a new ControlNet from scratch on custom conditioning data.

Usage

Execute this workflow when you need precise spatial control over the generated image layout, structure, or composition. This applies when you have a reference image that defines the spatial structure (edges, depth, pose) and want to generate a new image that follows that structure while matching a text description. Common use cases include generating images that follow a specific composition, maintaining consistent character poses across multiple generations, converting sketches to photorealistic images, and architectural rendering from floor plans.

Execution Steps

Step 1: Conditioning Image Preparation

Create or extract the conditioning signal from a reference image. This involves applying an image processing algorithm to produce a structured representation of the desired spatial layout. Different ControlNet models expect different types of conditioning inputs.

Key considerations:

Canny edge detection extracts edge maps for structural control
Depth estimation (MiDaS, Marigold) produces depth maps for 3D-aware generation
OpenPose extracts human body/hand/face keypoints for pose control
Segmentation masks provide semantic region control
Normal maps, scribbles, and lineart are also supported conditioning types
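The conditioning image should match the target output resolution; if it does not, the pipeline resizes it to the requested height and width, which can distort the control signal when the aspect ratio differs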
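The preparation step can be sketched as follows for the Canny case. This is a minimal sketch, not the guide's own code: the helper names (`snap_to_multiple`, `target_size`, `make_canny_condition`), the 768-pixel long side, the multiple-of-8 rounding, and the 100/200 thresholds are all assumptions chosen for illustration. OpenCV, NumPy, and Pillow are imported only inside the edge-map function.

```python
def snap_to_multiple(value: int, multiple: int = 8) -> int:
    """Round down to the nearest positive multiple.

    Assumption: the base model wants sides divisible by 8 (SD-style VAE).
    """
    return max(multiple, (value // multiple) * multiple)


def target_size(width: int, height: int, long_side: int = 768):
    """Scale (width, height) so the longer side is about `long_side`."""
    scale = long_side / max(width, height)
    return (
        snap_to_multiple(round(width * scale)),
        snap_to_multiple(round(height * scale)),
    )


def make_canny_condition(image, low_threshold: int = 100, high_threshold: int = 200):
    """Return a 3-channel PIL edge map resized to the target resolution."""
    # Imported lazily so the size helpers work without these packages.
    import cv2
    import numpy as np
    from PIL import Image

    resized = image.convert("RGB").resize(target_size(*image.size))
    edges = cv2.Canny(np.array(resized), low_threshold, high_threshold)
    # Repeat the single edge channel so the result looks like an RGB image.
    return Image.fromarray(np.stack([edges] * 3, axis=2))
```

Generating at the size of the returned image (for example by passing its width and height to the pipeline call) keeps the conditioning image and the output aligned.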

Step 2: Model Loading

Load the base diffusion pipeline and the ControlNet model(s) as separate components, then combine them into a ControlNet pipeline. The ControlNet model is loaded independently and passed to the pipeline constructor alongside the base model components.

Key considerations:

ControlNet models are architecture-specific: SD 1.5, SDXL, SD3, and Flux each need a ControlNet trained for that base model
Passing a list of ControlNets to the pipeline wraps them in a MultiControlNetModel for simultaneous use
Load weights in half precision (torch_dtype=torch.float16) when running on GPU to reduce memory use
ControlNet-Union models support multiple conditioning types in a single model
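A loading helper along these lines makes the architecture pairing explicit. It is a sketch under assumptions: the architecture keys (`sd15`, `sdxl`, `sd3`, `flux`) and the function names are hypothetical, while the class names are the ones diffusers exposes for each family as far as I know. torch and diffusers are imported only when a pipeline is actually loaded.

```python
# Architecture key -> (ControlNet class name, pipeline class name).
# The keys are illustrative labels, not diffusers identifiers.
CONTROLNET_CLASSES = {
    "sd15": ("ControlNetModel", "StableDiffusionControlNetPipeline"),
    "sdxl": ("ControlNetModel", "StableDiffusionXLControlNetPipeline"),
    "sd3": ("SD3ControlNetModel", "StableDiffusion3ControlNetPipeline"),
    "flux": ("FluxControlNetModel", "FluxControlNetPipeline"),
}


def resolve_classes(arch: str):
    """Return the (ControlNet, pipeline) class names for an architecture key."""
    if arch not in CONTROLNET_CLASSES:
        raise ValueError(f"unknown architecture {arch!r}")
    return CONTROLNET_CLASSES[arch]


def load_controlnet_pipeline(arch: str, base_id: str, controlnet_ids):
    """Load one or more ControlNets and attach them to the base pipeline."""
    # Heavy imports are deferred so resolve_classes stays dependency-free.
    import diffusers
    import torch

    model_name, pipeline_name = resolve_classes(arch)
    model_cls = getattr(diffusers, model_name)
    pipeline_cls = getattr(diffusers, pipeline_name)

    controlnets = [
        model_cls.from_pretrained(cid, torch_dtype=torch.float16)
        for cid in controlnet_ids
    ]
    # A single model is passed as-is; a list is expected to be wrapped
    # by the pipeline into its multi-ControlNet container.
    controlnet = controlnets[0] if len(controlnets) == 1 else controlnets
    return pipeline_cls.from_pretrained(
        base_id, controlnet=controlnet, torch_dtype=torch.float16
    )
```

A call might look like `load_controlnet_pipeline("sd15", "stable-diffusion-v1-5/stable-diffusion-v1-5", ["lllyasviel/sd-controlnet-canny"])`; both repository IDs are example checkpoints that the source does not name.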

Step 3: Conditioning Configuration

Configure how strongly and when the ControlNet conditioning influences the generation process. The conditioning scale controls the overall influence, while the guidance start and end parameters define the timestep range during which the ControlNet is active.

Key considerations:

controlnet_conditioning_scale (default 1.0) controls ControlNet influence strength
Lower values (0.5-0.7) produce more creative results; higher values (1.0-1.5) produce stricter adherence
control_guidance_start and control_guidance_end are fractions of the total denoising steps (0.0 to 1.0) that bound when ControlNet is active
Applying ControlNet only in early steps (e.g., 0.0-0.5) gives structural guidance while allowing creative freedom in later steps
Guess mode (guess_mode=True) lets the ControlNet try to recognize the content of the conditioning image even when the prompt is empty
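To see what a guidance window means in terms of concrete steps, the sketch below maps a start/end fraction onto step indices. The rule it uses (a step counts as active when it lies entirely inside the window) reflects my understanding of how the Stable Diffusion ControlNet pipelines apply these parameters, and should be treated as an assumption; `active_steps` and `conditioning_kwargs` are hypothetical helper names.

```python
def active_steps(num_steps: int, start: float, end: float):
    """Return the step indices at which ControlNet would be applied."""
    if not 0.0 <= start < end <= 1.0:
        raise ValueError("need 0.0 <= start < end <= 1.0")
    return [
        i
        for i in range(num_steps)
        # Step i covers the interval [i / n, (i + 1) / n] of the schedule.
        if i / num_steps >= start and (i + 1) / num_steps <= end
    ]


def conditioning_kwargs(scale=1.0, start=0.0, end=1.0, guess_mode=False):
    """Bundle the conditioning settings as keyword arguments for the pipeline call."""
    return {
        "controlnet_conditioning_scale": scale,
        "control_guidance_start": start,
        "control_guidance_end": end,
        "guess_mode": guess_mode,
    }


# Structure-only guidance: ControlNet acts on the first half of 10 steps.
early_only = active_steps(10, 0.0, 0.5)
```

With 10 steps and a window of 0.0 to 0.5, `early_only` holds steps 0 through 4, so the remaining five steps run without ControlNet and are free to refine detail.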

Step 4: Inference Execution

Run the ControlNet-augmented inference pipeline. At each denoising step, the conditioning image is passed through the ControlNet to produce residual features (down-block and mid-block residuals in UNet-based models such as SD 1.5 and SDXL). These residuals are added to the corresponding layers of the base denoiser during its forward pass, steering the generation toward the desired spatial structure while following the text prompt.
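The residual mechanism can be illustrated with a toy sketch in which plain numbers stand in for feature tensors. It is a conceptual illustration only, not the diffusers implementation; `combine_residuals` and `inject` are hypothetical names.

```python
def combine_residuals(residuals_per_controlnet, scales):
    """Scale each ControlNet's residuals and sum them block by block."""
    if len(residuals_per_controlnet) != len(scales):
        raise ValueError("one scale per ControlNet is required")
    combined = None
    for residuals, scale in zip(residuals_per_controlnet, scales):
        scaled = [r * scale for r in residuals]
        if combined is None:
            combined = scaled
        else:
            combined = [c + s for c, s in zip(combined, scaled)]
    return combined


def inject(base_features, residuals):
    """Add the combined residuals to the base model's matching features."""
    return [f + r for f, r in zip(base_features, residuals)]


# Two ControlNets, two blocks each; the second ControlNet is weighted at 0.5.
combined = combine_residuals([[1.0, 2.0], [10.0, 20.0]], [1.0, 0.5])
steered = inject([0.5, 0.5], combined)
```

Because the helpers only use `*` and `+`, the same code would work on arrays or tensors of matching shapes.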

Key considerations:

The ControlNet forward pass runs before each UNet step, adding computational overhead
With multiple ControlNets, each conditioning image is processed by its own ControlNet and the resulting residuals are summed
Classifier-free guidance still applies alongside ControlNet conditioning
IP-Adapter can be combined with ControlNet for additional image-based style guidance
Memory optimization (CPU offloading, attention slicing) is especially important due to the additional model
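A call to the pipeline with these considerations in mind might look like the sketch below. It assumes an SD 1.5-style text-to-image ControlNet pipeline, where the conditioning image is passed as `image`; other families may name that argument differently. The helper names and default values are illustrative assumptions, and torch is imported only when generation runs.

```python
def build_call_kwargs(prompt, control_images, scales,
                      num_inference_steps=30, guidance_scale=7.5):
    """Assemble keyword arguments for one or several ControlNets."""
    if len(control_images) != len(scales):
        raise ValueError("one conditioning scale per control image is required")
    single = len(control_images) == 1
    return {
        "prompt": prompt,
        # A single ControlNet takes one image; several take parallel lists.
        "image": control_images[0] if single else list(control_images),
        "controlnet_conditioning_scale": scales[0] if single else list(scales),
        "num_inference_steps": num_inference_steps,
        # Classifier-free guidance is configured independently of ControlNet.
        "guidance_scale": guidance_scale,
    }


def prepare_for_low_memory(pipe):
    """Enable the memory optimizations mentioned above on a loaded pipeline."""
    pipe.enable_model_cpu_offload()   # moves submodels to the GPU on demand
    pipe.enable_attention_slicing()   # computes attention in smaller chunks
    return pipe


def generate(pipe, seed, **call_kwargs):
    """Run one seeded generation and return the first image."""
    import torch  # deferred so the helpers above need no dependencies

    generator = torch.Generator(device="cpu").manual_seed(seed)
    return pipe(generator=generator, **call_kwargs).images[0]
```

Fixing the seed through an explicit generator makes a run reproducible, which helps when tuning the conditioning scale against the same noise.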

Step 5: Output Refinement

Process and evaluate the generated image. Optionally apply img2img refinement, upscaling, or additional ControlNet passes for iterative improvement. Compare the output against the conditioning image to verify structural adherence.

Key considerations:

ControlNet img2img pipelines allow refining an existing image with conditioning
ControlNet inpaint pipelines enable selective regeneration of masked regions
Tile ControlNet can be used for super-resolution while maintaining consistency
Multiple generation passes with different random seeds can be compared for best results
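One way to make the structural comparison concrete is to extract edges from the output and measure how many conditioning edges reappear. The metric below is an assumption introduced for illustration, not something the source prescribes, and `edge_agreement` and `best_seed` are hypothetical names. It works on nested lists of 0/1 values so that it runs without extra packages.

```python
def edge_agreement(control_edges, output_edges):
    """Fraction of conditioning edge pixels that are also edges in the output."""
    if len(control_edges) != len(output_edges):
        raise ValueError("edge maps must have the same height")
    hits = total = 0
    for control_row, output_row in zip(control_edges, output_edges):
        if len(control_row) != len(output_row):
            raise ValueError("edge maps must have the same width")
        for control_pixel, output_pixel in zip(control_row, output_row):
            if control_pixel:
                total += 1
                if output_pixel:
                    hits += 1
    # An empty conditioning map imposes no structure, so treat it as satisfied.
    return hits / total if total else 1.0


def best_seed(scores):
    """Pick the seed with the highest agreement score."""
    return max(scores, key=scores.get)


control = [[0, 1, 0], [0, 1, 0]]
output = [[0, 1, 0], [0, 0, 0]]
score = edge_agreement(control, output)
```

Here one of the two conditioning edge pixels survives in the output, so `score` is 0.5. Scoring each seed's output this way and passing the results to `best_seed` gives a simple, if crude, way to rank candidates before visual inspection.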
