Principle:AUTOMATIC1111 Stable diffusion webui Noise addition and guided denoising
| Knowledge Sources | |
|---|---|
| Domains | Diffusion Models, Image Generation, Image Editing, Inpainting |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Noise addition and guided denoising is the core diffusion process in image-to-image generation. Calibrated noise is added to the encoded source image and then iteratively removed under text guidance; the denoising strength controls the balance between fidelity to the source and creative freedom.
Description
The image-to-image sampling process implements the SDEdit algorithm: rather than starting from pure noise (as in text-to-image), it starts from the source image encoded in latent space with a controlled amount of noise added. The diffusion model then denoises this noisy latent back to a clean image, guided by the text prompt.
The process involves three key stages:
1. Noise Generation: A noise tensor is generated using the ImageRNG system, which supports deterministic seeded noise, subseed blending via spherical linear interpolation (slerp), and seed-resize functionality for resolution-independent seeds. The noise may be scaled by an initial_noise_multiplier to adjust the overall noise amplitude.
2. Sampler Invocation: The sampler's sample_img2img() method receives the initial latent, the noise tensor, and the text conditioning (positive and negative). Internally, the sampler:
- Determines the starting timestep based on denoising strength
- Adds noise to the init_latent corresponding to that timestep
- Iteratively denoises from the starting timestep to timestep 0
- Uses classifier-free guidance to steer generation toward the text prompt
3. Mask Compositing: After sampling, if a mask is present, the denoised samples are composited with the original init_latent using the mask tensors. The formula blends the generated content in masked regions with the preserved original content in unmasked regions. Script hooks (on_mask_blend) can modify this blending behavior.
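The subseed blending mentioned in stage 1 can be illustrated with a minimal sketch of spherical linear interpolation between two seeded noise vectors. This is not the ImageRNG code itself: the helper names `seeded_noise` and `slerp`, the use of Python's `random` module, and the flat-list representation of the noise are all assumptions made for illustration.

```python
import math
import random


def seeded_noise(seed, n):
    """Hypothetical helper: n Gaussian samples from a seeded generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def slerp(t, a, b, eps=1e-6):
    """Spherical linear interpolation between two non-zero vectors.

    t = 0 returns a and t = 1 returns b. When the vectors are nearly
    parallel or anti-parallel the slerp weights become numerically
    unstable, so the function falls back to plain linear interpolation.
    """
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    dot = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    dot = max(-1.0, min(1.0, dot))  # guard acos against rounding error
    if abs(dot) > 1.0 - eps:
        return [(1.0 - t) * x + t * y for x, y in zip(a, b)]
    omega = math.acos(dot)
    sin_omega = math.sin(omega)
    weight_a = math.sin((1.0 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


# Blend 25% of the way from the main-seed noise toward the subseed noise.
base = seeded_noise(1234, 16)
variation = seeded_noise(5678, 16)
blended = slerp(0.25, base, variation)
```

Interpolating along the arc rather than the straight line keeps the magnitude of the blended noise close to that of the two inputs, which is the usual reason for preferring slerp over a linear mix of Gaussian noise.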
Usage
The sampling stage is the computational bottleneck of image-to-image generation. Key considerations:
- Denoising strength directly controls the starting timestep. Lower values mean fewer denoising steps and closer fidelity to the source.
- Initial noise multiplier provides fine-grained control over noise amplitude independently of the timestep schedule.
- Sampler choice affects both quality and speed. Different samplers (Euler, DPM++, etc.) have different convergence properties.
- The mask compositing step after sampling is critical for inpainting: it restores the source latent values in unmasked regions, so those regions do not drift as a side effect of the denoising process.
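The effect of denoising strength on the amount of sampling work can be sketched as below. The linear mapping and the clamp at 0.999 are assumptions chosen for illustration, and `img2img_start_step` is a hypothetical name; the exact rule depends on the sampler and on the webui settings in use.

```python
def img2img_start_step(denoising_strength, steps):
    """Return how many of `steps` sampling steps would actually be run.

    Assumes a simple linear mapping from strength to step count, with the
    strength clamped to [0, 0.999] so that a strength of 1.0 still leaves
    a trace of the source latent.
    """
    strength = max(0.0, min(denoising_strength, 0.999))
    return int(strength * steps)


# With 20 requested steps, lower strength means fewer denoising steps.
for strength in (0.25, 0.5, 0.75, 1.0):
    print(strength, img2img_start_step(strength, 20))
```

Under this assumption, halving the strength roughly halves the number of model evaluations, which is why low-strength edits are both faster and closer to the source.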
Theoretical Basis
The SDEdit process can be formalized as follows. Given the encoded source latent z_0, denoising strength s, and total timesteps T:
Step 1: Determine starting timestep
t_start = schedule(s, T) # maps strength to a timestep index
Step 2: Add noise to source latent
epsilon ~ N(0, I) # generated from seeded RNG
z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * epsilon
where t = t_start and alpha_bar_t is the cumulative product of the noise schedule up to that timestep
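Step 2 can be made concrete with a small sketch. The linear beta schedule and its default endpoints below are assumptions used only to produce plausible alpha_bar values; they are not taken from the webui, and `linear_alpha_bar` and `add_noise` are hypothetical helper names operating on flat lists rather than latent tensors.

```python
import math
import random


def linear_alpha_bar(num_timesteps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule.

    Requires num_timesteps >= 2.
    """
    alpha_bar = []
    running = 1.0
    for i in range(num_timesteps):
        beta = beta_start + (beta_end - beta_start) * i / (num_timesteps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def add_noise(z0, eps, alpha_bar_t):
    """z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * epsilon."""
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [signal_scale * z + noise_scale * e for z, e in zip(z0, eps)]


# Noise a toy 4-element "latent" at an intermediate timestep.
rng = random.Random(0)
z0 = [0.5, -1.0, 0.25, 2.0]
eps = [rng.gauss(0.0, 1.0) for _ in z0]
schedule = linear_alpha_bar(1000)
z_t = add_noise(z0, eps, schedule[400])
```

Because alpha_bar_t shrinks as t grows, a later starting timestep keeps less of z_0 and more of epsilon, which is how a higher denoising strength translates into more creative freedom.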
Step 3: Iterative denoising from t_start to 0
for t = t_start, t_start-1, ..., 1:
    epsilon_pred = UNet(z_t, t, c_text) # conditional prediction
    epsilon_uncond = UNet(z_t, t, c_uncond) # unconditional prediction
    epsilon_guided = epsilon_uncond + cfg_scale * (epsilon_pred - epsilon_uncond)
    z_{t-1} = sampler_step(z_t, epsilon_guided, t)
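The classifier-free guidance combination inside the loop is simple enough to check by hand. The sketch below applies the same formula to flat lists; `cfg_combine` is a hypothetical name, and the two predictions are made-up numbers standing in for UNet outputs.

```python
def cfg_combine(eps_uncond, eps_cond, cfg_scale):
    """epsilon_guided = epsilon_uncond + cfg_scale * (epsilon_cond - epsilon_uncond)."""
    return [u + cfg_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Stand-in predictions for a two-element latent.
eps_uncond = [1.0, 2.0]
eps_cond = [2.0, 4.0]

same_as_cond = cfg_combine(eps_uncond, eps_cond, 1.0)    # scale 1: conditional prediction
same_as_uncond = cfg_combine(eps_uncond, eps_cond, 0.0)  # scale 0: unconditional prediction
extrapolated = cfg_combine(eps_uncond, eps_cond, 3.0)    # scale > 1: pushed past the conditional
```

A scale of 1 reproduces the conditional prediction, and larger scales extrapolate away from the unconditional prediction, which strengthens adherence to the prompt.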
Step 4: Mask compositing (for inpainting)
z_final = z_denoised * nmask + z_0 * mask
The mask compositing formula ensures exact preservation of unmasked regions:
samples_final = samples * nmask + init_latent * mask
where:
nmask = 1.0 in regions to regenerate (masked area)
mask = 1.0 in regions to preserve (unmasked area)
nmask + mask = 1.0 everywhere
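A minimal sketch of this compositing rule on flat lists is shown below. The function name `composite` and the toy values are assumptions for illustration; in practice the same arithmetic is applied element-wise to latent tensors.

```python
def composite(samples, init_latent, nmask):
    """samples_final = samples * nmask + init_latent * mask, with mask = 1 - nmask."""
    result = []
    for sample, source, regenerate in zip(samples, init_latent, nmask):
        preserve = 1.0 - regenerate  # this is `mask` in the formula above
        result.append(sample * regenerate + source * preserve)
    return result


samples = [9.0, 9.0, 9.0]      # denoised output
init_latent = [1.0, 2.0, 3.0]  # encoded source
nmask = [1.0, 0.0, 0.5]        # regenerate, preserve, soft edge

final = composite(samples, init_latent, nmask)
```

The first element takes the generated value, the second keeps the source value unchanged, and the third is an even blend of the two, which is what happens along a soft mask edge.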
The noise multiplier provides an additional scaling factor:
if initial_noise_multiplier != 1.0:
    noise = noise * initial_noise_multiplier
This allows boosting or reducing the noise amplitude beyond what the denoising strength alone controls, useful for fine-tuning the balance between randomness and source fidelity.
The script hook system allows extensions to modify the blending behavior via MaskBlendArgs, enabling custom compositing strategies such as gradient-aware blending or frequency-domain compositing.