Principle:AUTOMATIC1111 Stable diffusion webui Image preprocessing and latent encoding
| Knowledge Sources | |
|---|---|
| Domains | Image Generation, Variational Autoencoders, Latent Space, Image Processing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Image preprocessing and latent encoding is the process of transforming a pixel-space source image into a latent-space tensor suitable for diffusion-based denoising, including mask processing, crop region computation, and content-aware fill of masked regions.
Description
Before the diffusion model can operate on a source image in the img2img pipeline, the image must be transformed from pixel space (H x W x 3 RGB values in [0, 255]) into latent space (C x H/f x W/f floating-point features, where f=8 is the downsampling factor and C=4 is the latent channel count). This transformation is performed by the Variational Autoencoder (VAE) encoder.
The preprocessing pipeline involves several stages:
1. Mask Processing: If an inpainting mask is provided, it is first converted to a binary mask (thresholded at 128), optionally inverted, and then blurred with separate horizontal and vertical Gaussian kernels to create soft edges. In "inpaint only masked" mode (also called inpaint-full-res), a crop region is computed around the masked area so that only the relevant region is inpainted at full resolution.
2. Image Resizing and Cropping: The source image is resized according to the resize mode. For inpaint-full-res mode, the image is first cropped to the mask bounding box (with padding) and then resized to the target dimensions, allowing the model to operate at full resolution on the masked region rather than the entire image.
3. Overlay Image Construction: For inpainting, an overlay image is constructed from the original unmasked regions. This overlay is stored as an RGBA image that will be composited over the generated output during post-processing, ensuring that unmasked regions remain pixel-perfect.
4. Content-Aware Fill: Depending on the inpainting fill mode, the masked region of the image may be filled with content-aware interpolation from surrounding pixels before encoding. This provides the VAE encoder with plausible pixel values rather than arbitrary content in the masked area.
5. VAE Encoding: The preprocessed image tensor (normalized to [0, 1] and rearranged to BCHW format) is passed to images_tensor_to_samples(), which rescales it to the [-1, 1] range expected by the VAE encoder and produces the initial latent tensor. This function supports different encoding approximation methods.
6. Latent Mask Construction: The pixel-space mask is downsampled to latent dimensions and converted to a latent-space tensor pair: mask (1.0 where the image should be preserved) and nmask (1.0 where the image should be regenerated). For fill modes 2 and 3, the latent is further modified by replacing masked regions with random noise or zeros respectively.
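The binarize-then-blur step of mask processing (stage 1 above) can be sketched in plain Python. This is a minimal illustration, not the webui implementation: the helper names (`binarize`, `gaussian_kernel`, `blur_rows`, `soften_mask`), the 3-sigma kernel radius, and the clamped border handling are all assumptions made for the example.

```python
import math

def binarize(mask, invert=False):
    """Threshold an 8-bit mask: values above 128 become 255, the rest 0."""
    out = [[255 if v > 128 else 0 for v in row] for row in mask]
    if invert:
        out = [[255 - v for v in row] for row in out]
    return out

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel; the 3-sigma radius is an assumption."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(mask, sigma):
    """Blur every row with a 1D Gaussian, clamping indices at the borders."""
    if sigma <= 0:
        return [[float(v) for v in row] for row in mask]
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    width = len(mask[0])
    return [
        [
            sum(k * row[min(max(x + j - radius, 0), width - 1)]
                for j, k in enumerate(kernel))
            for x in range(width)
        ]
        for row in mask
    ]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def soften_mask(mask, blur_x, blur_y, invert=False):
    """Binarize, then blur horizontally and vertically with separate sigmas."""
    hard = binarize(mask, invert)
    horizontal = blur_rows(hard, blur_x)
    # Blurring the transposed mask along rows is a vertical blur.
    return transpose(blur_rows(transpose(horizontal), blur_y))
```

Using separate horizontal and vertical sigmas lets the edge softness differ per axis, which is what the two blur parameters control.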
Usage
This preprocessing stage is automatically invoked as the first step of the img2img generation pipeline. Understanding it is important for:
- Diagnosing artifacts at inpainting boundaries (adjust mask_blur parameters)
- Understanding resolution effects in inpaint-full-res mode
- Choosing the correct inpainting fill mode for the desired result
- Understanding VAE encoding precision issues (NaN detection, dtype auto-correction)
Theoretical Basis
The VAE encoder maps from pixel space to latent space according to:
z = E(x) where E is the encoder of the VAE
x in R^{B x 3 x H x W} (pixel space, normalized to [-1, 1])
z in R^{B x C x H/f x W/f} (latent space, f=8, C=4)
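As a quick check of the shape and range arithmetic above, the following sketch computes the latent shape for a given image size and maps an 8-bit pixel value into the encoder's input range. The function names are hypothetical, and the requirement that dimensions be divisible by f is an assumption of this example.

```python
def latent_shape(batch, height, width, f=8, channels=4):
    """Return (B, C, H/f, W/f) for a pixel-space input of B x 3 x H x W."""
    if height % f or width % f:
        # Assumption of this sketch: dimensions are multiples of f.
        raise ValueError("height and width must be divisible by f")
    return (batch, channels, height // f, width // f)

def to_encoder_range(value_8bit):
    """Map an 8-bit value: [0, 255] -> [0, 1] -> [-1, 1]."""
    unit = value_8bit / 255.0
    return unit * 2.0 - 1.0
```

For example, `latent_shape(1, 512, 768)` gives `(1, 4, 64, 96)`, so a 512 x 768 image is denoised as a 64 x 96 grid with 4 channels.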
A standard VAE encoder is stochastic: it parameterizes a Gaussian posterior and draws a sample from it. For the Stable Diffusion VAE the posterior variance is typically very small, so the encoding stays close to the posterior mean (which is also its mode) and is effectively deterministic in practice.
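The difference between sampling from a diagonal Gaussian posterior and taking its mode can be sketched as follows. The function names and the log-variance parameterization are assumptions of this illustration, not the VAE's actual API.

```python
import math
import random

def posterior_mode(mean, logvar):
    """The mode of a Gaussian is its mean, so the variance is ignored."""
    return list(mean)

def posterior_sample(mean, logvar, rng):
    """Reparameterized sample: mean + std * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, logvar)]
```

When the log-variance is strongly negative the standard deviation is tiny, so a sample is practically indistinguishable from the mode.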
For the mask transformation from pixel to latent space:
Given pixel mask M of size H x W with values in [0, 1]:
M_latent = resize(M, H/f, W/f) # bilinear downsampling
M_latent = round(M_latent) # if mask_round is True
mask = 1 - M_latent # regions to preserve
nmask = M_latent # regions to regenerate
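A minimal single-channel sketch of these steps in plain Python follows. Block averaging stands in for the resize call, and the helper names (`downsample_mask`, `latent_masks`, `apply_latent_fill`) are hypothetical. The last helper illustrates how fill modes 2 and 3 modify the latent using the two masks.

```python
def downsample_mask(pixel_mask, f=8):
    """Average each f x f block; a stand-in for the resize step (assumption)."""
    rows, cols = len(pixel_mask) // f, len(pixel_mask[0]) // f
    return [
        [
            sum(pixel_mask[y * f + dy][x * f + dx]
                for dy in range(f) for dx in range(f)) / (f * f)
            for x in range(cols)
        ]
        for y in range(rows)
    ]

def latent_masks(pixel_mask, f=8, mask_round=True):
    """Return (mask, nmask): 1.0 marks preserved and regenerated cells."""
    m_latent = downsample_mask(pixel_mask, f)
    if mask_round:
        m_latent = [[float(round(v)) for v in row] for row in m_latent]
    mask = [[1.0 - v for v in row] for row in m_latent]
    return mask, m_latent

def apply_latent_fill(latent, mask, nmask, noise, fill_mode):
    """Mode 2 mixes noise into the masked area, mode 3 zeroes it."""
    if fill_mode == 2:
        return [[l * m + n * nm for l, m, n, nm in zip(lr, mr, nr, nmr)]
                for lr, mr, nr, nmr in zip(latent, mask, noise, nmask)]
    if fill_mode == 3:
        return [[l * m for l, m in zip(lr, mr)]
                for lr, mr in zip(latent, mask)]
    return latent
```

Because mask and nmask always sum to 1.0 in every cell, the mode 2 expression is a per-cell blend between the encoded image and the noise.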
The crop region computation for inpaint-full-res:
crop_region = get_crop_region(mask, padding)
crop_region = expand_crop_region(crop_region, target_w, target_h, img_w, img_h)
cropped_image = image[y1:y2, x1:x2]
cropped_image = resize(cropped_image, target_w, target_h)
This ensures the diffusion model processes the masked region at the full target resolution rather than at a fraction of it, which significantly improves inpainting quality for small masked areas.
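The two-step crop computation can be sketched as below. These are simplified stand-ins with hypothetical names (`mask_bbox`, `grow_to_aspect`), not the webui functions; returning the full image for an empty mask and the centering rule used when growing the box are assumptions of this example.

```python
import math

def mask_bbox(mask, pad=0):
    """Bounding box (x1, y1, x2, y2) of nonzero mask pixels, padded and clamped."""
    h, w = len(mask), len(mask[0])
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return (0, 0, w, h)  # assumption: an empty mask selects the whole image
    return (max(min(xs) - pad, 0), max(min(ys) - pad, 0),
            min(max(xs) + 1 + pad, w), min(max(ys) + 1 + pad, h))

def grow_to_aspect(region, target_w, target_h, img_w, img_h):
    """Widen or heighten the box toward the target aspect ratio, inside the image."""
    x1, y1, x2, y2 = region
    w, h = x2 - x1, y2 - y1
    ratio = target_w / target_h
    if w / h < ratio:  # too narrow: grow horizontally
        new_w = min(img_w, math.ceil(h * ratio))
        x1 = max(0, min(x1 - (new_w - w) // 2, img_w - new_w))
        x2 = x1 + new_w
    elif w / h > ratio:  # too wide: grow vertically
        new_h = min(img_h, math.ceil(w / ratio))
        y1 = max(0, min(y1 - (new_h - h) // 2, img_h - new_h))
        y2 = y1 + new_h
    return (x1, y1, x2, y2)
```

Matching the crop's aspect ratio to the target before resizing avoids stretching the cropped content when it is scaled to the processing resolution.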
For color correction calibration, the source image is converted to the LAB color space before encoding and stored as the reference for histogram matching during post-processing:
correction_target = RGB_to_LAB(source_image)
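For reference, the RGB-to-LAB conversion itself can be written out per pixel using the standard sRGB to CIE LAB formulas with a D65 white point. This is a plain-Python sketch with hypothetical function names; the library call and the value scaling used by the webui may differ.

```python
def srgb_to_lab(r, g, b):
    """Convert one 8-bit sRGB pixel to CIE LAB (D65 white point)."""
    def linearize(c):
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    def f(t):
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    rl, gl, bl = linearize(r), linearize(g), linearize(b)
    # Linear sRGB -> CIE XYZ
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl
    # Normalize by the D65 reference white, then apply the LAB nonlinearity
    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))

def correction_target(pixels):
    """Convert a 2D grid of (r, g, b) tuples to LAB, pixel by pixel."""
    return [[srgb_to_lab(*px) for px in row] for row in pixels]
```

LAB separates lightness (L) from the two color axes (a, b), which is why it is a common choice for matching color distributions between images.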