Principle: AUTOMATIC1111 Stable Diffusion WebUI High-Resolution Fix
| Knowledge Sources | |
|---|---|
| Domains | Diffusion Models, Image Upscaling, Multi-Pass Generation |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
High-resolution fix is a two-pass generation technique: it first generates at the model's native resolution to establish composition, then upscales and partially denoises at the target resolution. This avoids the compositional artifacts caused by direct high-resolution generation.
Description
Stable Diffusion models (particularly SD1.x) are trained at a specific resolution, typically 512x512 pixels. When generating directly at significantly higher resolutions (e.g., 1024x1024 or 1024x1536), the model tends to produce characteristic artifacts:
- Duplicate subjects -- The model may generate two or more copies of the main subject, as if tiling
- Anatomical distortions -- Limbs, faces, and body proportions become severely distorted
- Compositional incoherence -- The overall scene layout breaks down
These artifacts occur because the model's UNet was trained with a fixed receptive field relative to the training resolution. At higher resolutions, the same receptive field covers a proportionally smaller area of the image, causing the model to treat different regions as separate compositions.
The high-resolution fix (hires fix) solves this by splitting generation into two denoising passes with an intermediate upscaling step:
- First pass (composition) -- Generate at or near the model's native resolution (e.g., 512x512) to establish a coherent composition
- Upscale -- Scale the first-pass result to the target high resolution using either latent-space interpolation or a pixel-space upscaler
- Second pass (refinement) -- Denoise the upscaled result at a reduced denoising strength (typically 0.4-0.7) to add fine detail while preserving the established composition
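The three steps above can be sketched as a single control-flow function. The `generate`, `upscale`, and `refine` callables here are hypothetical stand-ins for the actual txt2img, upscaler, and img2img stages, not real webui APIs:

```python
def hires_fix(generate, upscale, refine, base_size=(512, 512),
              scale=2.0, denoising_strength=0.5):
    """Two-pass hires-fix control flow (illustrative sketch)."""
    # Pass 1: establish a coherent composition at native resolution.
    low_res = generate(*base_size)

    # Intermediate step: upscale to the target resolution
    # (in latent space or pixel space; see "Upscaling Methods").
    target = (int(base_size[0] * scale), int(base_size[1] * scale))
    upscaled = upscale(low_res, target)

    # Pass 2: partial denoise at reduced strength to add fine detail
    # while preserving the established composition.
    return refine(upscaled, denoising_strength)
```

The key design point is that only the second pass runs at the expensive high resolution, and it starts from the upscaled image rather than from pure noise.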
Usage
Hires fix is used whenever the desired output resolution significantly exceeds the model's native training resolution. Common scenarios include:
- Generating wallpaper-sized images (1920x1080 or higher) from SD1.x models
- Creating detailed portraits at resolutions suitable for printing
- Producing images with fine detail that would be lost at 512x512
The technique is a trade-off: the second pass runs at the higher resolution, so total generation time at least doubles, but quality at high resolutions improves dramatically.
Theoretical Basis
Why Direct High-Resolution Fails
The UNet's convolutional layers and attention mechanisms have effective receptive fields calibrated to the training resolution. At resolution R_train, the deepest layers of the UNet can "see" the entire image. At resolution R > R_train, the same layers only see a portion:
Effective coverage = R_train / R
At 2x resolution: each UNet pass "sees" approximately 1/4 of the image area
At 3x resolution: each UNet pass "sees" approximately 1/9 of the image area
This causes the model to generate independent compositions in different spatial regions.
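The coverage arithmetic above can be written out directly (the function name is illustrative; coverage is per-axis, so the area fraction is its square):

```python
def effective_coverage(train_res, gen_res):
    """Fraction of the image (per axis) that the deepest UNet layers
    'see' when generating at gen_res with a model trained at train_res."""
    return train_res / gen_res

# At 2x resolution: 1/2 per axis, so ~1/4 of the image area.
area_2x = effective_coverage(512, 1024) ** 2   # 0.25
# At 3x resolution: 1/3 per axis, so ~1/9 of the image area.
area_3x = effective_coverage(512, 1536) ** 2
```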
Two-Pass Denoising
The hires fix leverages the img2img principle (SDEdit): given an upscaled image with correct global composition but lacking fine detail, partial denoising can add detail while preserving structure.
The denoising strength parameter controls this trade-off:
denoising_strength = 0.0 -> No change (output = upscaled input)
denoising_strength = 0.5 -> Add significant detail, mostly preserve composition
denoising_strength = 1.0 -> Full re-generation (composition may change)
The second pass starts from a noise level corresponding to denoising_strength on the noise schedule, skipping the early high-noise steps.
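A minimal sketch of this schedule truncation, following the common img2img convention in which only the last `strength × steps` steps of the schedule are executed (exact rounding varies between implementations):

```python
def second_pass_steps(total_steps, denoising_strength):
    """Return (steps_run, steps_skipped) for the second pass.

    The pass enters the noise schedule at the level implied by
    denoising_strength, skipping the early high-noise steps that
    would otherwise destroy the first-pass composition.
    """
    steps_run = min(int(total_steps * denoising_strength), total_steps)
    return steps_run, total_steps - steps_run

# At strength 0.5 with a 20-step schedule, the first 10 steps are skipped.
second_pass_steps(20, 0.5)  # → (10, 10)
```

At strength 1.0 the full schedule runs, which is why composition may change completely, and at strength 0.0 no steps run at all.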
Upscaling Methods
The intermediate upscaling can use:
- Latent space upscalers -- Nearest-neighbor, bilinear, or bicubic interpolation directly in latent space (fast, no decoding/re-encoding needed)
- Pixel space upscalers -- Decode to pixels, upscale with a neural upscaler (ESRGAN, SwinIR, etc.) or traditional algorithm (Lanczos), then re-encode to latent space (higher quality, more compute)
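To illustrate the latent-space option, nearest-neighbor interpolation is simply a repeat of each latent cell along both spatial axes. The shapes assume SD1.x, whose VAE produces a 4-channel latent at 1/8 of pixel resolution (4×64×64 for a 512×512 image):

```python
import numpy as np

def latent_upscale_nearest(latent, scale=2):
    """Nearest-neighbor upscaling directly in latent space: each latent
    cell is repeated `scale` times along both spatial axes, with no
    VAE decode/re-encode round trip."""
    # latent shape: (channels, height, width), e.g. (4, 64, 64) for SD1.x
    return latent.repeat(scale, axis=1).repeat(scale, axis=2)

lat = np.zeros((4, 64, 64), dtype=np.float32)
up = latent_upscale_nearest(lat, scale=2)
# → shape (4, 128, 128), i.e. a 1024x1024 image after VAE decoding
```

Because the upscaled latent is blocky, latent upscaling relies on the second denoising pass to smooth and detail the result, which is why higher denoising strengths (around 0.5 and above) are usually needed with latent upscalers than with pixel-space ones.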