
Principle:AUTOMATIC1111 Stable Diffusion WebUI Image Upscaling

From Leeroopedia


Knowledge Sources
Domains Super-Resolution, Deep Learning, Computer Vision, Image Processing
Last Updated 2026-02-08 00:00 GMT

Overview

Image upscaling (super-resolution) is the process of increasing the spatial resolution of an image beyond its original pixel dimensions while reconstructing plausible high-frequency detail that was not present in the low-resolution input.

Description

Classical upscaling methods such as bilinear, bicubic, and Lanczos interpolation produce smooth results but cannot recover fine details lost at lower resolutions. Deep learning-based super-resolution models learn to hallucinate realistic high-frequency content (textures, edges, fine structures) from training on paired low-resolution and high-resolution image datasets.
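To make the distinction concrete, the following sketch implements classical bilinear interpolation from scratch in numpy (the function name and the toy 2x2 input are illustrative, not from any particular library). Every output pixel is a weighted average of its four nearest input pixels, which is why interpolation smooths but can never introduce detail absent from the input.

```python
import numpy as np

def bilinear_upscale(img, s):
    """Classical bilinear interpolation by integer factor s: smooth, no new detail."""
    h, w = img.shape
    # Sample coordinates in the input grid (pixel-center aligned), clamped to bounds
    ys = np.clip((np.arange(h * s) + 0.5) / s - 0.5, 0, h - 1)
    xs = np.clip((np.arange(w * s) + 0.5) / s - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    # Blend the four surrounding input pixels for every output pixel
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

img = np.array([[0.0, 1.0], [1.0, 0.0]])  # toy 2x2 checkerboard
up = bilinear_upscale(img, 2)             # 4x4 result, values only interpolated
```

Note that every value in the output lies between the minimum and maximum of the input: interpolation redistributes existing intensities, whereas a learned super-resolution model can synthesize texture outside that envelope.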

The major families of deep learning upscalers used in Stable Diffusion workflows include:

  • ESRGAN (Enhanced Super-Resolution GAN): A generative adversarial network that uses a Residual-in-Residual Dense Block (RRDB) architecture without batch normalization. It optimizes a combination of pixel loss, perceptual loss (VGG feature matching), and adversarial loss to produce photo-realistic textures.
  • Real-ESRGAN: An extension of ESRGAN trained with a high-order degradation pipeline that synthesizes realistic compression artifacts, blur, noise, and resize chains. This makes it effective on "real-world" images rather than just synthetically downscaled ones.
  • SwinIR: A Transformer-based architecture that uses Swin Transformer blocks with shifted window self-attention. It captures long-range dependencies more effectively than convolutional approaches while maintaining computational efficiency through local windowed attention.
  • DAT (Dual Aggregation Transformer): Extends the Transformer approach with dual aggregation of spatial and channel attention, enabling more effective feature extraction for super-resolution tasks.
  • HAT (Hybrid Attention Transformer): Combines channel attention with window-based self-attention and introduces a cross-window information interaction mechanism to activate more pixels during reconstruction.

Usage

Use image upscaling when:

  • Generated images need to be output at higher resolutions than the model's native generation size (e.g., upscaling a 512x512 generation to 2048x2048)
  • Enhancing detail in images that were generated or captured at insufficient resolution
  • Preparing images for print or high-DPI display applications
  • As a postprocessing step after generation but before face restoration or other detail-sensitive operations

Theoretical Basis

Super-Resolution Problem Formulation

Given a low-resolution image I_LR of size H x W, the goal is to produce a high-resolution image I_HR of size sH x sW where s is the scale factor:

I_HR = F(I_LR; theta)

where F is the learned upscaling model with parameters theta.
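The shape contract above holds for any upscaler F. As a minimal stand-in, nearest-neighbour replication of each pixel (via a Kronecker product) satisfies the same input/output relationship; the sizes below are arbitrary.

```python
import numpy as np

s = 4                                   # scale factor
I_LR = np.random.rand(8, 8)             # toy low-resolution image, H x W
# Trivial stand-in for F: replicate each pixel into an s x s block
I_HR = np.kron(I_LR, np.ones((s, s)))   # result is sH x sW
```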

GAN-Based Training Objective (ESRGAN)

The ESRGAN training loss combines three components:

L_total = L_pixel + lambda_perceptual * L_perceptual + lambda_adversarial * L_adversarial

L_pixel       = ||F(I_LR) - I_HR||_1             (L1 pixel loss)
L_perceptual  = ||phi(F(I_LR)) - phi(I_HR)||_2   (VGG feature loss)
L_adversarial = -log(D(F(I_LR)))                 (GAN loss)

where phi is a pre-trained VGG feature extractor and D is the discriminator network.
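A numpy sketch of how the three terms combine is below. The weights, the identity "feature extractor", and the sigmoid "discriminator" are all illustrative placeholders; real ESRGAN training uses a pre-trained VGG network for phi, a relativistic discriminator, and carefully tuned loss weights.

```python
import numpy as np

def esrgan_loss(sr, hr, phi, d, lam_perc=1.0, lam_adv=5e-3):
    """Combine ESRGAN's three loss terms; the weights here are placeholders."""
    l_pixel = np.mean(np.abs(sr - hr))           # L1 pixel loss
    l_perc = np.mean((phi(sr) - phi(hr)) ** 2)   # feature-space (VGG-style) loss
    l_adv = -np.log(d(sr) + 1e-8)                # generator adversarial loss
    return l_pixel + lam_perc * l_perc + lam_adv * l_adv

# Toy stand-ins: identity features and a sigmoid-of-mean discriminator
phi = lambda x: x
d = lambda x: 1.0 / (1.0 + np.exp(-x.mean()))
sr = np.random.rand(16, 16)   # "super-resolved" output F(I_LR)
hr = np.random.rand(16, 16)   # ground-truth I_HR
loss = esrgan_loss(sr, hr, phi, d)
```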

Tiled Upscaling for VRAM Management

Large images cannot fit into GPU VRAM in a single forward pass. Tiled upscaling addresses this by splitting the image into overlapping tiles, upscaling each tile independently, and blending them back together:

def tiled_upscale(image, model, tile_size, overlap):
    # Split the image into overlapping tiles that each fit in VRAM
    grid = split_into_tiles(image, tile_size, overlap)
    upscaled_tiles = []
    for tile in grid:
        upscaled_tiles.append(model(tile))
    # Reassemble at the model's scale, blending the overlap regions
    return blend_tiles(upscaled_tiles, model.scale, overlap)

Two blending strategies are common:

  • Pillow-space blending: Tiles are converted back to PIL Images and pasted together using alpha masks at the overlap regions. This is simpler, but it does not smoothly weight the contributions of multiple tiles across an overlap.
  • Tensor-space blending with weight accumulation: Tile outputs are accumulated in a result tensor with an associated weight tensor. Overlapping regions receive contributions from multiple tiles. The final image is produced by dividing the accumulated values by the accumulated weights, producing smooth transitions.

The tile size is typically configured between 128 and 512 pixels, with overlap of 8 to 64 pixels. Smaller tiles reduce peak VRAM usage but increase processing time due to overlap redundancy.
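The tensor-space strategy can be sketched as follows. This is a simplified version with uniform weights (so overlaps are plainly averaged); production implementations often feather the weights toward tile edges for smoother seams, and the helper names here are invented for the sketch.

```python
import numpy as np

def _positions(size, tile, step):
    # Tile start offsets along one axis; the last tile is clamped to the edge
    pos = list(range(0, size - tile + 1, step))
    if pos[-1] != size - tile:
        pos.append(size - tile)
    return pos

def tiled_upscale(image, model, scale, tile=4, overlap=2):
    """Weight-accumulation blending: overlapping outputs are summed, then averaged."""
    h, w = image.shape
    out = np.zeros((h * scale, w * scale))   # accumulated tile outputs
    acc = np.zeros_like(out)                 # accumulated weights
    step = tile - overlap
    for y in _positions(h, tile, step):
        for x in _positions(w, tile, step):
            patch = model(image[y:y + tile, x:x + tile])
            ys, xs = y * scale, x * scale
            out[ys:ys + tile * scale, xs:xs + tile * scale] += patch
            acc[ys:ys + tile * scale, xs:xs + tile * scale] += 1.0
    return out / acc   # every output pixel is covered by at least one tile

# Demo with a nearest-neighbour stand-in "model" (2x replication via np.kron)
model = lambda t: np.kron(t, np.ones((2, 2)))
image = np.arange(64, dtype=float).reshape(8, 8)
result = tiled_upscale(image, model, scale=2, tile=4, overlap=2)
```

Because the stand-in model is deterministic, overlapping tiles produce identical values there and the averaged result matches a single whole-image pass exactly; with a real neural upscaler the tiles differ slightly at their borders, which is precisely what the weighted averaging smooths out.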

Iterative Upscaling

When the desired scale factor exceeds the model's native scale (typically 4x), iterative application is used. The image is upscaled multiple times until it reaches or exceeds the target dimensions, then resized down to the exact target using Lanczos interpolation. A maximum of 3 iterations prevents infinite loops when the model fails to increase dimensions.
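The loop above can be sketched as follows. The resize helper here uses nearest-neighbour sampling purely as a stand-in for the Lanczos step, and the function names are invented for illustration.

```python
import numpy as np

def resize_to(img, out_h, out_w):
    # Stand-in for Lanczos resampling: nearest-neighbour index sampling
    ys = np.arange(out_h) * img.shape[0] // out_h
    xs = np.arange(out_w) * img.shape[1] // out_w
    return img[ys][:, xs]

def iterative_upscale(img, model, target_h, target_w, max_iters=3):
    """Apply the model until the target is reached or exceeded, then resize down."""
    for _ in range(max_iters):
        if img.shape[0] >= target_h and img.shape[1] >= target_w:
            break
        prev_shape = img.shape
        img = model(img)
        if img.shape == prev_shape:   # model failed to enlarge; stop looping
            break
    return resize_to(img, target_h, target_w)

model = lambda t: np.kron(t, np.ones((4, 4)))   # toy 4x upscaler
img = np.random.rand(8, 8)
# 8 -> 32 (still short of 50) -> 128 (exceeds 50), then resized to exactly 50x50
result = iterative_upscale(img, model, target_h=50, target_w=50)
```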

Related Pages

Implemented By
