Heuristic:AUTOMATIC1111 Stable Diffusion WebUI VRAM Management Strategies

From Leeroopedia




Knowledge Sources
Domains: Optimization, Memory_Management
Last Updated: 2026-02-08 08:00 GMT

Overview

A three-tier VRAM optimization strategy (`--medvram-sdxl`, `--medvram`, `--lowvram`) that enables Stable Diffusion generation on GPUs with as little as 4GB of VRAM by dynamically swapping model modules between CPU and GPU.

Description

The WebUI implements a progressive VRAM optimization system with three levels of aggressiveness. At its core is a forward-hook-based module-swapping mechanism: large model components (text encoder, VAE, UNet) are kept on the CPU and moved to the GPU only when their `forward()` method is about to be called. The transfers are managed by callbacks registered with `register_forward_pre_hook`. The VAE needs special handling because it is invoked through `encode()` and `decode()` rather than `forward()`, so those methods are wrapped manually instead.
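
The mechanism can be sketched in a few lines of PyTorch. This is a simplified illustration, not the WebUI's actual code: the "GPU" device falls back to CPU so the sketch runs anywhere, and the `resident` attribute is bookkeeping added for the demo only.

```python
import torch
import torch.nn as nn

cpu = torch.device("cpu")
# Stand-in for the real target device; falls back to CPU so this runs anywhere.
gpu = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

module_in_gpu = None  # the one module currently allowed to be GPU-resident

def send_me_to_gpu(module, _inputs):
    """forward_pre_hook: move this module in, evict the previous resident."""
    global module_in_gpu
    if module_in_gpu is module:
        return
    if module_in_gpu is not None:
        module_in_gpu.to(cpu)
        module_in_gpu.resident = False  # demo-only bookkeeping
    module.to(gpu)
    module.resident = True
    module_in_gpu = module

# Two toy "components" standing in for the text encoder and UNet
text_encoder = nn.Linear(8, 8)
unet = nn.Linear(8, 8)
for m in (text_encoder, unet):
    m.resident = False
    m.register_forward_pre_hook(send_me_to_gpu)

x = torch.randn(1, 8, device=gpu)
text_encoder(x)  # hook pulls the text encoder onto the GPU
unet(x)          # hook evicts the text encoder, pulls the UNet in
```

After the second call only `unet` is resident, demonstrating the all-but-one-on-CPU invariant.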

Usage

Use these optimizations when you encounter CUDA out of memory errors or when running on GPUs with limited VRAM (4-8GB). The choice between levels depends on your GPU:

  • --medvram-sdxl: Applies the medvram optimizations only when an SDXL model is loaded. Best for 8GB GPUs running SDXL.
  • --medvram: Keeps the UNet as a single GPU-resident unit but swaps other modules. Good for 6-8GB GPUs.
  • --lowvram: Splits even the UNet into individual blocks. Enables 4GB GPU operation at significant speed cost.
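
For reference, the flags are passed at launch time. A typical setup edits `COMMANDLINE_ARGS` in `webui-user.sh` (or `webui-user.bat` on Windows), following the upstream launcher's convention; pick one flag rather than combining them:

```shell
# webui-user.sh — choose ONE line matching your card, then launch as usual
export COMMANDLINE_ARGS="--medvram"         # 6-8GB GPUs
# export COMMANDLINE_ARGS="--lowvram"       # 4GB GPUs (significantly slower)
# export COMMANDLINE_ARGS="--medvram-sdxl"  # 8GB GPUs, applies only to SDXL

./webui.sh
```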

The Insight (Rule of Thumb)

  • Action: Choose the appropriate `--medvram` or `--lowvram` flag based on available VRAM.
  • Value: `--medvram` for 6-8GB cards, `--lowvram` for 4GB cards, `--medvram-sdxl` for 8GB cards with SDXL only.
  • Trade-off: `--medvram` adds minor speed overhead; `--lowvram` reduces speed significantly (each UNet block transfers CPU<->GPU individually). `--lowvram` also disables parallel processing of conditional/unconditional batches.
  • Key constraint: In lowvram mode, only one module resides on GPU at any time. This eliminates VRAM fragmentation but forces sequential execution.
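
The rule of thumb maps mechanically from VRAM size to flag. A hypothetical helper (the function name and exact thresholds are ours, not the WebUI's):

```python
def pick_vram_flag(vram_gb: float, sdxl: bool = False) -> str:
    """Map available VRAM to a launch flag per the rule of thumb above.

    Returns the empty string when no offloading flag is needed.
    """
    if vram_gb < 6:
        return "--lowvram"           # 4GB cards: per-block UNet swapping
    if vram_gb < 8:
        return "--medvram"           # 6-8GB cards: whole-module swapping
    if sdxl and vram_gb <= 8:
        return "--medvram-sdxl"      # 8GB cards running SDXL
    return ""                        # enough VRAM: no offloading


print(pick_vram_flag(4))             # --lowvram
print(pick_vram_flag(8, sdxl=True))  # --medvram-sdxl
```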

Reasoning

Stable Diffusion models have three major components: the text encoder (~500MB), the VAE (~300MB), and the UNet (~3.4GB for SD1.5, ~6.5GB for SDXL). Loading all simultaneously requires ~4-7GB+ VRAM. The module-swapping approach exploits the fact that these components are used sequentially during inference: text encoding happens first, then iterative UNet denoising, then VAE decoding. By keeping only the active component in VRAM, total peak usage is reduced to roughly the size of the largest single component.

The lowvram mode goes further by splitting the UNet's input_blocks, middle_block, output_blocks, and time_embed into individual hookable modules, reducing peak VRAM to roughly the size of the largest individual UNet block (~200-400MB).
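
A back-of-envelope check using the SD1.5 component sizes above (in MB; actual numbers vary with precision, so treat these as illustrative):

```python
# Approximate SD1.5 component sizes from the text, in MB
text_encoder, vae, unet = 500, 300, 3400

all_resident = text_encoder + vae + unet     # no offloading: everything on GPU
medvram_peak = max(text_encoder, vae, unet)  # medvram: one whole module at a time
lowvram_peak = 400                           # lowvram: largest single UNet block
                                             # (upper bound from the text)

print(all_resident, medvram_peak, lowvram_peak)  # 4200 3400 400
```

Each tier's peak usage is bounded by its largest resident unit, which is why lowvram's per-block splitting buys so much headroom.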

Code Evidence

Module swapping hook from `modules/lowvram.py:42-58`:

def send_me_to_gpu(module, _):
    """send this module to GPU; send whatever tracked module was previous in GPU to CPU;
    we add this as forward_pre_hook to a lot of modules and this way all but one of them will
    be in CPU
    """
    global module_in_gpu

    module = parents.get(module, module)

    if module_in_gpu == module:
        return

    if module_in_gpu is not None:
        module_in_gpu.to(cpu)

    module.to(devices.device)
    module_in_gpu = module

VAE special handling from `modules/lowvram.py:60-74`:

# first_stage_model does not use forward(), it uses encode/decode, so
# register_forward_pre_hook is useless here, and we just replace those methods

first_stage_model = sd_model.first_stage_model
first_stage_model_encode = sd_model.first_stage_model.encode
first_stage_model_decode = sd_model.first_stage_model.decode

def first_stage_model_encode_wrap(x):
    send_me_to_gpu(first_stage_model, None)
    return first_stage_model_encode(x)
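
The `decode()` path is wrapped the same way. A dependency-free toy showing the method-replacement pattern (`ToyVAE` and the `moved` list are stand-ins for the real model and device transfer, not upstream code):

```python
class ToyVAE:
    """Stand-in for first_stage_model: exposes encode/decode, not forward()."""
    def encode(self, x):
        return x + 1
    def decode(self, z):
        return z - 1

moved = []  # records swap requests in place of real device transfers

def send_me_to_gpu(module, _inputs):
    moved.append(module)  # the real hook does module.to(device) here

vae = ToyVAE()
orig_encode, orig_decode = vae.encode, vae.decode

def encode_wrap(x):
    send_me_to_gpu(vae, None)  # pull the VAE in before using it
    return orig_encode(x)

def decode_wrap(z):
    send_me_to_gpu(vae, None)
    return orig_decode(z)

# Replace the bound methods, exactly as the snippet above does for the real VAE
vae.encode, vae.decode = encode_wrap, decode_wrap
```

Every `vae.encode(...)` / `vae.decode(...)` call now triggers the swap first, then delegates to the original method.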

Lowvram UNet block splitting from `modules/lowvram.py:146-161`:

# the third remaining model is still too big for 4 GB, so we also do the same
# for its submodules so that only one of them is in GPU at a time
stored = diff_model.input_blocks, diff_model.middle_block, diff_model.output_blocks, diff_model.time_embed
diff_model.input_blocks, diff_model.middle_block, diff_model.output_blocks, diff_model.time_embed = None, None, None, None
sd_model.model.to(devices.device)
diff_model.input_blocks, diff_model.middle_block, diff_model.output_blocks, diff_model.time_embed = stored

# install hooks for bits of third model
diff_model.time_embed.register_forward_pre_hook(send_me_to_gpu)
for block in diff_model.input_blocks:
    block.register_forward_pre_hook(send_me_to_gpu)
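
The `stored = ...` / set-to-`None` / `.to(device)` / restore dance above works because `nn.Module.to()` only visits registered submodules, so temporarily hiding the big ones keeps them off the device. A toy sketch of the same trick (CPU stands in for the real device; `ToyUNet` is ours, not upstream code):

```python
import torch
import torch.nn as nn

class ToyUNet(nn.Module):
    """Toy diffusion model: one small part plus a large block list."""
    def __init__(self):
        super().__init__()
        self.time_embed = nn.Linear(4, 4)  # small: moves with the parent
        self.input_blocks = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])

model = ToyUNet()
device = torch.device("cpu")  # stand-in for devices.device

stored = model.input_blocks
model.input_blocks = None     # hide the big part from .to()
model.to(device)              # moves only time_embed (and any buffers)
model.input_blocks = stored   # reattach; blocks stay on their old device
```

The restored blocks are then handled individually by the `register_forward_pre_hook` calls shown above, so each one migrates only when it is actually executed.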
