Heuristic: AUTOMATIC1111 Stable Diffusion WebUI VRAM Management Strategies
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Memory_Management |
| Last Updated | 2026-02-08 08:00 GMT |
Overview
Three-tier VRAM optimization strategy (medvram-sdxl, medvram, lowvram) that enables Stable Diffusion generation on GPUs with as little as 4GB VRAM by dynamically swapping model modules between CPU and GPU.
Description
The WebUI implements a progressive VRAM optimization system with three levels of aggressiveness. At the core is a forward-hook-based module swapping mechanism: large model components (text encoder, VAE, UNet) are kept on CPU and moved to GPU only when their `forward()` method is called. This is controlled by `forward_pre_hook` callbacks that automatically manage the transfers. The VAE requires special handling because it uses `encode()` and `decode()` methods directly instead of `forward()`, so these are manually wrapped.
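The swap-on-forward mechanism can be sketched in a few lines of PyTorch. This is a simplified illustration rather than the WebUI source: GPU residency is simulated with a dict so the sketch runs on CPU-only machines, where the real code calls `module.to(devices.device)` and `module_in_gpu.to(cpu)`.

```python
import torch
import torch.nn as nn

# Simplified sketch of the swap-on-forward idea (not the WebUI source).
# GPU residency is simulated with a dict so the example runs without CUDA.
module_in_gpu = None
resident = {}  # module -> pretend "is on GPU" flag

def send_me_to_gpu(module, _inputs):
    """forward_pre_hook: evict the previously resident module, admit this one."""
    global module_in_gpu
    if module_in_gpu is module:
        return
    if module_in_gpu is not None:
        resident[module_in_gpu] = False   # would be module_in_gpu.to("cpu")
    resident[module] = True               # would be module.to("cuda")
    module_in_gpu = module

text_encoder, unet, vae = nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4)
for m in (text_encoder, unet, vae):
    m.register_forward_pre_hook(send_me_to_gpu)

x = torch.zeros(1, 4)
text_encoder(x)   # text_encoder admitted
unet(x)           # text_encoder evicted, unet admitted
vae(x)            # unet evicted, vae admitted
print(sum(resident.values()))  # 1 -- only one module "on GPU" at a time
```

Because the hook fires before each `forward()`, inference order alone drives the swapping; no explicit scheduling is needed.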
Usage
Use these optimizations when you encounter CUDA out of memory errors or when running on GPUs with limited VRAM (4-8GB). The choice between levels depends on your GPU:
- `--medvram-sdxl`: Only applies to SDXL models. Best for 8GB GPUs running SDXL.
- `--medvram`: Keeps the UNet as a single GPU-resident unit but swaps other modules. Good for 6-8GB GPUs.
- `--lowvram`: Splits even the UNet into individual blocks. Enables 4GB GPU operation at significant speed cost.
The Insight (Rule of Thumb)
- Action: Choose the appropriate `--medvram` or `--lowvram` flag based on available VRAM.
- Value: `--medvram` for 6-8GB cards, `--lowvram` for 4GB cards, `--medvram-sdxl` for 8GB cards with SDXL only.
- Trade-off: `--medvram` adds minor speed overhead; `--lowvram` reduces speed significantly (each UNet block transfers CPU<->GPU individually). `--lowvram` also disables parallel processing of conditional/unconditional batches.
- Key constraint: In lowvram mode, only one module resides on GPU at any time. This eliminates VRAM fragmentation but forces sequential execution.
Reasoning
Stable Diffusion models have three major components: the text encoder (~500MB), the VAE (~300MB), and the UNet (~3.4GB for SD1.5, ~6.5GB for SDXL). Loading all simultaneously requires ~4-7GB+ VRAM. The module-swapping approach exploits the fact that these components are used sequentially during inference: text encoding happens first, then iterative UNet denoising, then VAE decoding. By keeping only the active component in VRAM, total peak usage is reduced to roughly the size of the largest single component.
The lowvram mode goes further by splitting the UNet's input_blocks, middle_block, output_blocks, and time_embed into individual hookable modules, reducing peak VRAM to roughly the size of the largest individual UNet block (~200-400MB).
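The peak-VRAM arithmetic can be made concrete using the approximate sizes quoted above (in MB, illustrative figures only):

```python
# Approximate component sizes from the text (MB); illustrative only.
sizes = {"text_encoder": 500, "vae": 300, "unet_sd15": 3400}

all_resident = sum(sizes.values())   # everything loaded at once
swapped_peak = max(sizes.values())   # one module resident at a time (medvram)
print(all_resident, swapped_peak)    # 4200 3400

# lowvram goes further: the UNet is split into blocks, so the peak drops
# to roughly the largest single block (~200-400 MB per the text).
```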
Code Evidence
Module swapping hook from `modules/lowvram.py:42-58`:
```python
def send_me_to_gpu(module, _):
    """send this module to GPU; send whatever tracked module was previous in GPU to CPU;
    we add this as forward_pre_hook to a lot of modules and this way all but one of them will
    be in CPU
    """
    global module_in_gpu

    module = parents.get(module, module)

    if module_in_gpu == module:
        return

    if module_in_gpu is not None:
        module_in_gpu.to(cpu)

    module.to(devices.device)
    module_in_gpu = module
```
VAE special handling from `modules/lowvram.py:60-74`:
```python
# first_stage_model does not use forward(), it uses encode/decode, so
# register_forward_pre_hook is useless here, and we just replace those methods
first_stage_model = sd_model.first_stage_model
first_stage_model_encode = sd_model.first_stage_model.encode
first_stage_model_decode = sd_model.first_stage_model.decode

def first_stage_model_encode_wrap(x):
    send_me_to_gpu(first_stage_model, None)
    return first_stage_model_encode(x)

def first_stage_model_decode_wrap(z):
    send_me_to_gpu(first_stage_model, None)
    return first_stage_model_decode(z)

first_stage_model.encode = first_stage_model_encode_wrap
first_stage_model.decode = first_stage_model_decode_wrap
```
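The method-wrapping pattern generalizes to any module whose work bypasses `forward()`. A runnable sketch with a stand-in class (not the real VAE) shows why replacing the bound methods works where hooks cannot:

```python
# Stand-in for the VAE: its work happens in encode(), not forward(),
# so forward_pre_hook never fires and the method is replaced instead.
class FakeFirstStage:
    def encode(self, x):
        return x + 1

calls = []
fs = FakeFirstStage()
orig_encode = fs.encode               # keep the original bound method

def encode_wrap(x):
    calls.append("send_me_to_gpu")    # stands in for send_me_to_gpu(fs, None)
    return orig_encode(x)

fs.encode = encode_wrap               # instance attribute shadows the class method
print(fs.encode(1), calls)            # 2 ['send_me_to_gpu']
```

Assigning to the instance attribute leaves the class untouched, so other model instances keep their original `encode`.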
Lowvram UNet block splitting from `modules/lowvram.py:146-161`:
```python
# the third remaining model is still too big for 4 GB, so we also do the same
# for its submodules so that only one of them is in GPU at a time
stored = diff_model.input_blocks, diff_model.middle_block, diff_model.output_blocks, diff_model.time_embed
diff_model.input_blocks, diff_model.middle_block, diff_model.output_blocks, diff_model.time_embed = None, None, None, None
sd_model.model.to(devices.device)
diff_model.input_blocks, diff_model.middle_block, diff_model.output_blocks, diff_model.time_embed = stored

# install hooks for bits of third model
diff_model.time_embed.register_forward_pre_hook(send_me_to_gpu)
for block in diff_model.input_blocks:
    block.register_forward_pre_hook(send_me_to_gpu)
diff_model.middle_block.register_forward_pre_hook(send_me_to_gpu)
for block in diff_model.output_blocks:
    block.register_forward_pre_hook(send_me_to_gpu)
```
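The stash/None/restore dance is what keeps the big blocks on CPU while the rest of the model moves: setting a registered submodule to `None` detaches it, so a subsequent `.to(device)` on the parent skips it. A minimal sketch with toy modules (not the WebUI objects) demonstrates the mechanics:

```python
import torch.nn as nn

# Setting a registered submodule to None detaches it, so .to(device) on
# the parent skips it; reassigning afterwards restores it (still on CPU).
model = nn.Module()
model.small = nn.Linear(2, 2)
model.big = nn.Linear(64, 64)

stored = model.big
model.big = None                                  # stash the big block
children = [name for name, _ in model.named_children()]
model.big = stored                                # restore it
print(children)  # ['small'] -- only the small part would have been moved
```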