
Heuristic:Huggingface Diffusers Memory Offloading Strategy

From Leeroopedia
Knowledge Sources
Domains: Optimization, Memory_Management
Last Updated: 2026-02-13 21:00 GMT

Overview

Memory optimization strategy using CPU offloading and VAE tiling/slicing to fit large diffusion models on limited GPU VRAM.

Description

Diffusers provides three progressive memory optimization levels: (1) Model CPU Offload moves entire model components between CPU and GPU based on the pipeline execution sequence, offering the best performance-memory trade-off; (2) Sequential CPU Offload moves individual layers to GPU during forward pass, achieving maximum memory savings at the cost of slower execution; (3) VAE Tiling/Slicing processes the VAE encode/decode in smaller patches to avoid OOM on the VAE component specifically. These strategies can be combined and are controlled via simple API calls on any pipeline.

Usage

Use when encountering CUDA out of memory errors during inference. Start with `enable_model_cpu_offload()` as it has the least performance impact. If still OOM, add VAE tiling. As a last resort, use `enable_sequential_cpu_offload()`.

The Insight (Rule of Thumb)

  • Action 1: Call `pipe.enable_model_cpu_offload()` — whole-model offloading with minimal speed impact.
  • Action 2: Call `pipe.enable_vae_tiling()` — process VAE in spatial patches if VAE decode causes OOM.
  • Action 3: Call `pipe.enable_sequential_cpu_offload()` — layer-by-layer offloading for maximum memory savings.
  • Ordering: The pipeline's `model_cpu_offload_seq` attribute defines component execution order (e.g., `"text_encoder->transformer->vae"`).
  • Trade-off: Model CPU offload adds ~10% latency; sequential offload can add ~50%+ latency.
  • Restriction: Cannot combine CPU offload with device mapping (`hf_device_map`). Call `reset_device_map()` first.
  • GPU Required: `enable_model_cpu_offload()` raises RuntimeError if no accelerator found.

Reasoning

Model CPU offload keeps only one model component in GPU memory at a time, moving others to CPU RAM. The key insight from `pipeline_utils.py` is that diffusion pipelines execute components sequentially (text encoder, then UNet/transformer, then VAE), so only one component needs GPU memory at any moment. The `model_cpu_offload_seq` class attribute explicitly defines this execution order for each pipeline class. Sequential offload goes further by offloading individual linear/conv layers, but the constant CPU-GPU transfers create significant overhead. VAE tiling is orthogonal — it reduces peak memory within a single component by processing spatial patches instead of the full image.

For video generation, additional memory strategies include VAE temporal tiling (processing frames in chunks of `decode_chunk_size`, typically 16 frames) and `enable_tiling()`, which tiles both spatially and temporally.
