Implementation:Huggingface Diffusers Video Memory Setup
| Field | Value |
|---|---|
| Type | API Doc |
| Overview | Concrete API calls for enabling model CPU offloading, VAE tiling, and VAE slicing on video generation pipelines |
| Domains | Video Generation, GPU Memory Optimization |
| Workflow | Video_Generation |
| Related Principle | Huggingface_Diffusers_Video_Memory_Management |
| Source | src/diffusers/pipelines/pipeline_utils.py:L1174-L1270, src/diffusers/models/autoencoders/autoencoder_kl_wan.py:L1086-L1114 |
| Last Updated | 2026-02-13 00:00 GMT |
Code Reference
enable_model_cpu_offload
Source: src/diffusers/pipelines/pipeline_utils.py:L1174-L1268
```python
def enable_model_cpu_offload(self, gpu_id: int | None = None, device: torch.device | str = None):
    """
    Offloads all models to CPU using accelerate, reducing memory usage with a low impact on
    performance. Compared to enable_sequential_cpu_offload, this method moves one whole model
    at a time to the accelerator when its forward method is called.
    """
    # ...
    self.to("cpu", silence_dtype_warnings=True)
    empty_device_cache(device.type)

    all_model_components = {k: v for k, v in self.components.items() if isinstance(v, torch.nn.Module)}

    self._all_hooks = []
    hook = None
    for model_str in self.model_cpu_offload_seq.split("->"):
        model = all_model_components.pop(model_str, None)
        if not isinstance(model, torch.nn.Module):
            continue
        _, hook = cpu_offload_with_hook(model, device, prev_module_hook=hook)
        self._all_hooks.append(hook)
```
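The hook chain built in the loop above can be sketched in plain Python (no torch or accelerate required): each wrapped model, when it is about to run, first asks the previous hook to offload its model back to CPU, then moves itself to the accelerator, so at most one model occupies GPU memory at a time. The names here (`FakeModel`, `OffloadHook`) are illustrative stand-ins, not part of the diffusers or accelerate API.

```python
class FakeModel:
    """Stand-in for a torch.nn.Module that just tracks its current device."""
    def __init__(self, name):
        self.name = name
        self.device = "cpu"

class OffloadHook:
    """Mimics the chaining behavior of accelerate's cpu_offload_with_hook."""
    def __init__(self, model, device, prev_hook=None):
        self.model = model
        self.device = device
        self.prev_hook = prev_hook

    def pre_forward(self):
        # Offload the previously used model before loading this one,
        # so only one model occupies the accelerator at a time.
        if self.prev_hook is not None:
            self.prev_hook.offload()
        self.model.device = self.device

    def offload(self):
        self.model.device = "cpu"

# Build the chain in pipeline order, as enable_model_cpu_offload does
# with model_cpu_offload_seq.split("->").
models = [FakeModel(n) for n in "text_encoder->transformer->vae".split("->")]
hooks, prev = [], None
for m in models:
    prev = OffloadHook(m, "cuda:0", prev_hook=prev)
    hooks.append(prev)

# Simulate one forward pass through the pipeline components in order.
for h in hooks:
    h.pre_forward()

# Only the last component used (the VAE) remains on the accelerator.
print([m.device for m in models])  # → ['cpu', 'cpu', 'cuda:0']
```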
enable_tiling (AutoencoderKLWan)
Source: src/diffusers/models/autoencoders/autoencoder_kl_wan.py:L1086-L1114
```python
def enable_tiling(
    self,
    tile_sample_min_height: int | None = None,
    tile_sample_min_width: int | None = None,
    tile_sample_stride_height: float | None = None,
    tile_sample_stride_width: float | None = None,
) -> None:
    """
    Enable tiled VAE decoding. When this option is enabled, the VAE will split the input
    tensor into tiles to compute decoding and encoding in several steps. This is useful for
    saving a large amount of memory and to allow processing larger images.
    """
    self.use_tiling = True
    self.tile_sample_min_height = tile_sample_min_height or self.tile_sample_min_height
    self.tile_sample_min_width = tile_sample_min_width or self.tile_sample_min_width
    self.tile_sample_stride_height = tile_sample_stride_height or self.tile_sample_stride_height
    self.tile_sample_stride_width = tile_sample_stride_width or self.tile_sample_stride_width
```
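Because the stride (192 px by default) is smaller than the minimum tile size (256 px), adjacent tiles overlap by 64 px, and that overlap is blended to hide seams. The resulting tile count per axis can be estimated with a few lines of arithmetic; this is an illustrative sketch of the tiling scheme, not the exact library code, and `tile_grid` is a hypothetical helper name.

```python
import math

def tile_grid(size: int, tile_min: int, stride: int) -> int:
    """Number of tiles along one spatial axis, assuming a tile is placed
    every `stride` pixels until the frame is covered (sketch of the
    diffusers tiling scheme, not the exact library implementation)."""
    if size <= tile_min:
        return 1  # frame fits in a single tile along this axis
    return math.ceil((size - tile_min) / stride) + 1

# Defaults from enable_tiling: 256 px tiles with a 192 px stride,
# i.e. a 64 px overlap between neighboring tiles.
print(256 - 192)                  # → 64
print(tile_grid(720, 256, 192))   # tiles along a 720 px height → 4
print(tile_grid(1280, 256, 192))  # tiles along a 1280 px width → 7
```

Larger strides mean fewer tiles (and less redundant compute) but less overlap for seam blending, which is the trade-off the HunyuanVideo example below adjusts.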
Key Parameters
| Method | Parameter | Description | Default |
|---|---|---|---|
| enable_model_cpu_offload | gpu_id | GPU device ID to use | 0 |
| enable_model_cpu_offload | device | PyTorch device type string | Auto-detected |
| enable_tiling | tile_sample_min_height | Minimum tile height in pixels | 256 |
| enable_tiling | tile_sample_min_width | Minimum tile width in pixels | 256 |
| enable_tiling | tile_sample_stride_height | Stride between vertical tiles | 192 |
| enable_tiling | tile_sample_stride_width | Stride between horizontal tiles | 192 |
I/O Contract
enable_model_cpu_offload
Inputs:
- Pipeline instance with all model components loaded
Outputs:
- Modified pipeline where all model modules have been moved to CPU and wrapped with `accelerate` hooks for automatic GPU migration
Side Effects:
- Clears GPU memory cache
- Sets `self._all_hooks` with the offload hook chain
- Sets `self._offload_device` and `self._offload_gpu_id`
enable_tiling
Inputs:
- VAE instance (`AutoencoderKLWan`, `AutoencoderKLHunyuanVideo`, etc.)
Outputs:
- Modified VAE with `use_tiling = True` and configured tile dimensions
Side Effects:
- `_decode` and `_encode` methods will route to `tiled_decode`/`tiled_encode` when spatial dimensions exceed tile minimums
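The routing condition described above can be expressed as a small predicate. This is an illustrative sketch of the condition, not the exact library code, and `should_tile` is a hypothetical name.

```python
def should_tile(use_tiling: bool, height: int, width: int,
                tile_min_height: int, tile_min_width: int) -> bool:
    # Tiled decode/encode is used only when tiling is enabled AND the
    # sample is larger than a single tile in at least one dimension.
    return use_tiling and (height > tile_min_height or width > tile_min_width)

# A 720x1280 frame exceeds the default 256 px tile minimums, so it tiles;
# a 256x256 frame fits in one tile and takes the plain decode path.
print(should_tile(True, 720, 1280, 256, 256))   # → True
print(should_tile(True, 256, 256, 256, 256))    # → False
print(should_tile(False, 720, 1280, 256, 256))  # → False
```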
External Dependencies
- `accelerate >= 0.17.0` (for `cpu_offload_with_hook`)
- An accelerator device: CPU offloading requires a CUDA, MPS, or XPU device
Usage Examples
Minimal Memory Configuration for Wan 14B
```python
import torch
from diffusers import AutoencoderKLWan, WanPipeline

model_id = "Wan-AI/Wan2.1-T2V-14B-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)

# Enable memory optimizations
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

# Generate - only one component on GPU at a time
output = pipe(prompt="A sunset over mountains", num_frames=81, height=720, width=1280)
```
Custom Tile Sizes for HunyuanVideo
```python
import torch
from diffusers import HunyuanVideoPipeline

pipe = HunyuanVideoPipeline.from_pretrained("hunyuanvideo-community/HunyuanVideo", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

# Larger tiles = fewer seams but more memory per tile
pipe.vae.enable_tiling(
    tile_sample_min_height=512,
    tile_sample_min_width=512,
    tile_sample_stride_height=384,
    tile_sample_stride_width=384,
)
```
Combining Tiling and Slicing for Batch Processing
```python
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()  # Process batch elements one at a time
```
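The effect of slicing can be sketched in shape-only terms: the batch is decoded one element at a time and the results concatenated, so peak memory scales with a single sample rather than the whole batch. Everything here (`sliced_decode`, the toy upsampling "decoder") is illustrative, not the diffusers implementation.

```python
import numpy as np

def sliced_decode(decode_one, latents):
    """Sketch of VAE slicing: decode batch elements individually and
    concatenate, trading a little speed for peak-memory savings.
    `decode_one` stands in for a single-sample VAE decode call."""
    return np.concatenate([decode_one(z[None]) for z in latents], axis=0)

# Toy 'decoder' that just 8x-upsamples spatially; shapes only, no real VAE.
decode_one = lambda z: np.repeat(np.repeat(z, 8, axis=-1), 8, axis=-2)

latents = np.zeros((4, 16, 32, 32))  # (batch, channels, h, w)
print(sliced_decode(decode_one, latents).shape)  # → (4, 16, 256, 256)
```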
Related Pages
- Huggingface_Diffusers_Video_Memory_Management (principle for this implementation) - Theory behind memory optimization strategies
- Huggingface_Diffusers_Video_Pipeline_From_Pretrained (prerequisite) - Pipeline must be loaded before enabling optimizations
- Huggingface_Diffusers_Export_To_Video (benefits from tiling) - Decoding step uses the tiling configuration