Principle:Huggingface Diffusers Video Memory Management
| Property | Value |
|---|---|
| Principle Name | Video Memory Management |
| Overview | Memory optimization techniques specific to video generation including VAE tiling, CPU offloading, and sliced decoding |
| Domains | Video Generation, GPU Memory Optimization |
| Related Implementation | Huggingface_Diffusers_Video_Memory_Setup |
| Knowledge Sources | Repo (https://github.com/huggingface/diffusers), Source (src/diffusers/pipelines/pipeline_utils.py:L1174-L1270, src/diffusers/models/autoencoders/autoencoder_kl_wan.py:L1086-L1114) |
| Last Updated | 2026-02-13 00:00 GMT |
Description
Video generation models require substantially more memory than image generation because they operate on 5D tensors (B, C, F, H, W) with dozens to hundreds of frames. A single 720p 81-frame video's latent representation at 16 channels is approximately 1 x 16 x 21 x 60 x 104 floats. The three primary optimization strategies are:
- Model CPU Offloading (`enable_model_cpu_offload`) - Moves each model component to the GPU only when needed, following a sequential offload chain defined by `model_cpu_offload_seq`.
- VAE Tiling (`enable_tiling`) - Splits the spatial dimensions of VAE inputs into overlapping tiles, processes them independently, and blends the results.
- VAE Slicing (`enable_slicing`) - Processes the batch dimension one sample at a time during encoding and decoding.
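The latent-size arithmetic above can be sketched as follows; the 16-channel, 8x spatial / 4x temporal compression factors are typical Wan-style values assumed for illustration, not read from any checkpoint:

```python
# Estimate the latent shape and memory for an 81-frame 832x480 video.
# The compression ratios (8x spatial, 4x temporal, 16 latent channels)
# are assumed Wan-style values for illustration.

def latent_shape(frames, height, width, channels=16,
                 spatial_ratio=8, temporal_ratio=4):
    """Return the (B, C, F, H, W) latent shape for a single video."""
    latent_frames = (frames - 1) // temporal_ratio + 1
    return (1, channels, latent_frames,
            height // spatial_ratio, width // spatial_ratio)

shape = latent_shape(81, 480, 832)
elements = 1
for dim in shape:
    elements *= dim

print(shape)  # (1, 16, 21, 60, 104)
print(f"{elements * 2 / 2**20:.1f} MiB in fp16")
```

The latent itself is small; the decoded pixel tensor (3 channels at full resolution and frame count) is tens of times larger, which is why the VAE decode dominates peak memory.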
Theoretical Basis
Model CPU Offloading
The offload chain defines the order in which components move to GPU. For video pipelines:
- Wan: `text_encoder -> transformer -> transformer_2 -> vae`
- HunyuanVideo: `text_encoder -> text_encoder_2 -> transformer -> vae`
- CogVideoX: `text_encoder -> transformer -> vae`
Each component is wrapped with an accelerate hook (`cpu_offload_with_hook`). When a component's `forward()` is called, the hook moves it to the GPU. When the next component in the chain runs, the previous one is moved back to CPU. This means only one large model is on the GPU at a time.
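As a rough illustration of that invariant, a toy chain might look like this; the class and its bookkeeping are invented for the sketch and bear no relation to accelerate's actual hook machinery:

```python
# Toy model of the sequential offload chain: calling a component's
# forward() makes it resident and evicts whichever component was
# resident before, so at most one large model occupies the GPU at any
# time. An illustration only, not accelerate's cpu_offload_with_hook.

class OffloadedComponent:
    resident = None  # name of the single component currently "on GPU"

    def __init__(self, name):
        self.name = name

    def forward(self):
        # The hook loads this component; the previous one moves back to CPU.
        OffloadedComponent.resident = self.name
        return self.name

chain = [OffloadedComponent(n)
         for n in ("text_encoder", "transformer", "transformer_2", "vae")]
for component in chain:
    component.forward()
    assert OffloadedComponent.resident == component.name
```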
VAE Tiling
The VAE is the primary memory bottleneck during decoding because it operates at full pixel resolution. Tiling works by:
- Splitting the latent tensor into overlapping spatial tiles (controlled by `tile_sample_min_height` and `tile_sample_min_width`)
- Decoding each tile independently through the full decoder
- Blending overlapping regions using linear interpolation (the `blend_v` and `blend_h` methods) to avoid seam artifacts
- The stride parameters (`tile_sample_stride_height`, `tile_sample_stride_width`) control the overlap between adjacent tiles
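The blending step can be sketched in one dimension; the real `blend_v`/`blend_h` operate on 5D tensors, and this list-based version shows only the linear weight ramp:

```python
# 1-D sketch of the blend_v/blend_h idea: over the overlap region the
# weight ramps linearly from the previous tile toward the next, hiding
# the seam between independently decoded tiles.

def blend(prev_overlap, next_overlap):
    """Cross-fade the overlapping edges of two adjacent tiles."""
    n = len(prev_overlap)
    return [prev_overlap[i] * (1 - i / n) + next_overlap[i] * (i / n)
            for i in range(n)]

# Overlapping rows from two decoded tiles:
print(blend([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]))
# [1.0, 0.75, 0.5, 0.25]
```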
For temporal processing, the Wan VAE decodes one latent frame at a time using a feature caching system (`feat_cache`) that maintains causal convolution state across frames. This means temporal memory usage is constant regardless of video length.
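The constant-memory property can be illustrated with a toy causal convolution that carries only the previous kernel-minus-one inputs between frames; this is a simplification of the `feat_cache` idea, not the Wan VAE's actual code:

```python
# Toy frame-by-frame causal convolution: only the last (k - 1) inputs
# are kept between steps, so state size is constant no matter how long
# the video is. A simplified illustration, not the real decoder.

def causal_conv_stream(frames, kernel=(0.25, 0.5, 0.25)):
    cache = [0.0] * (len(kernel) - 1)  # state carried across frames
    out = []
    for x in frames:
        window = cache + [x]
        out.append(sum(w * v for w, v in zip(kernel, window)))
        cache = window[1:]  # drop the oldest input, keep k - 1
    return out

print(causal_conv_stream([1.0, 1.0, 1.0, 1.0]))
# [0.25, 0.75, 1.0, 1.0]
```

The output for a constant input ramps up while the cache "warms", then settles; crucially, `cache` never grows with the number of frames.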
VAE Slicing
When the batch dimension is greater than 1, slicing processes each batch element independently through the encoder/decoder rather than all at once, trading throughput for peak memory.
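A minimal sketch of that trade, with a stand-in decoder (`decode_one` is hypothetical, representing one pass through the real VAE decoder):

```python
# Toy sketch of VAE slicing: run the decoder once per batch element
# instead of once on the whole batch, so peak activation memory scales
# with a single sample. decode_one stands in for the real decoder.

def decode_sliced(latent_batch, decode_one):
    decoded = []
    for z in latent_batch:  # only one slice's activations live at a time
        decoded.append(decode_one(z))
    return decoded

print(decode_sliced([1, 2, 3], lambda z: z * 2))
# [2, 4, 6]
```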
Usage
Apply these optimizations after pipeline instantiation but before calling the pipeline:
- Always enable model CPU offloading (`enable_model_cpu_offload`) when GPU memory is limited (< 24 GB for 14B models)
- Always enable VAE tiling for videos above 480p resolution
- Enable VAE slicing only for batch sizes > 1
- These techniques can be combined: offloading + tiling together provides maximum memory savings
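Putting the guidelines together: `enable_model_cpu_offload` is a real pipeline method and `enable_tiling`/`enable_slicing` are real VAE methods in diffusers, but the helper below and its thresholds are a sketch of this section's rules, not library code:

```python
# Apply the memory optimizations per the guidelines above. The helper
# and its thresholds mirror this section's rules; only the enable_*
# methods are actual diffusers APIs.

def configure_memory(pipe, gpu_gb, height, batch_size):
    if gpu_gb < 24:       # limited VRAM: sequentially offload model components
        pipe.enable_model_cpu_offload()
    if height > 480:      # above 480p: tile the VAE spatially
        pipe.vae.enable_tiling()
    if batch_size > 1:    # batched generation: process one sample at a time
        pipe.vae.enable_slicing()
    return pipe
```

Call it after `from_pretrained` and before the first pipeline call, e.g. `configure_memory(pipe, gpu_gb=16, height=720, batch_size=1)`; offloading and tiling compose, since they target different bottlenecks.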
Related Pages
- Huggingface_Diffusers_Video_Memory_Setup (implements this principle) - Concrete API calls for enabling memory optimizations
- Huggingface_Diffusers_Video_Pipeline_Selection (prerequisite) - Pipeline must be instantiated before configuring memory
- Huggingface_Diffusers_Video_Decoding_Export (benefits from this) - Decoding is the primary consumer of the tiling optimization
Implementation:Huggingface_Diffusers_Video_Memory_Setup