Principle:Zai org CogVideo Memory Optimization

From Leeroopedia


Overview

Technique for reducing GPU memory consumption during video generation by offloading model components and optimizing VAE processing.

Description

Video generation with large transformer models requires significant GPU memory. Three complementary strategies are used to reduce peak VRAM consumption:

  1. Sequential CPU offloading -- Moves each model component to GPU only during its forward pass, then moves it back to CPU. This minimizes peak VRAM by ensuring only one component resides on the GPU at any time.
  2. VAE slicing -- Processes the VAE input in slices rather than all at once. In the diffusers implementation, vae.enable_slicing() splits the input along the batch dimension, so the VAE decodes one video at a time instead of the whole batch.
  3. VAE tiling -- Processes spatial dimensions in tiles rather than the full resolution. Instead of decoding the entire spatial extent at once, the VAE processes overlapping tiles and blends them together.

Together these strategies enable generation on consumer GPUs with 16-24GB VRAM, which would otherwise be insufficient for the large CogVideoX models.

Usage

Use these techniques when GPU memory is limited. The two offloading modes are alternatives to each other; the VAE options can be combined with either one:

| Strategy                        | Memory Savings             | Speed Impact               | When to Use                       |
|---------------------------------|----------------------------|----------------------------|-----------------------------------|
| enable_sequential_cpu_offload() | Highest (lowest VRAM)      | Slower (CPU-GPU transfers) | Consumer GPUs with 16 GB VRAM     |
| enable_model_cpu_offload()      | Moderate                   | Moderate overhead          | GPUs with 24 GB VRAM              |
| vae.enable_slicing()            | Reduces VAE peak memory    | Minimal                    | Always for video generation       |
| vae.enable_tiling()             | Reduces VAE spatial memory | Minimal                    | Always for video generation       |

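Recommendation: Always enable VAE slicing and tiling for video generation. Choose between sequential CPU offloading (lowest VRAM, slowest) and model CPU offloading (moves whole components at a time, so it is faster but needs more VRAM) based on available VRAM.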
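
The recommendation above can be sketched as a small helper. This is a minimal sketch, not the project's own code: the function name apply_memory_optimizations and the sequential_offload flag are hypothetical, and pipe is assumed to be a diffusers-style pipeline that exposes the four methods listed in the table.

```python
def apply_memory_optimizations(pipe, sequential_offload=True):
    """Enable one CPU offloading mode plus VAE slicing and tiling.

    `pipe` is assumed to be a diffusers-style pipeline with a `vae`
    attribute; the helper name and flag are illustrative only.
    """
    if sequential_offload:
        # Lowest VRAM, slowest: weights are moved to the GPU only when needed.
        pipe.enable_sequential_cpu_offload()
    else:
        # Faster, but needs more VRAM: whole components are moved at a time.
        pipe.enable_model_cpu_offload()

    # The VAE options are independent of the offloading mode.
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()
    return pipe
```

With diffusers, this would typically be called on a CogVideoXPipeline right after loading it and before generation. The offloading helpers manage device placement themselves, so the pipeline should not also be moved to the GPU manually.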

Theoretical Basis

Sequential CPU Offloading

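Sequential offloading trades time (CPU-to-GPU data transfers) for memory. At any given time, only the part of the model that is currently executing holds its weights on the GPU, so peak weight memory is bounded by max(size(component_i)) rather than sum(size(all_components)); activations come on top of that. In diffusers, enable_sequential_cpu_offload() applies this at submodule granularity, which lowers the peak further, while enable_model_cpu_offload() moves whole components.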
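
The max-versus-sum argument can be checked with a few lines of arithmetic. The component sizes below are hypothetical round numbers chosen for readability, not measurements of any CogVideoX model, and the function name is illustrative.

```python
def peak_weight_memory(component_sizes_gb, offload):
    """Return peak GPU weight memory under the max-vs-sum model.

    Activations and framework overhead are deliberately ignored.
    """
    sizes = list(component_sizes_gb.values())
    return max(sizes) if offload else sum(sizes)


# Hypothetical sizes in GB, chosen only to make the arithmetic easy to follow.
components = {"text_encoder": 9, "transformer": 11, "vae": 1}

print(peak_weight_memory(components, offload=False))  # 21
print(peak_weight_memory(components, offload=True))   # 11
```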

VAE Slicing

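VAE slicing reduces peak memory from O(B x F x C x H x W) to O(F x C x H x W) per slice, because the diffusers implementation splits the input along the batch dimension. Here:

  • B = number of videos decoded together (batch size)
  • F = number of frames
  • C = number of channels
  • H = height
  • W = width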
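
The slicing idea can be sketched without any deep learning library. In this sketch each element of the outer list stands for one slice, decode_batch is a stand-in for a real decoder, and both function names are hypothetical.

```python
def decode_batch(latents):
    """Stand-in decoder: doubles every value of every slice it receives."""
    return [[2 * v for v in sample] for sample in latents]


def sliced_decode(decode_fn, latents):
    """Decode one slice at a time and concatenate the results.

    Only a single slice is passed to `decode_fn` per call, so the
    decoder's working set never covers more than one slice.
    """
    outputs = []
    for sample in latents:
        outputs.extend(decode_fn([sample]))  # a "batch" holding one slice
    return outputs


latents = [[1, 2], [3, 4], [5, 6]]
assert sliced_decode(decode_batch, latents) == decode_batch(latents)
```

The result matches the unsliced call because the slices are independent of each other; the saving comes purely from the smaller input per decoder call.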

VAE Tiling

VAE tiling reduces spatial memory from O(H x W) to O(tile_h x tile_w) per tile. Overlapping tiles with blending at boundaries prevent visible seams in the output.
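
Overlapping tiles with boundary blending can be illustrated in one dimension. This is a simplified sketch under stated assumptions: it works on a plain Python list instead of a 2-D image, uses a linear cross-fade, and the names blend, tiled_apply, and double are hypothetical rather than taken from any library.

```python
def blend(prev_tile, tile, extent):
    """Cross-fade the start of `tile` with the tail of `prev_tile`."""
    extent = min(extent, len(prev_tile), len(tile))
    out = list(tile)
    for x in range(extent):
        w = x / extent  # weight of the current tile grows from 0 towards 1
        out[x] = prev_tile[-extent + x] * (1 - w) + tile[x] * w
    return out


def tiled_apply(fn, signal, tile_size, overlap):
    """Apply `fn` tile by tile over a 1-D list and stitch the results."""
    if not 0 <= overlap < tile_size:
        raise ValueError("overlap must satisfy 0 <= overlap < tile_size")
    stride = tile_size - overlap
    tiles = [fn(signal[i:i + tile_size]) for i in range(0, len(signal), stride)]
    out = []
    for k, tile in enumerate(tiles):
        if k > 0:
            tile = blend(tiles[k - 1], tile, overlap)
        # Keep the first `stride` values; the rest is covered by the next tile.
        out.extend(tile[:stride])
    return out


def double(xs):
    """Stand-in for a decoder; processes at most one tile per call."""
    return [2 * v for v in xs]


signal = list(range(10))
result = tiled_apply(double, signal, tile_size=4, overlap=2)
assert result == double(signal)
```

For a pointwise function such as double, the stitched result equals the untiled one. A real decoder has a spatial receptive field, so neighbouring tiles disagree slightly near their edges, and the cross-fade is what smooths that disagreement.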
