Principle:Zai org CogVideo Memory Optimization
Overview
Technique for reducing GPU memory consumption during video generation by offloading model components to the CPU and by having the VAE process frames in slices and spatial tiles.
Description
Video generation with large transformer models requires significant GPU memory. Three complementary strategies are used to reduce peak VRAM consumption:
- Sequential CPU offloading -- Moves each model component to GPU only during its forward pass, then moves it back to CPU. This minimizes peak VRAM by ensuring only one component resides on the GPU at any time.
- VAE slicing -- Processes video frames in slices, one frame (or a small batch) at a time, rather than decoding all frames simultaneously.
- VAE tiling -- Processes the spatial dimensions in overlapping tiles and blends them together, rather than decoding the full resolution at once.
Together these strategies enable generation on consumer GPUs with 16-24GB VRAM, which would otherwise be insufficient for the large CogVideoX models.
Usage
Use when GPU memory is limited. The strategies can be combined:
| Strategy | Memory Savings | Speed Impact | When to Use |
|---|---|---|---|
| enable_sequential_cpu_offload() | Highest (lowest VRAM) | Slower (CPU-GPU transfers) | Consumer GPUs with 16GB VRAM |
| enable_model_cpu_offload() | Moderate | Moderate overhead | GPUs with 24GB VRAM |
| vae.enable_slicing() | Reduces VAE peak memory | Minimal | Always for video generation |
| vae.enable_tiling() | Reduces VAE spatial memory | Minimal | Always for video generation |
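Recommendation: Always enable VAE slicing and tiling for video generation. For CPU offloading, choose sequential offloading when VRAM is tightest (around 16 GB) and accept the slower generation; choose model offloading when more VRAM is available (around 24 GB) and lower overhead matters.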
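The recommendation above can be sketched as a small helper. This is a minimal sketch, not the project's own code: the function names, the 24 GB threshold (read off the table), and the model ID "THUDM/CogVideoX-5b" are assumptions for illustration. The four enable_* calls are the ones listed in the table.

```python
def choose_offload_mode(vram_gb: float) -> str:
    """Pick an offload mode from available VRAM.

    Assumption: the 24 GB cut-off is taken from the table above,
    not from any measured requirement.
    """
    return "sequential" if vram_gb < 24 else "model"


def apply_memory_optimizations(pipe, vram_gb: float):
    """Enable CPU offloading plus VAE slicing and tiling on a pipeline object."""
    if choose_offload_mode(vram_gb) == "sequential":
        pipe.enable_sequential_cpu_offload()  # lowest VRAM, slowest
    else:
        pipe.enable_model_cpu_offload()       # moderate savings, less overhead
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()
    return pipe


def load_cogvideox(model_id: str = "THUDM/CogVideoX-5b"):
    """Load the pipeline; imports are deferred so the helpers above
    can be used without torch or diffusers installed."""
    import torch
    from diffusers import CogVideoXPipeline

    return CogVideoXPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
```

A typical call under these assumptions would be `pipe = apply_memory_optimizations(load_cogvideox(), vram_gb=16)`.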
Theoretical Basis
Sequential CPU Offloading
Sequential offloading trades speed for memory: every component must be transferred from CPU to GPU before it runs and back afterwards. At any given time, only one model component occupies GPU memory, so the peak VRAM used for model weights becomes max(size(component_i)) rather than sum(size(all_components)).
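The max-versus-sum relationship can be checked with a few lines of arithmetic. The component names and sizes below are hypothetical placeholders, not measured CogVideoX figures, and the calculation covers weights only (activations are not modeled).

```python
# Hypothetical weight sizes in GB, for illustration only.
component_gb = {"text_encoder": 9.0, "transformer": 10.0, "vae": 0.5}


def peak_all_resident(sizes: dict) -> float:
    """Peak weight memory when every component stays on the GPU."""
    return sum(sizes.values())


def peak_one_at_a_time(sizes: dict) -> float:
    """Peak weight memory when only the running component is on the GPU."""
    return max(sizes.values())


print(peak_all_resident(component_gb))   # sum over all components
print(peak_one_at_a_time(component_gb))  # size of the largest component
```

The saving is therefore largest when memory is spread across several components of similar size, and smallest when one component dominates.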
VAE Slicing
VAE slicing reduces peak memory from O(F × C × H × W) to O(C × H × W) per one-frame slice, where:
- F = number of frames
- C = number of channels
- H = height
- W = width
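The slicing idea can be illustrated with a toy decoder that records the largest chunk it is ever handed. This is a sketch under stated assumptions: `decode_in_slices` and `RecordingDecoder` are hypothetical names, the "decoder" just doubles numbers, and the slice axis used by any particular VAE implementation is not specified here.

```python
from typing import Callable, List, Sequence


def decode_in_slices(
    latents: Sequence[float],
    decode_fn: Callable[[Sequence[float]], List[float]],
    slice_size: int = 1,
) -> List[float]:
    """Decode `latents` a slice at a time instead of in one call."""
    outputs: List[float] = []
    for start in range(0, len(latents), slice_size):
        chunk = latents[start:start + slice_size]
        outputs.extend(decode_fn(chunk))  # only `chunk` is in flight here
    return outputs


class RecordingDecoder:
    """Toy stand-in for a VAE decoder; tracks the largest chunk it saw."""

    def __init__(self) -> None:
        self.max_chunk = 0

    def __call__(self, chunk: Sequence[float]) -> List[float]:
        self.max_chunk = max(self.max_chunk, len(chunk))
        return [x * 2 for x in chunk]


decoder = RecordingDecoder()
frames = decode_in_slices(list(range(8)), decoder, slice_size=1)
print(len(frames), decoder.max_chunk)
```

The decoded output still grows with the number of frames; what slicing bounds is the working set the decoder needs at any one moment.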
VAE Tiling
VAE tiling reduces spatial memory from O(H × W) to O(tile_h × tile_w) per tile. Adjacent tiles overlap, and blending the overlapping regions at tile boundaries suppresses visible seams in the output.
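Overlap-and-blend can be sketched in one dimension; the same idea applies along height and width. This is a minimal sketch assuming weighted-average blending with linear ramps at tile edges, which is one common scheme and not necessarily what a given VAE implementation uses. All names (`tile_weights`, `decode_tiled`) are hypothetical.

```python
from typing import Callable, List, Sequence


def tile_weights(length: int, overlap: int) -> List[float]:
    """Weights that ramp up near each tile edge and reach 1.0 in the middle."""
    weights = []
    for i in range(length):
        edge = min(i, length - 1 - i)  # distance to the nearest tile edge
        weights.append(min(1.0, (edge + 1) / (overlap + 1)))
    return weights


def decode_tiled(
    signal: Sequence[float],
    decode_fn: Callable[[Sequence[float]], List[float]],
    tile_size: int = 4,
    overlap: int = 2,
) -> List[float]:
    """Decode overlapping tiles and blend them by weighted averaging."""
    if not 0 <= overlap < tile_size:
        raise ValueError("overlap must satisfy 0 <= overlap < tile_size")
    stride = tile_size - overlap
    total = [0.0] * len(signal)
    weight_sum = [0.0] * len(signal)
    start = 0
    while True:
        tile = signal[start:start + tile_size]  # last tile may be shorter
        decoded = decode_fn(tile)
        for offset, (value, w) in enumerate(
            zip(decoded, tile_weights(len(tile), overlap))
        ):
            total[start + offset] += w * value
            weight_sum[start + offset] += w
        if start + tile_size >= len(signal):
            break
        start += stride
    # Normalize so overlapping contributions average instead of adding up.
    return [t / w for t, w in zip(total, weight_sum)]
```

With an identity `decode_fn`, the blended result reproduces the input up to floating-point rounding, which is a quick sanity check that the weights are normalized correctly. A real decoder produces slightly different values per tile in the overlap, and the ramped weights fade between them.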