Principle:Zai org CogVideo Memory Optimization

Overview

Technique for reducing GPU memory consumption during video generation by offloading model components and optimizing VAE processing.

Description

Video generation with large transformer models requires significant GPU memory. Three complementary strategies are used to reduce peak VRAM consumption:

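Sequential CPU offloading -- Keeps model weights on the CPU and moves each piece to the GPU only for its forward pass, then moves it back. enable_sequential_cpu_offload() does this at submodule granularity, which gives the lowest peak VRAM at the cost of many CPU-GPU transfers. The related enable_model_cpu_offload() moves one whole component (text encoder, transformer, VAE) at a time, which is faster but saves less memory.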
VAE slicing -- Processes video frames in slices rather than all at once. Instead of decoding all frames simultaneously, the VAE processes one frame (or a small batch) at a time.
VAE tiling -- Processes spatial dimensions in tiles rather than the full resolution. Instead of decoding the entire spatial extent at once, the VAE processes overlapping tiles and blends them together.

Together, these strategies make generation feasible on consumer GPUs with 16-24 GB of VRAM, which would otherwise be insufficient for the large CogVideoX models.

Usage

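Use these strategies when GPU memory is limited. VAE slicing and tiling can be combined with either offloading mode; the two offloading modes are alternatives, so enable only one of them: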

Strategy	Memory Savings	Speed Impact	When to Use
enable_sequential_cpu_offload()	Highest (lowest VRAM)	Slower (CPU-GPU transfers)	Consumer GPUs with 16 GB VRAM
enable_model_cpu_offload()	Moderate	Moderate overhead	GPUs with 24 GB VRAM
vae.enable_slicing()	Reduces VAE peak memory	Minimal	Always for video generation
vae.enable_tiling()	Reduces VAE spatial memory	Minimal	Always for video generation

Recommendation: Always enable VAE slicing and tiling for video generation. Choose between sequential and model CPU offloading based on available VRAM.

Theoretical Basis

Sequential CPU Offloading

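Sequential offloading trades speed for memory: every forward pass pays for CPU-GPU weight transfers. If only one unit of the model occupies GPU memory at a time, peak weight memory becomes max(size(unit_i)) rather than sum(size(unit_i)). With whole-component offloading the units are the components (text encoder, transformer, VAE); sequential offloading uses smaller submodules as units, so its peak is lower still. Activations add to this figure in either case.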
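
The max-versus-sum argument can be checked with a small simulation. This is a minimal sketch at component level: the component sizes are hypothetical round numbers, not measurements of any CogVideoX checkpoint, and activation memory is ignored.

```python
def peak_resident_gb(sizes_gb, offload):
    """Return peak GPU-resident weight size when components run one after another.

    sizes_gb: mapping of component name -> weight size in GB (hypothetical values).
    offload:  if True, each component is moved back to CPU after its forward pass.
    """
    resident = 0.0
    peak = 0.0
    for size in sizes_gb.values():
        resident += size             # component moved to the GPU for its forward pass
        peak = max(peak, resident)
        if offload:
            resident -= size         # moved back to CPU before the next component runs
    return peak


# Hypothetical component sizes, chosen only to make the arithmetic easy to follow.
sizes = {"text_encoder": 9.0, "transformer": 10.5, "vae": 0.5}

print("all resident:", peak_resident_gb(sizes, offload=False))   # sum of sizes
print("offloaded:   ", peak_resident_gb(sizes, offload=True))    # largest single size
```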

VAE Slicing

VAE slicing reduces peak memory from O(F x C x H x W) to O(C x H x W) per slice, where:

F = number of frames
C = number of channels
H = height
W = width
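
A minimal sketch of the slicing idea, using plain Python lists in place of tensors. The function names, the decode_fn callback, and the example shape (49 frames, 3 channels, 480 x 720) are assumptions for illustration. The element count covers only the working set of a single decode call; the assembled output still contains all F frames.

```python
def decode_in_slices(latent_frames, decode_fn, slice_size=1):
    """Decode frames slice by slice instead of in one call.

    decode_fn takes a list of frames and returns a list of decoded frames;
    it only ever sees `slice_size` frames at a time.
    """
    decoded = []
    for start in range(0, len(latent_frames), slice_size):
        chunk = latent_frames[start:start + slice_size]
        decoded.extend(decode_fn(chunk))
    return decoded


def peak_elements(frames, channels, height, width, slice_size=None):
    """Elements in flight during one decode call: F*C*H*W, or slice_size*C*H*W when slicing."""
    in_flight = frames if slice_size is None else min(slice_size, frames)
    return in_flight * channels * height * width


# Hypothetical output shape, used only to compare the two working-set sizes.
full = peak_elements(49, 3, 480, 720)
sliced = peak_elements(49, 3, 480, 720, slice_size=1)
print(full, sliced, full // sliced)
```

With one frame per slice, the working set shrinks by a factor of F, which is the reduction stated above.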

VAE Tiling

VAE tiling reduces spatial memory from O(H x W) to O(tile_h x tile_w) per tile. Overlapping tiles with blending at boundaries prevent visible seams in the output.
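
The tiling-and-blending idea can be sketched in one dimension; the two-dimensional case applies the same scheme along both height and width. This is a simplified illustration, not the blend used by any particular VAE implementation: it combines overlapping tiles with a weighted average whose weights ramp down toward tile edges. The names tile_starts and process_tiled and the default tile and overlap sizes are assumptions.

```python
def tile_starts(length, tile, overlap):
    """Start offsets of overlapping tiles that together cover [0, length)."""
    stride = tile - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than tile")
    starts = list(range(0, max(length - tile, 0) + 1, stride))
    if starts[-1] + tile < length:
        starts.append(length - tile)   # extra tile so the tail is covered
    return starts


def process_tiled(signal, process_fn, tile=4, overlap=2):
    """Process a 1-D signal tile by tile and blend the overlapping outputs."""
    n = len(signal)
    acc = [0.0] * n      # weighted sum of tile outputs per position
    weight = [0.0] * n   # total blend weight per position
    for start in tile_starts(n, tile, overlap):
        out = process_fn(signal[start:start + tile])   # only one tile is processed at a time
        m = len(out)
        for j, value in enumerate(out):
            w = min(j + 1, m - j)   # ramp: small near tile edges, larger in the centre
            acc[start + j] += w * value
            weight[start + j] += w
    return [a / w for a, w in zip(acc, weight)]


# For a purely pointwise operation, tiled processing reproduces the full result.
signal = list(range(10))
double = lambda chunk: [2 * v for v in chunk]
print(process_tiled(signal, double))
```

A real decoder is not pointwise, so neighbouring tiles disagree slightly near their edges; the ramped weights are what smooth that disagreement into a seamless transition.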
