Heuristic: Zai org CogVideo Memory Optimization Strategies
| Knowledge Sources | |
|---|---|
| Domains | Memory_Optimization, Training, Inference, Video_Generation |
| Last Updated | 2026-02-10 02:00 GMT |
Overview
Layered memory optimization strategy combining gradient checkpointing, VAE slicing/tiling, latent caching, and component offloading to fit CogVideoX training and inference within GPU VRAM constraints.
Description
CogVideoX models are extremely memory-intensive due to the combination of 3D video data, large transformer backbones, and VAE encoding/decoding. The codebase implements a multi-layered memory optimization strategy that can reduce VRAM usage by 50-80%. Each optimization layer can be independently enabled, and they compose together for maximum memory savings at the cost of throughput.
Usage
Apply these optimizations when encountering CUDA OOM errors during training or inference, or when working with GPUs that have less VRAM than the model's full-precision requirements. Start with gradient checkpointing + VAE slicing/tiling (defaults), then add latent caching and CPU offload as needed.
The Insight (Rule of Thumb)
Training optimizations (ordered by impact):
- Gradient checkpointing: `gradient_checkpointing=True` (default). Trades ~20% compute for ~50% VRAM reduction. Essential for all video model training.
- VAE slicing: `enable_slicing=True` (default). Processes video frames one-at-a-time through the VAE instead of batched.
- VAE tiling: `enable_tiling=True` (default). Processes spatial patches instead of full frames through the VAE.
- Latent precomputation: Encodes all videos to latent space before training, then unloads the VAE and text encoder from GPU. Saves 5-15GB VRAM.
- Component unloading: Each trainer specifies an `UNLOAD_LIST` of non-training components (e.g., `text_encoder`) to move off GPU.
- Float32 for trainable params: LoRA parameters must stay in float32 for gradient stability, even under mixed precision.
- DeepSpeed ZeRO: Stage 2 shards optimizer state; Stage 3 shards parameters across GPUs. Stage 3 with CPU offloading enables training on consumer GPUs.
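The highest-impact training optimization, gradient checkpointing, can be illustrated with plain PyTorch. This is a minimal sketch, not the CogVideoX trainer's code; `Block` and `Net` are illustrative names. Checkpointed blocks discard intermediate activations in the forward pass and recompute them during backward, which is the ~20% compute / ~50% VRAM trade described above:

```python
import torch
import torch.utils.checkpoint as ckpt


class Block(torch.nn.Module):
    """One transformer-like sublayer (stand-in for a real block)."""

    def __init__(self, dim: int):
        super().__init__()
        self.lin = torch.nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.lin(x))


class Net(torch.nn.Module):
    def __init__(self, dim: int, depth: int, checkpointing: bool = True):
        super().__init__()
        self.blocks = torch.nn.ModuleList(Block(dim) for _ in range(depth))
        self.checkpointing = checkpointing

    def forward(self, x):
        for block in self.blocks:
            if self.checkpointing and self.training:
                # Activations inside `block` are recomputed on backward
                # instead of being kept alive for the whole forward pass.
                x = ckpt.checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x


net = Net(dim=64, depth=4).train()
out = net(torch.randn(2, 64))
out.sum().backward()  # gradients flow through the checkpointed blocks
```

`use_reentrant=False` is the variant PyTorch currently recommends; it composes cleanly with mixed precision and does not require inputs with `requires_grad`.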
Inference optimizations (ordered by memory savings):
- Sequential CPU offload: `pipe.enable_sequential_cpu_offload()` — Maximum memory savings, slowest.
- Model CPU offload: `pipe.enable_model_cpu_offload()` — Moderate savings, faster than sequential.
- No offload: Fastest, requires 3x more VRAM. Use `device_map="balanced"` for multi-GPU.
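Wiring these choices into a diffusers pipeline might look like the sketch below. The helper name `load_pipeline` and the model id are illustrative; the `enable_sequential_cpu_offload`, `enable_model_cpu_offload`, and VAE `enable_slicing`/`enable_tiling` methods are standard diffusers pipeline APIs:

```python
import torch


def load_pipeline(offload: str = "model"):
    """Load CogVideoX with one of the three offload strategies.

    offload: "sequential" (max savings, slowest), "model" (moderate),
             or "none" (fastest, most VRAM).
    """
    from diffusers import CogVideoXPipeline

    pipe = CogVideoXPipeline.from_pretrained(
        "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
    )
    if offload == "sequential":
        pipe.enable_sequential_cpu_offload()  # submodules paged in per forward
    elif offload == "model":
        pipe.enable_model_cpu_offload()       # whole components swapped per stage
    else:
        pipe.to("cuda")                       # everything resident on GPU

    # VAE slicing/tiling compose with any of the offload modes.
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()
    return pipe
```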
Memory cleanup pattern:
- Always call `gc.collect()` + `torch.cuda.empty_cache()` + `torch.cuda.ipc_collect()` together for complete cleanup.
Reasoning
Video training operates on 5D tensors (batch x frames x channels x height x width), which are orders of magnitude larger than image training tensors. A single CogVideoX-5B training batch at 49 frames x 480x720 resolution requires ~24GB of activation memory without checkpointing. The layered approach allows users to trade compute time for memory at each level.
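As a back-of-envelope check on why precomputing latents pays off, the arithmetic below assumes the CogVideoX VAE's compression factors (4x temporal, 8x spatial, 16 latent channels); treat the exact factors as assumptions, not quoted from the codebase:

```python
# One sample at the resolution cited above: 49 frames of 480x720 RGB.
frames, height, width, channels = 49, 480, 720, 3

# Raw pixel tensor size (element count, not bytes).
raw = frames * channels * height * width

# Assumed VAE compression: 4x temporal, 8x spatial, 16 latent channels.
lat_f = (frames - 1) // 4 + 1                    # 13 latent frames
lat_h, lat_w, lat_c = height // 8, width // 8, 16
latent = lat_f * lat_c * lat_h * lat_w

print(raw // latent)  # the raw tensor is roughly 45x larger than its latent
```

Training on cached latents therefore shrinks every per-sample tensor by over an order of magnitude before the transformer ever sees it, on top of freeing the VAE and text encoder weights.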
The triple-call memory cleanup in `finetune/utils/memory_utils.py:45-51`:

```python
import gc

import torch


def free_memory() -> None:
    if torch.cuda.is_available():
        gc.collect()
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()
```
Latent precomputation and model unloading from `finetune/trainer.py:196-212`:

```python
# Precompute latent for video and prompt embedding
logger.info("Precomputing latent for video and prompt embedding ...")
tmp_data_loader = torch.utils.data.DataLoader(
    self.dataset, ..., batch_size=1, num_workers=0
)
# ... encode all data ...
unload_model(self.components.vae)
unload_model(self.components.text_encoder)
free_memory()
```
Speed vs memory trade-off from `README.md:279-286`:
Disabling all three inference optimizations (CPU offload, VAE slicing, and VAE tiling) yields a 3-4x inference speedup at roughly 3x the VRAM cost.