Heuristic: zai-org CogVideo CPU Offload Strategy
| Knowledge Sources | |
|---|---|
| Domains | Inference, Memory_Optimization, Video_Generation |
| Last Updated | 2026-02-10 02:00 GMT |
Overview
Three-tier CPU offload strategy for CogVideoX inference: sequential offload (minimum memory), model offload (balanced), or no offload with device_map="balanced" for multi-GPU setups.
Description
CogVideoX inference requires careful selection of the CPU offload strategy based on available VRAM and the number of GPUs. The Diffusers pipeline offers three modes with dramatically different memory-speed trade-offs. Additionally, DeepSpeed-based training cannot use Diffusers CPU offload at all, requiring its own memory management. Choosing the wrong strategy can result in OOM errors or unnecessarily slow inference.
Usage
Apply this heuristic when setting up inference pipelines or during validation within training. The choice depends on (1) available VRAM per GPU and (2) number of GPUs.
The Insight (Rule of Thumb)
Single GPU:
- Low VRAM (< 16GB): Use `pipe.enable_sequential_cpu_offload()` — Moves individual layers to CPU after each use. Slowest but minimum VRAM (~4-6GB).
- Medium VRAM (16-24GB): Use `pipe.enable_model_cpu_offload()` — Moves entire model components (VAE, text encoder, transformer) between CPU and GPU. Good balance of speed and memory.
- High VRAM (> 24GB): No offload needed. Fastest inference at 3-4x speedup over sequential offload.
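The single-GPU tiers above can be encoded as a small helper. This is a sketch of the card's rule of thumb only; the function name and the exact thresholds are our own shorthand, not a Diffusers API:

```python
def single_gpu_offload(vram_gb: float) -> str:
    """Map available VRAM to the offload call recommended above.

    Thresholds (16/24 GB) are this card's rule of thumb, not Diffusers
    defaults; returns the name of the pipeline method to call ("none"
    means keep the whole pipeline resident on the GPU).
    """
    if vram_gb < 16:
        return "enable_sequential_cpu_offload"  # slowest, ~4-6GB VRAM
    if vram_gb <= 24:
        return "enable_model_cpu_offload"       # balanced speed/memory
    return "none"                               # fastest, no offload
```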
Multi-GPU:
- Disable ALL CPU offload. Use `device_map="balanced"` in `from_pretrained()` instead to distribute model layers across GPUs.
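A minimal sketch of how the loader arguments change for multi-GPU, assuming the `device_map="balanced"` keyword shown in the repo's comment; the helper name is our own and the dtype is given as a plain string purely for illustration:

```python
def pretrained_kwargs(num_gpus: int) -> dict:
    """Build illustrative from_pretrained() keyword arguments.

    Multi-GPU: shard layers across GPUs via device_map and skip every
    enable_*_cpu_offload() call. Single GPU: omit device_map and pick
    an offload mode separately per the VRAM tiers above.
    """
    kwargs = {"torch_dtype": "bfloat16"}  # illustrative; pass a real torch dtype in practice
    if num_gpus > 1:
        kwargs["device_map"] = "balanced"
    return kwargs
```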
Training (DeepSpeed):
- Cannot use `model_cpu_offload` or `sequential_cpu_offload` with DeepSpeed. DeepSpeed manages its own memory partitioning. Move all pipeline components to device manually instead.
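The manual move can be sketched as below. `move_pipe_to_device` is our own helper name; it assumes the Diffusers `pipe.components` dict and skips entries (such as the scheduler) that have no `.to()` method:

```python
def move_pipe_to_device(pipe, device):
    # With DeepSpeed active, enable_*_cpu_offload() must not be called;
    # instead place every movable pipeline component on the accelerator
    # device explicitly and let DeepSpeed handle memory partitioning.
    for component in pipe.components.values():
        if hasattr(component, "to"):
            component.to(device)
    return pipe
```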
Always enable:
- `pipe.vae.enable_slicing()` — Non-optional even in multi-GPU setups.
- `pipe.vae.enable_tiling()` — Non-optional even in multi-GPU setups.
Reasoning
From `inference/cli_demo.py:17`:
```python
# You can change `pipe.enable_sequential_cpu_offload()` to
# `pipe.enable_model_cpu_offload()` to speed up inference,
# but this will use more GPU memory
```
DeepSpeed incompatibility from `finetune/trainer.py:521-524`:
```python
if self.state.using_deepspeed:
    # Can't using model_cpu_offload in deepspeed,
    # so we need to move all components in pipe to device
```
Multi-GPU guidance from `inference/cli_demo.py:92-93`:
```python
# add device_map="balanced" in the from_pretrained function and
# remove the enable_model_cpu_offload() function to use Multi GPUs.
```
VAE slicing/tiling mandatory comment from `tools/parallel_inference/parallel_inference_xdit.py:68-70`:
```python
# Always enable tiling and slicing to avoid VAE OOM while batch size > 1
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```