
Heuristic:Zai org CogVideo CPU Offload Strategy

From Leeroopedia



Knowledge Sources
Domains Inference, Memory_Optimization, Video_Generation
Last Updated 2026-02-10 02:00 GMT

Overview

Three-tier CPU offload strategy for CogVideoX inference: sequential offload (minimum memory), model offload (balanced), or no offload at all (high-VRAM single GPU, or multi-GPU with `device_map="balanced"`).

Description

CogVideoX inference requires careful selection of the CPU offload strategy based on available VRAM and the number of GPUs. The Diffusers pipeline offers three modes with dramatically different memory-speed trade-offs. Additionally, DeepSpeed-based training cannot use Diffusers CPU offload at all, requiring its own memory management. Choosing the wrong strategy can result in OOM errors or unnecessarily slow inference.

Usage

Apply this heuristic when setting up inference pipelines or during validation within training. The choice depends on (1) available VRAM per GPU and (2) number of GPUs.

The Insight (Rule of Thumb)

Single GPU:

  • Low VRAM (< 16GB): Use `pipe.enable_sequential_cpu_offload()` — Moves individual layers to CPU after each use. Slowest but minimum VRAM (~4-6GB).
  • Medium VRAM (16-24GB): Use `pipe.enable_model_cpu_offload()` — Moves entire model components (VAE, text encoder, transformer) between CPU and GPU. Good balance of speed and memory.
  • High VRAM (> 24GB): No offload needed. Fastest inference at 3-4x speedup over sequential offload.
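As a minimal sketch, the single-GPU tiers above can be encoded as a selection helper. The function name and return values are illustrative (they are not part of the Diffusers API); the returned string is simply the name of the pipeline method to call, or `"none"` when the model can stay resident on the GPU:

```python
def pick_single_gpu_offload(vram_gb: float) -> str:
    """Map available VRAM to the offload mode described above.

    Hypothetical helper: returns the name of the Diffusers pipeline
    method to call, or "none" when no offload is needed.
    """
    if vram_gb < 16:
        return "enable_sequential_cpu_offload"  # slowest, minimum VRAM (~4-6GB)
    if vram_gb <= 24:
        return "enable_model_cpu_offload"       # balanced speed and memory
    return "none"                               # fastest, 3-4x over sequential
```

In real code the chosen method would be invoked on the pipeline, e.g. `getattr(pipe, name)()` after loading with `CogVideoXPipeline.from_pretrained(...)`.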

Multi-GPU:

  • Disable ALL CPU offload. Use `device_map="balanced"` in `from_pretrained()` instead to distribute model layers across GPUs.

Training (DeepSpeed):

  • Cannot use `model_cpu_offload` or `sequential_cpu_offload` with DeepSpeed. DeepSpeed manages its own memory partitioning. Move all pipeline components to device manually instead.
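A sketch of the manual placement the trainer comment implies, assuming components that expose a PyTorch-style `.to()`. Dummy objects stand in here for the real VAE, text encoder, and transformer modules; the loop itself is the whole technique:

```python
class _Component:
    """Stand-in for a pipeline component such as pipe.vae.
    Real code would operate on the actual nn.Module objects."""
    def __init__(self):
        self.device = "cpu"

    def to(self, device):
        self.device = device
        return self


def move_pipeline_to_device(components, device):
    # Under DeepSpeed, Diffusers CPU offload cannot be used, so each
    # component is moved to the accelerator explicitly instead.
    for c in components:
        c.to(device)
    return components


vae, text_encoder, transformer = _Component(), _Component(), _Component()
moved = move_pipeline_to_device([vae, text_encoder, transformer], "cuda:0")
```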

Always enable:

  • `pipe.vae.enable_slicing()` — Decodes the latent batch one sample at a time. Non-optional even in multi-GPU setups.
  • `pipe.vae.enable_tiling()` — Decodes in overlapping spatial tiles rather than full frames. Non-optional even in multi-GPU setups.
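Putting the rules together, here is a hedged configuration sketch over any object exposing the Diffusers method names (a lightweight mock is used below so the sketch is self-contained; note that multi-GPU sharding itself cannot be applied here — `device_map="balanced"` must be passed to `from_pretrained()` at load time):

```python
def configure_offload(pipe, vram_gb: float, num_gpus: int) -> str:
    """Apply the heuristic above to a Diffusers-style pipeline.
    Illustrative helper; returns the chosen strategy for logging."""
    if num_gpus == 1 and vram_gb < 16:
        pipe.enable_sequential_cpu_offload()
        strategy = "sequential"
    elif num_gpus == 1 and vram_gb <= 24:
        pipe.enable_model_cpu_offload()
        strategy = "model"
    else:
        strategy = "none"  # high-VRAM single GPU, or multi-GPU sharding
    # Non-optional in every configuration:
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()
    return strategy


class _MockVAE:
    def __init__(self):
        self.slicing = self.tiling = False
    def enable_slicing(self):
        self.slicing = True
    def enable_tiling(self):
        self.tiling = True


class _MockPipe:
    """Duck-typed stand-in for CogVideoXPipeline, for illustration only."""
    def __init__(self):
        self.vae = _MockVAE()
        self.mode = None
    def enable_sequential_cpu_offload(self):
        self.mode = "sequential"
    def enable_model_cpu_offload(self):
        self.mode = "model"


demo = _MockPipe()
chosen = configure_offload(demo, vram_gb=20.0, num_gpus=1)
```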

Reasoning

From `inference/cli_demo.py:17`:

# You can change `pipe.enable_sequential_cpu_offload()` to
# `pipe.enable_model_cpu_offload()` to speed up inference,
# but this will use more GPU memory

DeepSpeed incompatibility from `finetune/trainer.py:521-524`:

if self.state.using_deepspeed:
    # Can't using model_cpu_offload in deepspeed,
    # so we need to move all components in pipe to device

Multi-GPU guidance from `inference/cli_demo.py:92-93`:

# add device_map="balanced" in the from_pretrained function and
# remove the enable_model_cpu_offload() function to use Multi GPUs.

VAE slicing/tiling mandatory comment from `tools/parallel_inference/parallel_inference_xdit.py:68-70`:

# Always enable tiling and slicing to avoid VAE OOM while batch size > 1
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
