Environment:Zai org CogVideo Diffusers Inference Environment

From Leeroopedia


Knowledge Sources
Domains: Video_Generation, Inference, Deep_Learning
Last Updated: 2026-02-10 02:00 GMT

Overview

Linux GPU environment with Python 3.10+, PyTorch >= 2.5.0, CUDA, and HuggingFace Diffusers >= 0.31.0 for text-to-video, image-to-video, and video editing inference with CogVideoX models.

Description

This environment provides the runtime stack for generating videos using pre-trained CogVideoX models via the HuggingFace Diffusers pipeline. It supports text-to-video (T2V), image-to-video (I2V), and DDIM-based video editing inference. The pipeline supports sequential CPU offload and model-level CPU offload for memory-constrained GPUs, VAE slicing/tiling for large videos, and optional INT8/FP8 quantization for reduced memory usage.
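A minimal end-to-end sketch of this stack is shown below. The model ID, prompt, and generation settings are illustrative defaults, not fixed requirements; heavy imports are deferred into the function so the module can be inspected without torch or diffusers installed.

```python
# Minimal CogVideoX text-to-video sketch for a memory-constrained GPU.
# Model ID, prompt, and settings are illustrative.

GEN_KWARGS = {
    "num_frames": 49,           # default CogVideoX clip length
    "num_inference_steps": 50,
    "guidance_scale": 6.0,
}

def generate(prompt: str, out_path: str = "output.mp4") -> None:
    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    pipe = CogVideoXPipeline.from_pretrained(
        "THUDM/CogVideoX-2b", torch_dtype=torch.bfloat16
    )
    # Trade speed for VRAM: offload submodules to CPU and decode the VAE in chunks.
    pipe.enable_sequential_cpu_offload()
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()

    video = pipe(prompt=prompt, **GEN_KWARGS).frames[0]
    export_to_video(video, out_path, fps=8)

if __name__ == "__main__":
    try:
        import torch
        gpu_ready = torch.cuda.is_available()
    except ImportError:  # torch not installed; skip the demo run
        gpu_ready = False
    if gpu_ready:
        generate("A panda playing guitar in a bamboo forest")
```

With the three memory optimizations enabled, this fits in the 12 GB VRAM tier listed below; removing them speeds up inference at roughly 3x the memory cost.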

Usage

Use this environment for all inference workflows: text-to-video generation, image-to-video generation, DDIM inversion-based video editing, and LoRA weight loading for inference. This is the prerequisite for running `inference/cli_demo.py`, `inference/cli_demo_quantization.py`, `inference/ddim_inversion.py`, and the Gradio demo applications.

System Requirements

  • OS: Linux (Ubuntu recommended); a CUDA-compatible OS is required
  • Hardware (standard): NVIDIA GPU with >= 12 GB VRAM (with CPU offload enabled)
  • Hardware (no offload): NVIDIA GPU with >= 24-48 GB VRAM (~3x more VRAM without memory optimizations)
  • Hardware (FP8 quantization): NVIDIA H100 or newer (Hopper architecture required for FP8)
  • Hardware (INT8 quantization): any NVIDIA GPU (requires torchao built from source)
  • Hardware (parallel): 4x NVIDIA GPUs (default for xDiT parallel inference)
  • Python: >= 3.10
  • CUDA: >= 11.0 (CUDA 12.0+ for FP8 quantization)

Dependencies

System Packages

  • NVIDIA CUDA Toolkit
  • `ffmpeg` (for video I/O)
  • `libgl1-mesa-glx` (for OpenCV)
  • `libglib2.0-0` (for OpenCV)

Python Packages

  • `torch` >= 2.5.0
  • `torchvision` >= 0.20.0
  • `diffusers` >= 0.31.0
  • `transformers` >= 4.44.0
  • `accelerate` >= 0.34.2
  • `sentencepiece` >= 0.2.0
  • `numpy` == 1.26.0 (pinned)
  • `imageio` >= 2.34.2
  • `imageio-ffmpeg` >= 0.5.1
  • `moviepy` >= 2.0.0
  • `opencv-python` >= 4.10.0.84
  • `gradio` >= 5.4.0 (for web demos)
  • `openai` >= 1.45.0 (for prompt enhancement)
  • `pillow` == 9.5.0 (pinned in Gradio demo)
  • `safetensors` >= 0.4.5

Optional Packages

  • `torchao` — For INT8/FP8 quantization (must install from source)
  • `xfuser` — For xDiT parallel multi-GPU inference
  • `spandrel` >= 0.4.0 — For video upscaling in Gradio demo

Credentials

Optional:

  • `OPENAI_API_KEY`: For prompt enhancement via OpenAI/GLM-4 API
  • `OPENAI_BASE_URL`: Custom OpenAI-compatible API endpoint
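Resolving these optional variables can be sketched with a small stdlib helper; `resolve_openai_config` is a hypothetical name, not part of the repository, and the fallback endpoint is the public OpenAI default:

```python
import os

def resolve_openai_config():
    """Read the optional prompt-enhancement credentials from the environment.

    Returns (api_key, base_url). api_key is None when OPENAI_API_KEY is
    unset (prompt enhancement disabled); base_url falls back to the public
    OpenAI endpoint when OPENAI_BASE_URL is unset.
    """
    api_key = os.environ.get("OPENAI_API_KEY")
    base_url = os.environ.get("OPENAI_BASE_URL", "https://api.openai.com/v1")
    return api_key, base_url
```

Pass the resulting pair to the `openai` client constructor when enabling prompt enhancement in the demos.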

Quick Install

# Install core inference dependencies
pip install "torch>=2.5.0" "torchvision>=0.20.0" "diffusers>=0.31.0" "transformers>=4.44.0" \
    "accelerate>=0.34.2" "sentencepiece>=0.2.0" "numpy==1.26.0" "imageio>=2.34.2" \
    "imageio-ffmpeg>=0.5.1" "moviepy>=2.0.0" "opencv-python>=4.10.0.84" "safetensors>=0.4.5"

# For Gradio web demos
pip install "gradio>=5.4.0" "openai>=1.45.0" "pillow==9.5.0" "spandrel>=0.4.0"

# For quantized inference (torchao must be installed from source)
pip install git+https://github.com/pytorch/ao.git

# For parallel multi-GPU inference
pip install xfuser

Code Evidence

CPU offload configuration from `inference/cli_demo.py:17`:

# You can change `pipe.enable_sequential_cpu_offload()` to
# `pipe.enable_model_cpu_offload()` to speed up inference,
# but this will use more GPU memory

FP8 hardware requirement from `inference/cli_demo_quantization.py:7-9`:

# Only NVIDIA GPUs such as the H100 or higher support FP8 quantization.
# All quantization schemes must be used with NVIDIA GPUs.
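For GPUs without FP8 support, INT8 weight-only quantization of the transformer can be sketched with torchao's `quantize_` API. This is an illustrative sketch under the stated assumptions (torchao installed from source, diffusers >= 0.31.0), not the repository's exact script:

```python
def load_quantized_transformer(model_id: str = "THUDM/CogVideoX-5b"):
    """Load the CogVideoX transformer with INT8 weight-only quantization.

    Sketch only: assumes torchao (built from source) and diffusers are
    installed; the model ID is illustrative.
    """
    import torch
    from diffusers import CogVideoXTransformer3DModel
    from torchao.quantization import quantize_, int8_weight_only

    transformer = CogVideoXTransformer3DModel.from_pretrained(
        model_id, subfolder="transformer", torch_dtype=torch.bfloat16
    )
    # Replace linear weights with INT8 in place to cut transformer memory.
    quantize_(transformer, int8_weight_only())
    return transformer
```

The quantized transformer is then passed to `CogVideoXPipeline.from_pretrained(model_id, transformer=transformer, ...)` as usual.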

TorchDynamo configuration for quantized inference from `inference/cli_demo_quantization.py:33-39`:

torch._dynamo.config.suppress_errors = True
torch.set_float32_matmul_precision("high")
torch._inductor.config.conv_1x1_as_mm = True
torch._inductor.config.coordinate_descent_tuning = True
torch._inductor.config.epilogue_fusion = False
torch._inductor.config.coordinate_descent_check_all_directions = True

Multi-GPU comment from `inference/cli_demo.py:92-93`:

# add device_map="balanced" in the from_pretrained function and
# remove the enable_model_cpu_offload() function to use Multi GPUs.
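Applied in code, the change described in that comment looks roughly like this (a sketch; the model ID is illustrative, and sufficient aggregate VRAM across the visible GPUs is assumed):

```python
def load_multi_gpu_pipeline(model_id: str = "THUDM/CogVideoX-5b"):
    """Load a CogVideoX pipeline sharded across all visible GPUs.

    Sketch only: device_map="balanced" distributes the model across GPUs;
    do NOT combine it with enable_model_cpu_offload() or
    enable_sequential_cpu_offload().
    """
    import torch
    from diffusers import CogVideoXPipeline

    return CogVideoXPipeline.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="balanced"
    )
```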

Common Errors

  • CUDA out of memory during generation: GPU VRAM is insufficient. Enable `pipe.enable_sequential_cpu_offload()` and turn on VAE slicing and tiling.
  • FP8 quantization fails: the GPU lacks FP8 support. FP8 requires an H100 or newer; use INT8 quantization on older GPUs.
  • `torch.load doesn't support weights_only`: the installed PyTorch is too old. Upgrade to PyTorch >= 2.0 for safe weight loading.
  • Tensor shape mismatch at a custom resolution: only I2V models support custom resolutions; T2V models force the default resolution.
  • Slow inference: CPU offload is enabled. Disable CPU offload if VRAM allows, or use `device_map="balanced"` for multi-GPU inference.

Compatibility Notes

  • Single GPU: Use `enable_sequential_cpu_offload()` for minimum VRAM or `enable_model_cpu_offload()` for faster inference with more VRAM.
  • Multi-GPU: Disable CPU offload entirely; use `device_map="balanced"` in `from_pretrained()` instead.
  • FP8 Quantization: Only available on NVIDIA Hopper architecture (H100+) with CUDA 12.0+.
  • INT8 Quantization: Works on any NVIDIA GPU but requires `torchao` installed from source.
  • Parallel Inference (xDiT): `ulysses_degree` must evenly divide the number of attention heads (30). Valid values: 1, 2, 3, 5, 6, 10, 15, 30.
  • numpy: Pinned to 1.26.0. Do not upgrade.
  • pillow: Pinned to 9.5.0 in the Gradio composite demo.
  • Speed vs Memory: Disabling all optimizations (CPU offload, slicing, tiling) gives 3-4x speedup at 3x memory cost.
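The xDiT head-divisibility rule above can be checked with a small helper (hypothetical, not part of the repository); CogVideoX uses 30 attention heads, so the valid degrees are exactly the divisors of 30:

```python
def valid_ulysses_degrees(num_attention_heads: int = 30) -> list[int]:
    """Return the ulysses_degree values that evenly divide the head count."""
    return [d for d in range(1, num_attention_heads + 1)
            if num_attention_heads % d == 0]
```

Running `valid_ulysses_degrees()` reproduces the list given above: 1, 2, 3, 5, 6, 10, 15, 30.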
