Environment:Zai org CogVideo Diffusers Inference Environment

From Leeroopedia


Knowledge Sources
Domains: Video_Generation, Inference, Deep_Learning
Last Updated: 2026-02-10 02:00 GMT

Overview

Linux GPU environment with Python 3.10+, PyTorch >= 2.5.0, CUDA, and HuggingFace Diffusers >= 0.31.0 for text-to-video, image-to-video, and video editing inference with CogVideoX models.

Description

This environment provides the runtime stack for generating videos using pre-trained CogVideoX models via the HuggingFace Diffusers pipeline. It supports text-to-video (T2V), image-to-video (I2V), and DDIM-based video editing inference. The pipeline supports sequential CPU offload and model-level CPU offload for memory-constrained GPUs, VAE slicing/tiling for large videos, and optional INT8/FP8 quantization for reduced memory usage.
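A minimal end-to-end sketch of this stack is shown below. The model ID, prompt, and generation settings are illustrative defaults, not fixed requirements; heavy imports are deferred into the function so the module can be inspected without torch or diffusers installed.

```python
# Minimal CogVideoX text-to-video sketch for a memory-constrained GPU.
# Model ID, prompt, and settings are illustrative.

GEN_KWARGS = {
    "num_frames": 49,           # default CogVideoX clip length
    "num_inference_steps": 50,
    "guidance_scale": 6.0,
}

def generate(prompt: str, out_path: str = "output.mp4") -> None:
    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    pipe = CogVideoXPipeline.from_pretrained(
        "THUDM/CogVideoX-2b", torch_dtype=torch.bfloat16
    )
    # Trade speed for VRAM: offload submodules to CPU and decode the VAE in chunks.
    pipe.enable_sequential_cpu_offload()
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()

    video = pipe(prompt=prompt, **GEN_KWARGS).frames[0]
    export_to_video(video, out_path, fps=8)

if __name__ == "__main__":
    try:
        import torch
        gpu_ready = torch.cuda.is_available()
    except ImportError:  # torch not installed; skip the demo run
        gpu_ready = False
    if gpu_ready:
        generate("A panda playing guitar in a bamboo forest")
```

With the three memory optimizations enabled, this fits in the 12 GB VRAM tier listed below; removing them speeds up inference at roughly 3x the memory cost.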

Usage

Use this environment for all inference workflows: text-to-video generation, image-to-video generation, DDIM inversion-based video editing, and LoRA weight loading for inference. This is the prerequisite for running `inference/cli_demo.py`, `inference/cli_demo_quantization.py`, `inference/ddim_inversion.py`, and the Gradio demo applications.

System Requirements

  • OS: Linux (Ubuntu recommended); a CUDA-compatible OS is required
  • Hardware (standard): NVIDIA GPU with >= 12 GB VRAM (with CPU offload enabled)
  • Hardware (no offload): NVIDIA GPU with >= 24-48 GB VRAM (~3x more VRAM without memory optimizations)
  • Hardware (FP8 quantization): NVIDIA H100 or newer (Hopper architecture required for FP8)
  • Hardware (INT8 quantization): any NVIDIA GPU (requires torchao built from source)
  • Hardware (parallel): 4x NVIDIA GPUs (default for xDiT parallel inference)
  • Python: >= 3.10
  • CUDA: >= 11.0 (CUDA 12.0+ for FP8 quantization)

Dependencies

System Packages

  • NVIDIA CUDA Toolkit
  • `ffmpeg` (for video I/O)
  • `libgl1-mesa-glx` (for OpenCV)
  • `libglib2.0-0` (for OpenCV)

Python Packages

  • `torch` >= 2.5.0
  • `torchvision` >= 0.20.0
  • `diffusers` >= 0.31.0
  • `transformers` >= 4.44.0
  • `accelerate` >= 0.34.2
  • `sentencepiece` >= 0.2.0
  • `numpy` == 1.26.0 (pinned)
  • `imageio` >= 2.34.2
  • `imageio-ffmpeg` >= 0.5.1
  • `moviepy` >= 2.0.0
  • `opencv-python` >= 4.10.0.84
  • `gradio` >= 5.4.0 (for web demos)
  • `openai` >= 1.45.0 (for prompt enhancement)
  • `pillow` == 9.5.0 (pinned in Gradio demo)
  • `safetensors` >= 0.4.5

Optional Packages

  • `torchao` — For INT8/FP8 quantization (must install from source)
  • `xfuser` — For xDiT parallel multi-GPU inference
  • `spandrel` >= 0.4.0 — For video upscaling in Gradio demo

Credentials

Optional:

  • `OPENAI_API_KEY`: For prompt enhancement via OpenAI/GLM-4 API
  • `OPENAI_BASE_URL`: Custom OpenAI-compatible API endpoint
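Resolving these optional variables can be sketched with a small stdlib helper; `resolve_openai_config` is a hypothetical name, not part of the repository, and the fallback endpoint is the public OpenAI default:

```python
import os

def resolve_openai_config():
    """Read the optional prompt-enhancement credentials from the environment.

    Returns (api_key, base_url). api_key is None when OPENAI_API_KEY is
    unset (prompt enhancement disabled); base_url falls back to the public
    OpenAI endpoint when OPENAI_BASE_URL is unset.
    """
    api_key = os.environ.get("OPENAI_API_KEY")
    base_url = os.environ.get("OPENAI_BASE_URL", "https://api.openai.com/v1")
    return api_key, base_url
```

Pass the resulting pair to the `openai` client constructor when enabling prompt enhancement in the demos.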

Quick Install

# Install core inference dependencies
pip install "torch>=2.5.0" "torchvision>=0.20.0" "diffusers>=0.31.0" "transformers>=4.44.0" \
    "accelerate>=0.34.2" "sentencepiece>=0.2.0" "numpy==1.26.0" "imageio>=2.34.2" \
    "imageio-ffmpeg>=0.5.1" "moviepy>=2.0.0" "opencv-python>=4.10.0.84" "safetensors>=0.4.5"

# For Gradio web demos
pip install "gradio>=5.4.0" "openai>=1.45.0" "pillow==9.5.0" "spandrel>=0.4.0"

# For quantized inference (torchao must be installed from source)
pip install git+https://github.com/pytorch/ao.git

# For parallel multi-GPU inference
pip install xfuser

Code Evidence

CPU offload configuration from `inference/cli_demo.py:17`:

# You can change `pipe.enable_sequential_cpu_offload()` to
# `pipe.enable_model_cpu_offload()` to speed up inference,
# but this will use more GPU memory

FP8 hardware requirement from `inference/cli_demo_quantization.py:7-9`:

# Only NVIDIA GPUs such as the H100 or higher support FP8 quantization.
# All quantization schemes must be used with NVIDIA GPUs.
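For GPUs without FP8 support, INT8 weight-only quantization of the transformer can be sketched with torchao's `quantize_` API. This is an illustrative sketch under the stated assumptions (torchao installed from source, diffusers >= 0.31.0), not the repository's exact script:

```python
def load_quantized_transformer(model_id: str = "THUDM/CogVideoX-5b"):
    """Load the CogVideoX transformer with INT8 weight-only quantization.

    Sketch only: assumes torchao (built from source) and diffusers are
    installed; the model ID is illustrative.
    """
    import torch
    from diffusers import CogVideoXTransformer3DModel
    from torchao.quantization import quantize_, int8_weight_only

    transformer = CogVideoXTransformer3DModel.from_pretrained(
        model_id, subfolder="transformer", torch_dtype=torch.bfloat16
    )
    # Replace linear weights with INT8 in place to cut transformer memory.
    quantize_(transformer, int8_weight_only())
    return transformer
```

The quantized transformer is then passed to `CogVideoXPipeline.from_pretrained(model_id, transformer=transformer, ...)` as usual.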

TorchDynamo configuration for quantized inference from `inference/cli_demo_quantization.py:33-39`:

torch._dynamo.config.suppress_errors = True
torch.set_float32_matmul_precision("high")
torch._inductor.config.conv_1x1_as_mm = True
torch._inductor.config.coordinate_descent_tuning = True
torch._inductor.config.epilogue_fusion = False
torch._inductor.config.coordinate_descent_check_all_directions = True

Multi-GPU comment from `inference/cli_demo.py:92-93`:

# add device_map="balanced" in the from_pretrained function and
# remove the enable_model_cpu_offload() function to use Multi GPUs.
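Applied in code, the change described in that comment looks roughly like this (a sketch; the model ID is illustrative, and sufficient aggregate VRAM across the visible GPUs is assumed):

```python
def load_multi_gpu_pipeline(model_id: str = "THUDM/CogVideoX-5b"):
    """Load a CogVideoX pipeline sharded across all visible GPUs.

    Sketch only: device_map="balanced" distributes the model across GPUs;
    do NOT combine it with enable_model_cpu_offload() or
    enable_sequential_cpu_offload().
    """
    import torch
    from diffusers import CogVideoXPipeline

    return CogVideoXPipeline.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="balanced"
    )
```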

Common Errors

  • CUDA out of memory during generation: GPU VRAM is insufficient. Enable `pipe.enable_sequential_cpu_offload()` and turn on VAE slicing and tiling.
  • FP8 quantization fails: the GPU lacks FP8 support. FP8 requires an H100 or newer; use INT8 quantization on older GPUs.
  • `torch.load doesn't support weights_only`: the installed PyTorch is too old. Upgrade to PyTorch >= 2.0 for safe weight loading.
  • Tensor shape mismatch at a custom resolution: only I2V models support custom resolutions; T2V models force the default resolution.
  • Slow inference: CPU offload is enabled. Disable CPU offload if VRAM allows, or use `device_map="balanced"` for multi-GPU inference.

Compatibility Notes

  • Single GPU: Use `enable_sequential_cpu_offload()` for minimum VRAM or `enable_model_cpu_offload()` for faster inference with more VRAM.
  • Multi-GPU: Disable CPU offload entirely; use `device_map="balanced"` in `from_pretrained()` instead.
  • FP8 Quantization: Only available on NVIDIA Hopper architecture (H100+) with CUDA 12.0+.
  • INT8 Quantization: Works on any NVIDIA GPU but requires `torchao` installed from source.
  • Parallel Inference (xDiT): `ulysses_degree` must evenly divide the number of attention heads (30). Valid values: 1, 2, 3, 5, 6, 10, 15, 30.
  • numpy: Pinned to 1.26.0. Do not upgrade.
  • pillow: Pinned to 9.5.0 in the Gradio composite demo.
  • Speed vs Memory: Disabling all optimizations (CPU offload, slicing, tiling) gives 3-4x speedup at 3x memory cost.
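The xDiT head-divisibility rule above can be checked with a small helper (hypothetical, not part of the repository); CogVideoX uses 30 attention heads, so the valid degrees are exactly the divisors of 30:

```python
def valid_ulysses_degrees(num_attention_heads: int = 30) -> list[int]:
    """Return the ulysses_degree values that evenly divide the head count."""
    return [d for d in range(1, num_attention_heads + 1)
            if num_attention_heads % d == 0]
```

Running `valid_ulysses_degrees()` reproduces the list given above: 1, 2, 3, 5, 6, 10, 15, 30.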
