Environment:Zai org CogVideo Diffusers Inference Environment
| Knowledge Sources | |
|---|---|
| Domains | Video_Generation, Inference, Deep_Learning |
| Last Updated | 2026-02-10 02:00 GMT |
Overview
Linux GPU environment with Python 3.10+, PyTorch >= 2.5.0, CUDA, and HuggingFace Diffusers >= 0.31.0 for text-to-video, image-to-video, and video editing inference with CogVideoX models.
Description
This environment provides the runtime stack for generating videos with pre-trained CogVideoX models via the HuggingFace Diffusers pipeline. It supports text-to-video (T2V), image-to-video (I2V), and DDIM-based video editing inference. For memory-constrained GPUs, the pipeline offers sequential and model-level CPU offload, VAE slicing/tiling for large videos, and optional INT8/FP8 quantization to reduce memory usage.
Usage
Use this environment for all inference workflows: text-to-video generation, image-to-video generation, DDIM inversion-based video editing, and LoRA weight loading for inference. This is the prerequisite for running `inference/cli_demo.py`, `inference/cli_demo_quantization.py`, `inference/ddim_inversion.py`, and the Gradio demo applications.
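As a minimal end-to-end sketch of the T2V path (the checkpoint id and generation parameters below are illustrative defaults, not necessarily what `inference/cli_demo.py` uses):

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

def generate_video(prompt: str, model_id: str = "THUDM/CogVideoX-5b") -> str:
    # Load the pre-trained T2V pipeline in bfloat16 to halve weight memory.
    pipe = CogVideoXPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    pipe.enable_model_cpu_offload()  # trade some speed for lower VRAM use

    video = pipe(
        prompt=prompt,
        num_inference_steps=50,
        guidance_scale=6.0,
        num_frames=49,
        generator=torch.Generator().manual_seed(42),  # reproducible sampling
    ).frames[0]

    out_path = "output.mp4"
    export_to_video(video, out_path, fps=8)
    return out_path

if __name__ == "__main__":
    print(generate_video("A panda playing guitar in a bamboo forest"))
```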
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | CUDA-compatible OS required |
| Hardware (standard) | NVIDIA GPU >= 12GB VRAM | With CPU offload enabled |
| Hardware (no offload) | NVIDIA GPU with 24-48GB VRAM | ~3x more VRAM needed without optimizations |
| Hardware (FP8 quant) | NVIDIA H100 or higher | Hopper architecture required for FP8 |
| Hardware (INT8 quant) | Any NVIDIA GPU | Requires torchao from source |
| Hardware (parallel) | 4x NVIDIA GPU | Default for xDiT parallel inference |
| Python | >= 3.10 | |
| CUDA | >= 11.0 | CUDA 12.0+ for FP8 quantization |
Dependencies
System Packages
- NVIDIA CUDA Toolkit
- `ffmpeg` (for video I/O)
- `libgl1-mesa-glx` (for OpenCV)
- `libglib2.0-0` (for OpenCV)
Python Packages
- `torch` >= 2.5.0
- `torchvision` >= 0.20.0
- `diffusers` >= 0.31.0
- `transformers` >= 4.44.0
- `accelerate` >= 0.34.2
- `sentencepiece` >= 0.2.0
- `numpy` == 1.26.0 (pinned)
- `imageio` >= 2.34.2
- `imageio-ffmpeg` >= 0.5.1
- `moviepy` >= 2.0.0
- `opencv-python` >= 4.10.0.84
- `gradio` >= 5.4.0 (for web demos)
- `openai` >= 1.45.0 (for prompt enhancement)
- `pillow` == 9.5.0 (pinned in Gradio demo)
- `safetensors` >= 0.4.5
Optional Packages
- `torchao` — For INT8/FP8 quantization (must install from source)
- `xfuser` — For xDiT parallel multi-GPU inference
- `spandrel` >= 0.4.0 — For video upscaling in Gradio demo
Credentials
Optional:
- `OPENAI_API_KEY`: For prompt enhancement via OpenAI/GLM-4 API
- `OPENAI_BASE_URL`: Custom OpenAI-compatible API endpoint
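Both variables are optional and can be read with the standard library; a small helper (names are illustrative) from which the OpenAI client settings could be built:

```python
import os

def openai_settings() -> dict:
    """Collect optional prompt-enhancement credentials from the environment."""
    return {
        # None disables prompt enhancement entirely.
        "api_key": os.environ.get("OPENAI_API_KEY"),
        # None falls back to the default OpenAI endpoint.
        "base_url": os.environ.get("OPENAI_BASE_URL"),
    }
```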
Quick Install
```shell
# Install core inference dependencies (quote the specifiers so the
# shell does not treat ">=" as redirection)
pip install "torch>=2.5.0" "torchvision>=0.20.0" "diffusers>=0.31.0" "transformers>=4.44.0" \
    "accelerate>=0.34.2" "sentencepiece>=0.2.0" "numpy==1.26.0" "imageio>=2.34.2" \
    "imageio-ffmpeg>=0.5.1" "moviepy>=2.0.0" "opencv-python>=4.10.0.84" "safetensors>=0.4.5"

# For Gradio web demos
pip install "gradio>=5.4.0" "openai>=1.45.0" "pillow==9.5.0" "spandrel>=0.4.0"

# For quantized inference (torchao must be installed from source)
pip install git+https://github.com/pytorch/ao.git

# For parallel multi-GPU inference
pip install xfuser
```
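After installing, a quick stdlib-only sanity check against the minimum versions can look like the sketch below (the package floors are copied from the dependency list above; helper names are illustrative, and pre-release suffixes are not handled):

```python
from importlib import metadata

# Minimum versions taken from the dependency list; extend as needed.
MINIMUMS = {"torch": "2.5.0", "diffusers": "0.31.0", "transformers": "4.44.0"}

def parse(version: str) -> tuple:
    # Compare dotted numeric versions without third-party helpers;
    # local suffixes such as "+cu121" are stripped first.
    return tuple(int(p) for p in version.split("+")[0].split(".")[:3] if p.isdigit())

def check_installed(minimums: dict = MINIMUMS) -> dict:
    results = {}
    for pkg, floor in minimums.items():
        try:
            results[pkg] = parse(metadata.version(pkg)) >= parse(floor)
        except metadata.PackageNotFoundError:
            results[pkg] = False  # not installed at all
    return results
```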
Code Evidence
CPU offload configuration from `inference/cli_demo.py:17`:
```python
# You can change `pipe.enable_sequential_cpu_offload()` to
# `pipe.enable_model_cpu_offload()` to speed up inference,
# but this will use more GPU memory
```
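The offload switch above combines with the VAE memory options mentioned in the overview; a minimal configuration sketch (the checkpoint id is illustrative):

```python
import torch
from diffusers import CogVideoXPipeline

# Illustrative checkpoint id; substitute the CogVideoX variant you use.
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Lowest-VRAM mode; switch to pipe.enable_model_cpu_offload() for speed.
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()  # decode the latent video in slices
pipe.vae.enable_tiling()   # tile large spatial dimensions during decode
```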
FP8 hardware requirement from `inference/cli_demo_quantization.py:7-9`:
```python
# Only NVIDIA GPUs like H100 or higher support FP8 quantization.
# All quantization schemes must be used with NVIDIA GPUs.
```
TorchDynamo configuration for quantized inference from `inference/cli_demo_quantization.py:33-39`:
```python
torch._dynamo.config.suppress_errors = True
torch.set_float32_matmul_precision("high")
torch._inductor.config.conv_1x1_as_mm = True
torch._inductor.config.coordinate_descent_tuning = True
torch._inductor.config.epilogue_fusion = False
torch._inductor.config.coordinate_descent_check_all_directions = True
```
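Alongside these inductor settings, the quantization demo applies torchao to the model weights before building the pipeline. A hedged INT8 sketch, assuming torchao's `quantize_`/`int8_weight_only` API and an illustrative checkpoint id:

```python
import torch
from diffusers import CogVideoXPipeline, CogVideoXTransformer3DModel
from torchao.quantization import quantize_, int8_weight_only

model_id = "THUDM/CogVideoX-5b"  # illustrative checkpoint

# Quantize only the transformer: it dominates the memory footprint.
transformer = CogVideoXTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
quantize_(transformer, int8_weight_only())  # weight-only INT8; any NVIDIA GPU

pipe = CogVideoXPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()
```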
Multi-GPU comment from `inference/cli_demo.py:92-93`:
```python
# add device_map="balanced" in the from_pretrained function and
# remove the enable_model_cpu_offload() function to use Multi GPUs.
```
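Put concretely, a multi-GPU load might look like this sketch (the checkpoint id is illustrative; note that no CPU-offload call is made):

```python
import torch
from diffusers import CogVideoXPipeline

# Shard weights across all visible GPUs; do NOT combine with CPU offload.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",  # illustrative checkpoint
    torch_dtype=torch.bfloat16,
    device_map="balanced",
)
```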
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| CUDA OOM during generation | GPU VRAM insufficient | Enable `pipe.enable_sequential_cpu_offload()`; enable VAE slicing and tiling |
| FP8 quantization fails | GPU does not support FP8 | Requires H100 or higher; use INT8 for older GPUs |
| `torch.load doesn't support weights_only` | Old PyTorch version | Upgrade to PyTorch >= 2.5.0 (the environment minimum) for safe loading |
| Tensor shape mismatch with custom resolution | Non-I2V model with custom resolution | Only I2V models support custom resolution; T2V models force default resolution |
| Slow inference | CPU offload enabled | Disable CPU offload if VRAM allows; use `device_map="balanced"` for multi-GPU |
Compatibility Notes
- Single GPU: Use `enable_sequential_cpu_offload()` for minimum VRAM or `enable_model_cpu_offload()` for faster inference with more VRAM.
- Multi-GPU: Disable CPU offload entirely; use `device_map="balanced"` in `from_pretrained()` instead.
- FP8 Quantization: Only available on NVIDIA Hopper architecture (H100+) with CUDA 12.0+.
- INT8 Quantization: Works on any NVIDIA GPU but requires `torchao` installed from source.
- Parallel Inference (xDiT): `ulysses_degree` must evenly divide the number of attention heads (30). Valid values: 1, 2, 3, 5, 6, 10, 15, 30.
- numpy: Pinned to 1.26.0. Do not upgrade.
- pillow: Pinned to 9.5.0 in the Gradio composite demo.
- Speed vs Memory: Disabling all optimizations (CPU offload, slicing, tiling) gives 3-4x speedup at 3x memory cost.
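The valid `ulysses_degree` values follow directly from the divisibility rule; a small helper (name is illustrative) reproduces the list:

```python
NUM_ATTENTION_HEADS = 30  # CogVideoX attention-head count cited above

def valid_ulysses_degrees(heads: int = NUM_ATTENTION_HEADS) -> list:
    # ulysses_degree must evenly divide the attention-head count.
    return [d for d in range(1, heads + 1) if heads % d == 0]
```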
Related Pages
- Implementation:Zai_org_CogVideo_CogVideoXPipeline_From_Pretrained
- Implementation:Zai_org_CogVideo_Inference_Load_Lora_Weights
- Implementation:Zai_org_CogVideo_CogVideoXDPMScheduler_From_Config
- Implementation:Zai_org_CogVideo_Pipeline_CPU_Offload
- Implementation:Zai_org_CogVideo_CogVideoXPipeline_Call
- Implementation:Zai_org_CogVideo_Export_To_Video
- Implementation:Zai_org_CogVideo_CogVideoXI2VPipeline_From_Pretrained
- Implementation:Zai_org_CogVideo_Load_Image
- Implementation:Zai_org_CogVideo_I2V_Pipeline_Configuration
- Implementation:Zai_org_CogVideo_CogVideoXI2VPipeline_Call
- Implementation:Zai_org_CogVideo_I2V_Export_To_Video
- Implementation:Zai_org_CogVideo_Get_Video_Frames
- Implementation:Zai_org_CogVideo_DDIM_CogVideoXPipeline_From_Pretrained
- Implementation:Zai_org_CogVideo_Encode_Video_Frames
- Implementation:Zai_org_CogVideo_DDIM_Inversion_Sample
- Implementation:Zai_org_CogVideo_DDIM_Attention_Injection_Reconstruction
- Implementation:Zai_org_CogVideo_DDIM_Export_Latents_To_Video