Implementation:Zai org CogVideo Parallel Inference xDiT
| Knowledge Sources | |
|---|---|
| Domains | Video_Generation, Distributed_Computing, Performance_Optimization |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Parallel Inference xDiT is a script that uses the xDiT (xFuser) distributed framework to accelerate CogVideoX video generation by spreading inference across multiple GPUs.
Description
This script leverages the xFuserCogVideoXPipeline from the xfuser library to distribute CogVideoX video generation across multiple GPUs using a combination of parallelism strategies:
- Ulysses Attention Parallelism: Splits the input sequence across GPUs and uses all-to-all communication so that each GPU computes attention for a subset of heads. For this reason the number of attention heads (30 for CogVideoX) must be divisible by the Ulysses degree.
- Ring Attention: Distributes attention computation in a ring topology for memory-efficient long-sequence processing.
- Tensor Parallelism: Splits model weight tensors across GPUs.
- CFG Parallelism: Distributes classifier-free guidance computations (conditional and unconditional) across separate GPUs.
- PipeFusion Parallelism: Pipeline parallelism for the diffusion denoising process.
The script validates parallelism configuration (checking that ulysses_degree divides the number of attention heads), enables VAE slicing and tiling to prevent out-of-memory errors, and runs inference with dynamic CFG at guidance scale 6. After generation, it reports elapsed time and peak GPU memory, and saves the output video with the parallelism configuration encoded in the filename.
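The validation step described above can be sketched in plain Python. This is a minimal illustration, not the script's actual code: `ParallelDegrees`, `required_gpus`, and `validate` are hypothetical names, and the rule that the GPU count equals the product of all degrees (with CFG parallelism counting as a factor of 2) is an assumption that happens to match the 4-GPU example on this page.

```python
from dataclasses import dataclass

NUM_ATTENTION_HEADS = 30  # head count this page states for CogVideoX


@dataclass
class ParallelDegrees:
    """Hypothetical container for the parallel degrees passed on the CLI."""
    ulysses: int = 1
    ring: int = 1
    tensor: int = 1
    pipefusion: int = 1
    data: int = 1
    use_cfg_parallel: bool = False

    def required_gpus(self) -> int:
        # Assumption: world size is the product of all degrees,
        # with CFG parallelism contributing a factor of 2.
        cfg = 2 if self.use_cfg_parallel else 1
        return (self.ulysses * self.ring * self.tensor
                * self.pipefusion * self.data * cfg)


def validate(degrees: ParallelDegrees, world_size: int,
             num_heads: int = NUM_ATTENTION_HEADS) -> None:
    """Raise ValueError if the configuration cannot be run."""
    if degrees.ulysses < 1 or num_heads % degrees.ulysses != 0:
        raise ValueError(
            f"ulysses_degree={degrees.ulysses} must divide {num_heads} heads"
        )
    if degrees.required_gpus() != world_size:
        raise ValueError(
            f"degrees need {degrees.required_gpus()} GPUs, got {world_size}"
        )
```

Under these assumptions, `--ulysses_degree 1 --ring_degree 2 --use_cfg_parallel` needs 4 processes, while a Ulysses degree of 4 is rejected because 30 is not divisible by 4.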
Usage
Use this script when CogVideoX inference on a single GPU is too slow or memory-constrained. Distributing the computation across 2, 4, or more GPUs reduces generation time, which makes the script suitable for production-scale or latency-sensitive video generation workflows.
Code Reference
Source Location
- Repository: Zai_org_CogVideo
- File: tools/parallel_inference/parallel_inference_xdit.py
Entry Point
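import torch
from xfuser import xFuserArgs, xFuserCogVideoXPipeline
from xfuser.config import FlexibleArgumentParser
from xfuser.core.distributed import get_world_group

def main():
    # Parse the xFuser CLI flags and build the engine and input configs.
    parser = FlexibleArgumentParser(description="xFuser Arguments")
    args = xFuserArgs.add_cli_args(parser).parse_args()
    engine_args = xFuserArgs.from_cli_args(args)
    engine_config, input_config = engine_args.create_config()
    # Rank of this process on the local node.
    local_rank = get_world_group().local_rank
    # Load the CogVideoX pipeline in bfloat16 with the parallel engine config.
    pipe = xFuserCogVideoXPipeline.from_pretrained(
        pretrained_model_name_or_path=engine_config.model_config.model,
        engine_config=engine_config,
        torch_dtype=torch.bfloat16,
    )
    # (excerpt: main() continues with VAE setup, inference, and saving the video)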
Import
# This is a distributed script; run via torchrun:
# torchrun --nproc_per_node=4 tools/parallel_inference/parallel_inference_xdit.py \
# --model <model-path> --ulysses_degree 1 --ring_degree 2 --use_cfg_parallel
I/O Contract
Inputs (via xFuserArgs CLI)
| Name | Type | Description |
|---|---|---|
| --model | str | Path to the CogVideoX pretrained model. |
| --ulysses_degree | int | Degree of Ulysses attention parallelism. Must divide 30 (number of attention heads). |
| --ring_degree | int | Degree of ring attention parallelism. |
| --use_cfg_parallel | flag | Enable CFG parallelism across GPUs. |
| --height | int | Video height in pixels. |
| --width | int | Video width in pixels. |
| --num_frames | int | Number of video frames to generate. |
| --prompt | str | Text prompt for video generation. |
| --num_inference_steps | int | Number of diffusion denoising steps. |
| --seed | int | Random seed for reproducibility. |
| --enable_sequential_cpu_offload | flag | Offload model layers to CPU sequentially to save GPU memory. |
| --tensor_parallel_degree | int | Degree of tensor parallelism. |
| --pipefusion_parallel_degree | int | Degree of PipeFusion (pipeline) parallelism. |
| --data_parallel_degree | int | Degree of data parallelism. |
Outputs
| Name | Type | Description |
|---|---|---|
| Video file | MP4 | Saved to results/cogvideox_{parallel_config}_{resolution}.mp4 at 8 FPS. |
| Console output | text | Elapsed time (seconds) and peak GPU memory (GB). |
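As a rough illustration of the output contract, the sketch below builds a file name following the pattern in the table and formats the console report. The helper names `output_path` and `format_report`, the `{width}x{height}` form of the resolution, the tag string passed as `parallel_config`, and the use of 10^9 bytes per GB are all assumptions made for this example; the script's exact formatting may differ.

```python
def output_path(parallel_config: str, height: int, width: int,
                out_dir: str = "results") -> str:
    """Build a path of the form results/cogvideox_{parallel_config}_{resolution}.mp4."""
    # Assumption: the resolution is written as width x height.
    resolution = f"{width}x{height}"
    return f"{out_dir}/cogvideox_{parallel_config}_{resolution}.mp4"


def format_report(elapsed_seconds: float, peak_memory_bytes: int) -> str:
    """Format elapsed time (seconds) and peak GPU memory (GB) for the console."""
    # Assumption: 1 GB is taken as 10**9 bytes.
    peak_gb = peak_memory_bytes / 1e9
    return f"elapsed: {elapsed_seconds:.2f} s, peak memory: {peak_gb:.2f} GB"


if __name__ == "__main__":
    print(output_path("ulysses1_ring2_cfg2", height=480, width=720))
    print(format_report(12.3456, 2_000_000_000))
```

In a real run, the byte count would typically come from `torch.cuda.max_memory_allocated()` and the elapsed time from a timer wrapped around the pipeline call.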
Pipeline Configuration
| Parameter | Value | Description |
|---|---|---|
| Precision | torch.bfloat16 | Reduced precision for memory efficiency. |
| Guidance scale | 6 | Classifier-free guidance strength. |
| Dynamic CFG | Enabled | Cosine-scheduled guidance for better quality. |
| VAE slicing | Enabled | Processes VAE input in slices to reduce memory. |
| VAE tiling | Enabled | Processes VAE input in tiles to reduce memory. |
| Attention heads | 30 | CogVideoX transformer head count (constrains Ulysses degree). |
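The settings in this table could be applied roughly as follows. This is a sketch under stated assumptions: `configure_vae` and `generation_kwargs` are hypothetical helpers, `pipe` is any object exposing a diffusers-style `vae` with `enable_slicing()` and `enable_tiling()`, and the keyword names follow the diffusers CogVideoX pipeline call (`guidance_scale`, `use_dynamic_cfg`). Seeding via a `torch.Generator` is omitted to keep the example dependency-free.

```python
from typing import Any, Dict


def configure_vae(pipe: Any) -> Any:
    """Enable the VAE memory savers listed in the configuration table."""
    pipe.vae.enable_slicing()  # decode the batch in slices
    pipe.vae.enable_tiling()   # decode each frame in spatial tiles
    return pipe


def generation_kwargs(prompt: str, height: int, width: int,
                      num_frames: int, num_inference_steps: int) -> Dict[str, Any]:
    """Collect call arguments with the fixed guidance settings from the table."""
    return {
        "prompt": prompt,
        "height": height,
        "width": width,
        "num_frames": num_frames,
        "num_inference_steps": num_inference_steps,
        "guidance_scale": 6,       # classifier-free guidance strength
        "use_dynamic_cfg": True,   # cosine-scheduled guidance
    }
```

With a loaded pipeline, usage would look like `configure_vae(pipe)` followed by `pipe(**generation_kwargs(...))`.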
Usage Examples
# Run on 4 GPUs with ring attention degree 2 and CFG parallelism
# torchrun --nproc_per_node=4 tools/parallel_inference/parallel_inference_xdit.py \
# --model THUDM/CogVideoX-5B \
# --ulysses_degree 1 \
# --ring_degree 2 \
# --use_cfg_parallel \
# --height 480 \
# --width 720 \
# --num_frames 49 \
# --prompt "A small dog running on a beach at sunset."
# Run on 2 GPUs with Ulysses attention parallelism
# torchrun --nproc_per_node=2 tools/parallel_inference/parallel_inference_xdit.py \
# --model THUDM/CogVideoX-5B \
# --ulysses_degree 2 \
# --ring_degree 1 \
# --height 480 \
# --width 720 \
# --num_frames 9 \
# --prompt "A city skyline at dusk with lights turning on."
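When these launches are scripted, the command line can be assembled programmatically. The helper below is a hypothetical convenience, not part of the repository; it only uses the flags shown in the examples above and returns an argument list suitable for `subprocess.run`.

```python
import shlex
from typing import List

SCRIPT = "tools/parallel_inference/parallel_inference_xdit.py"


def build_torchrun_command(nproc_per_node: int, model: str, prompt: str,
                           ulysses_degree: int = 1, ring_degree: int = 1,
                           use_cfg_parallel: bool = False,
                           height: int = 480, width: int = 720,
                           num_frames: int = 49) -> List[str]:
    """Assemble a torchrun invocation from the flags documented on this page."""
    cmd = [
        "torchrun", f"--nproc_per_node={nproc_per_node}", SCRIPT,
        "--model", model,
        "--ulysses_degree", str(ulysses_degree),
        "--ring_degree", str(ring_degree),
    ]
    if use_cfg_parallel:
        cmd.append("--use_cfg_parallel")
    cmd += [
        "--height", str(height),
        "--width", str(width),
        "--num_frames", str(num_frames),
        "--prompt", prompt,
    ]
    return cmd


if __name__ == "__main__":
    cmd = build_torchrun_command(
        4, "THUDM/CogVideoX-5B", "A small dog running on a beach at sunset.",
        ring_degree=2, use_cfg_parallel=True,
    )
    # shlex.join quotes the prompt so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Passing the list directly to `subprocess.run(cmd, check=True)` avoids shell quoting issues with prompts that contain spaces or punctuation.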