Implementation:Zai org CogVideo Parallel Inference xDiT

Knowledge Sources	Zai_org_CogVideo xDiT
Domains	Video_Generation, Distributed_Computing, Performance_Optimization
Last Updated	2026-02-10 00:00 GMT

Overview

Parallel Inference xDiT is a script that uses the xDiT (xFuser) distributed inference framework to run CogVideoX video generation across multiple GPUs.

Description

This script uses the xFuserCogVideoXPipeline from the xfuser library to distribute CogVideoX video generation across multiple GPUs. The following parallelism strategies can be combined:

Ulysses Attention Parallelism: Splits attention computation across GPUs along the sequence dimension. Requires that the number of attention heads (30 for CogVideoX) is divisible by the Ulysses degree.
Ring Attention: Distributes attention computation in a ring topology for memory-efficient long-sequence processing.
Tensor Parallelism: Splits model weight tensors across GPUs.
CFG Parallelism: Distributes classifier-free guidance computations (conditional and unconditional) across separate GPUs.
PipeFusion Parallelism: Patch-level pipeline parallelism that splits the transformer layers across GPUs during diffusion denoising.

The script validates parallelism configuration (checking that ulysses_degree divides the number of attention heads), enables VAE slicing and tiling to prevent out-of-memory errors, and runs inference with dynamic CFG at guidance scale 6. After generation, it reports elapsed time and peak GPU memory, and saves the output video with the parallelism configuration encoded in the filename.
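
The divisibility check described above can be sketched as follows. This is a minimal illustration, not the script's exact code: the function names `valid_ulysses_degrees` and `validate_ulysses_degree` are hypothetical, and the head count of 30 is the value given on this page.

```python
from __future__ import annotations

NUM_ATTENTION_HEADS = 30  # CogVideoX head count as stated on this page


def valid_ulysses_degrees(num_heads: int = NUM_ATTENTION_HEADS) -> list[int]:
    """Return every degree that splits the attention heads evenly across GPUs."""
    return [d for d in range(1, num_heads + 1) if num_heads % d == 0]


def validate_ulysses_degree(ulysses_degree: int, num_heads: int = NUM_ATTENTION_HEADS) -> int:
    """Raise if the heads cannot be divided evenly; otherwise return heads per GPU."""
    if ulysses_degree < 1 or num_heads % ulysses_degree != 0:
        raise ValueError(
            f"ulysses_degree ({ulysses_degree}) must be a divisor of "
            f"the number of attention heads ({num_heads})"
        )
    return num_heads // ulysses_degree
```

With 30 heads, the degrees 1, 2, 3, 5, 6, 10, 15, and 30 pass the check, while a degree such as 4 or 8 is rejected.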

Usage

Use this script when CogVideoX inference on a single GPU is too slow or runs out of memory. It distributes the computation across 2, 4, or more GPUs to reduce the time needed to generate each video. The achievable speedup depends on the chosen parallelism configuration and on the hardware.

Code Reference

Source Location

Repository: Zai_org_CogVideo
File: tools/parallel_inference/parallel_inference_xdit.py

Entry Point

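import torch
from xfuser import xFuserArgs, xFuserCogVideoXPipeline
from xfuser.config import FlexibleArgumentParser
from xfuser.core.distributed import get_world_group

def main():
    # Parse the xFuser CLI flags and build the engine and input configs.
    parser = FlexibleArgumentParser(description="xFuser Arguments")
    args = xFuserArgs.add_cli_args(parser).parse_args()
    engine_args = xFuserArgs.from_cli_args(args)
    engine_config, input_config = engine_args.create_config()
    # Local rank of this process, used to select its GPU.
    local_rank = get_world_group().local_rank
    # Load the CogVideoX pipeline in bfloat16 with the parallel configuration applied.
    pipe = xFuserCogVideoXPipeline.from_pretrained(
        pretrained_model_name_or_path=engine_config.model_config.model,
        engine_config=engine_config,
        torch_dtype=torch.bfloat16,
    )
    # (excerpt: VAE setup, inference, and saving follow in the script)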

Import

# This is a distributed script; run via torchrun:
# torchrun --nproc_per_node=4 tools/parallel_inference/parallel_inference_xdit.py \
#     --model <model-path> --ulysses_degree 1 --ring_degree 2 --use_cfg_parallel

I/O Contract

Inputs (via xFuserArgs CLI)

Name	Type	Description
`--model`	`str`	Path to the CogVideoX pretrained model.
`--ulysses_degree`	`int`	Degree of Ulysses attention parallelism. Must divide 30 (number of attention heads).
`--ring_degree`	`int`	Degree of ring attention parallelism.
`--use_cfg_parallel`	flag	Enable CFG parallelism across GPUs.
`--height`	`int`	Video height in pixels.
`--width`	`int`	Video width in pixels.
`--num_frames`	`int`	Number of video frames to generate.
`--prompt`	`str`	Text prompt for video generation.
`--num_inference_steps`	`int`	Number of diffusion denoising steps.
`--seed`	`int`	Random seed for reproducibility.
`--enable_sequential_cpu_offload`	flag	Offload model layers to CPU sequentially to save GPU memory.
`--tensor_parallel_degree`	`int`	Degree of tensor parallelism.
`--pipefusion_parallel_degree`	`int`	Degree of PipeFusion parallelism.
`--data_parallel_degree`	`int`	Degree of data parallelism.
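
When choosing `--nproc_per_node`, it can help to compute how many processes a given combination of flags needs. The sketch below rests on two assumptions about xDiT's behavior that match the usage examples on this page but are not stated by it: the product of all parallel degrees must equal the number of launched processes, and enabling CFG parallelism contributes a factor of 2 (one group each for the conditional and unconditional passes). `required_world_size` is a hypothetical helper, not part of xfuser.

```python
def required_world_size(
    ulysses_degree: int = 1,
    ring_degree: int = 1,
    use_cfg_parallel: bool = False,
    tensor_parallel_degree: int = 1,
    pipefusion_parallel_degree: int = 1,
    data_parallel_degree: int = 1,
) -> int:
    """Number of processes (GPUs) implied by the parallel degrees.

    Assumption: the world size is the product of all degrees, with
    CFG parallelism counting as a degree of 2 when enabled.
    """
    cfg_degree = 2 if use_cfg_parallel else 1
    return (
        data_parallel_degree
        * cfg_degree
        * ulysses_degree
        * ring_degree
        * tensor_parallel_degree
        * pipefusion_parallel_degree
    )
```

For example, `--ulysses_degree 1 --ring_degree 2 --use_cfg_parallel` gives 1 × 2 × 2 = 4 processes, and `--ulysses_degree 2 --ring_degree 1` gives 2.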

Outputs

Name	Type	Description
Video file	MP4	Saved to `results/cogvideox_{parallel_config}_{resolution}.mp4` at 8 FPS.
Console output	text	Elapsed time (seconds) and peak GPU memory (GB).
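
The output path pattern can be illustrated with a small helper. The exact fields and their order inside `{parallel_config}`, and the `{width}x{height}` form of `{resolution}`, are assumptions made for this sketch; only the overall `results/cogvideox_{parallel_config}_{resolution}.mp4` pattern comes from this page. `build_output_filename` is a hypothetical name.

```python
def build_output_filename(
    width: int,
    height: int,
    ulysses_degree: int = 1,
    ring_degree: int = 1,
    cfg_degree: int = 1,
    tensor_parallel_degree: int = 1,
    pipefusion_parallel_degree: int = 1,
    data_parallel_degree: int = 1,
) -> str:
    """Encode the parallel configuration and resolution in the video path."""
    # Assumed layout of the parallel-config segment (illustrative only).
    parallel_config = (
        f"dp{data_parallel_degree}_cfg{cfg_degree}_"
        f"ulysses{ulysses_degree}_ring{ring_degree}_"
        f"tp{tensor_parallel_degree}_pp{pipefusion_parallel_degree}"
    )
    resolution = f"{width}x{height}"  # assumed "<width>x<height>" form
    return f"results/cogvideox_{parallel_config}_{resolution}.mp4"
```

Encoding the configuration in the filename keeps runs with different degrees from overwriting each other, which is convenient when comparing timings.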

Pipeline Configuration

Parameter	Value	Description
Precision	`torch.bfloat16`	Reduced precision for memory efficiency.
Guidance scale	6	Classifier-free guidance strength.
Dynamic CFG	Enabled	Cosine-scheduled guidance for better quality.
VAE slicing	Enabled	Processes VAE input in slices to reduce memory.
VAE tiling	Enabled	Processes VAE input in tiles to reduce memory.
Attention heads	30	CogVideoX transformer head count (constrains Ulysses degree).
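
The settings in the table above could be applied roughly as in the sketch below. It is an illustration under stated assumptions, not the script's code: `run_inference` is a hypothetical wrapper, and `pipe` is assumed to be an already loaded pipeline object whose VAE exposes `enable_slicing()` and `enable_tiling()` and whose call accepts `guidance_scale` and `use_dynamic_cfg`, as diffusers-style CogVideoX pipelines do.

```python
import time


def run_inference(pipe, *, height, width, num_frames, prompt,
                  num_inference_steps, generator=None):
    """Apply the VAE memory settings, then time a single generation call."""
    # Slice and tile the VAE so decoding is less likely to run out of memory.
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()

    start = time.time()
    result = pipe(
        height=height,
        width=width,
        num_frames=num_frames,
        prompt=prompt,
        num_inference_steps=num_inference_steps,
        generator=generator,
        guidance_scale=6,       # classifier-free guidance strength
        use_dynamic_cfg=True,   # scheduled guidance instead of a constant scale
    )
    elapsed = time.time() - start
    # The first entry of `frames` holds the frames of the generated video.
    return result.frames[0], elapsed
```

On a GPU, peak memory for the run can additionally be read with `torch.cuda.max_memory_allocated()` after the call; that step is left out here so the sketch has no dependency on PyTorch.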

Usage Examples

# Run on 4 GPUs with ring attention degree 2 and CFG parallelism
# torchrun --nproc_per_node=4 tools/parallel_inference/parallel_inference_xdit.py \
#     --model THUDM/CogVideoX-5B \
#     --ulysses_degree 1 \
#     --ring_degree 2 \
#     --use_cfg_parallel \
#     --height 480 \
#     --width 720 \
#     --num_frames 49 \
#     --prompt "A small dog running on a beach at sunset."

# Run on 2 GPUs with Ulysses attention parallelism
# torchrun --nproc_per_node=2 tools/parallel_inference/parallel_inference_xdit.py \
#     --model THUDM/CogVideoX-5B \
#     --ulysses_degree 2 \
#     --ring_degree 1 \
#     --height 480 \
#     --width 720 \
#     --num_frames 9 \
#     --prompt "A city skyline at dusk with lights turning on."

Related Pages

Principle:Zai_org_CogVideo_Parallel_Video_Generation
