Implementation:Zai org CogVideo Parallel Inference xDiT

From Leeroopedia


Knowledge Sources
Domains Video_Generation, Distributed_Computing, Performance_Optimization
Last Updated 2026-02-10 00:00 GMT

Overview

Parallel Inference xDiT is a script that uses the xDiT (xFuser) distributed inference framework to accelerate CogVideoX video generation by spreading the work across multiple GPUs.

Description

This script leverages the xFuserCogVideoXPipeline from the xfuser library to distribute CogVideoX video generation across multiple GPUs using a combination of parallelism strategies:

  • Ulysses Attention Parallelism: Shards the input sequence across GPUs and uses all-to-all communication so that each GPU computes attention for a subset of heads. For this reason the number of attention heads (30 for CogVideoX) must be divisible by the Ulysses degree.
  • Ring Attention: Distributes attention computation in a ring topology for memory-efficient long-sequence processing.
  • Tensor Parallelism: Splits model weight tensors across GPUs.
  • CFG Parallelism: Distributes classifier-free guidance computations (conditional and unconditional) across separate GPUs.
  • PipeFusion Parallelism: Patch-level pipeline parallelism for the diffusion denoising process.

The script validates parallelism configuration (checking that ulysses_degree divides the number of attention heads), enables VAE slicing and tiling to prevent out-of-memory errors, and runs inference with dynamic CFG at guidance scale 6. After generation, it reports elapsed time and peak GPU memory, and saves the output video with the parallelism configuration encoded in the filename.
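The head-count check described above can be sketched in a few lines. This is a minimal illustration, not the script's own code: the helper names `valid_ulysses_degrees` and `check_ulysses_degree` are hypothetical, and only the head count of 30 comes from this page.

```python
from typing import List

NUM_ATTENTION_HEADS = 30  # CogVideoX head count as stated on this page


def valid_ulysses_degrees(num_heads: int = NUM_ATTENTION_HEADS) -> List[int]:
    """Return every degree that evenly divides the head count."""
    return [d for d in range(1, num_heads + 1) if num_heads % d == 0]


def check_ulysses_degree(ulysses_degree: int,
                         num_heads: int = NUM_ATTENTION_HEADS) -> None:
    """Raise ValueError if the heads cannot be split evenly (hypothetical helper)."""
    if ulysses_degree < 1 or num_heads % ulysses_degree != 0:
        raise ValueError(
            f"ulysses_degree ({ulysses_degree}) must divide "
            f"the number of attention heads ({num_heads})"
        )


if __name__ == "__main__":
    print(valid_ulysses_degrees())
    check_ulysses_degree(2)  # passes silently: 30 is divisible by 2
```

Under this check, a Ulysses degree of 4 or 8 would be rejected even on a 4- or 8-GPU node, so ring attention or CFG parallelism would have to supply the remaining factor.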

Usage

Use this script when CogVideoX inference on a single GPU is too slow or runs out of memory. It distributes computation across 2, 4, or more GPUs to reduce per-video generation time and per-GPU memory pressure.

Code Reference

Source Location

  • Repository: Zai_org_CogVideo
  • File: tools/parallel_inference/parallel_inference_xdit.py

Entry Point

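# Excerpt from the start of main(); imports are shown for context.
import torch
from xfuser import xFuserArgs, xFuserCogVideoXPipeline
from xfuser.config import FlexibleArgumentParser
from xfuser.core.distributed import get_world_group


def main():
    # Parse the xFuser CLI flags and build the engine and input configs.
    parser = FlexibleArgumentParser(description="xFuser Arguments")
    args = xFuserArgs.add_cli_args(parser).parse_args()
    engine_args = xFuserArgs.from_cli_args(args)
    engine_config, input_config = engine_args.create_config()
    # Rank of this process on the local node.
    local_rank = get_world_group().local_rank
    # Load the CogVideoX pipeline in bfloat16 with the parallel engine config.
    pipe = xFuserCogVideoXPipeline.from_pretrained(
        pretrained_model_name_or_path=engine_config.model_config.model,
        engine_config=engine_config,
        torch_dtype=torch.bfloat16,
    )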

Import

# This is a distributed script; run via torchrun:
# torchrun --nproc_per_node=4 tools/parallel_inference/parallel_inference_xdit.py \
#     --model <model-path> --ulysses_degree 1 --ring_degree 2 --use_cfg_parallel

I/O Contract

Inputs (via xFuserArgs CLI)

Name | Type | Description
--model | str | Path to the CogVideoX pretrained model.
--ulysses_degree | int | Degree of Ulysses attention parallelism. Must divide 30 (number of attention heads).
--ring_degree | int | Degree of ring attention parallelism.
--use_cfg_parallel | flag | Enable CFG parallelism across GPUs.
--height | int | Video height in pixels.
--width | int | Video width in pixels.
--num_frames | int | Number of video frames to generate.
--prompt | str | Text prompt for video generation.
--num_inference_steps | int | Number of diffusion denoising steps.
--seed | int | Random seed for reproducibility.
--enable_sequential_cpu_offload | flag | Offload model layers to CPU sequentially to save GPU memory.
--tensor_parallel_degree | int | Degree of tensor parallelism.
--pipefusion_parallel_degree | int | Degree of PipeFusion parallelism.
--data_parallel_degree | int | Degree of data parallelism.

Outputs

Name | Type | Description
Video file | MP4 | Saved to results/cogvideox_{parallel_config}_{resolution}.mp4 at 8 FPS.
Console output | text | Elapsed time (seconds) and peak GPU memory (GB).
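To illustrate how a parallelism configuration might be encoded in the output path, here is a small sketch. Only the overall pattern results/cogvideox_{parallel_config}_{resolution}.mp4 comes from this page; the helper name `output_filename`, the `name + value` tag layout, and the `WIDTHxHEIGHT` resolution format are assumptions made for the example.

```python
from typing import Dict


def output_filename(degrees: Dict[str, int], width: int, height: int,
                    out_dir: str = "results") -> str:
    """Build an output path that records the parallel setup (hypothetical helper)."""
    # Assumed tag layout: each degree becomes "<name><value>", joined by "_".
    parallel_config = "_".join(f"{name}{value}" for name, value in degrees.items())
    # Assumed resolution layout: WIDTHxHEIGHT.
    resolution = f"{width}x{height}"
    return f"{out_dir}/cogvideox_{parallel_config}_{resolution}.mp4"


if __name__ == "__main__":
    print(output_filename({"ulysses": 1, "ring": 2, "cfg": 2}, width=720, height=480))
```

Encoding the configuration in the filename keeps outputs from different parallel setups from overwriting each other when benchmarking.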

Pipeline Configuration

Parameter | Value | Description
Precision | torch.bfloat16 | Reduced precision for memory efficiency.
Guidance scale | 6 | Classifier-free guidance strength.
Dynamic CFG | Enabled | Cosine-scheduled guidance for better quality.
VAE slicing | Enabled | Processes VAE input in slices to reduce memory.
VAE tiling | Enabled | Processes VAE input in tiles to reduce memory.
Attention heads | 30 | CogVideoX transformer head count (constrains Ulysses degree).
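The "cosine-scheduled guidance" entry can be made concrete with a small sketch. The exact schedule is defined by the underlying pipeline and is not reproduced on this page, so the function below is an assumption: it ramps the effective scale along a cosine curve from 1 up to 1 + guidance_scale as denoising progresses, and the `power` exponent is an illustrative parameter.

```python
import math


def dynamic_guidance_scale(guidance_scale: float, progress: float,
                           power: float = 5.0) -> float:
    """Cosine ramp of the effective CFG scale (illustrative, not the pipeline's code).

    progress is the fraction of denoising completed, in [0, 1].
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    ramp = (1.0 - math.cos(math.pi * progress ** power)) / 2.0  # 0 at start, 1 at end
    return 1.0 + guidance_scale * ramp


if __name__ == "__main__":
    for step in range(0, 11, 2):
        p = step / 10
        print(f"progress={p:.1f} scale={dynamic_guidance_scale(6.0, p):.3f}")
```

With a schedule of this shape, early steps run with guidance close to 1 and the strongest guidance is applied near the end of denoising.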

Usage Examples

# Run on 4 GPUs with ring attention degree 2 and CFG parallelism
# torchrun --nproc_per_node=4 tools/parallel_inference/parallel_inference_xdit.py \
#     --model THUDM/CogVideoX-5B \
#     --ulysses_degree 1 \
#     --ring_degree 2 \
#     --use_cfg_parallel \
#     --height 480 \
#     --width 720 \
#     --num_frames 49 \
#     --prompt "A small dog running on a beach at sunset."

# Run on 2 GPUs with Ulysses attention parallelism
# torchrun --nproc_per_node=2 tools/parallel_inference/parallel_inference_xdit.py \
#     --model THUDM/CogVideoX-5B \
#     --ulysses_degree 2 \
#     --ring_degree 1 \
#     --height 480 \
#     --width 720 \
#     --num_frames 9 \
#     --prompt "A city skyline at dusk with lights turning on."
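When composing such commands programmatically, it helps to derive the process count from the parallel degrees. The sketch below assumes that the number of processes equals the product of all parallel degrees, with CFG parallelism counting as a factor of 2; this matches both commands above (1 × 2 × 2 = 4 and 2 × 1 = 2) but should be confirmed against the xDiT documentation. The helper names are hypothetical.

```python
import shlex
from typing import List

SCRIPT = "tools/parallel_inference/parallel_inference_xdit.py"


def required_world_size(ulysses_degree: int = 1, ring_degree: int = 1,
                        use_cfg_parallel: bool = False,
                        tensor_parallel_degree: int = 1,
                        pipefusion_parallel_degree: int = 1,
                        data_parallel_degree: int = 1) -> int:
    """Assumed rule: world size is the product of all parallel degrees."""
    cfg_degree = 2 if use_cfg_parallel else 1
    return (data_parallel_degree * cfg_degree * ulysses_degree * ring_degree
            * tensor_parallel_degree * pipefusion_parallel_degree)


def build_torchrun_command(model: str, ulysses_degree: int = 1,
                           ring_degree: int = 1,
                           use_cfg_parallel: bool = False) -> List[str]:
    """Assemble a torchrun argv using only flags listed in the Inputs table."""
    nproc = required_world_size(ulysses_degree, ring_degree, use_cfg_parallel)
    cmd = ["torchrun", f"--nproc_per_node={nproc}", SCRIPT,
           "--model", model,
           "--ulysses_degree", str(ulysses_degree),
           "--ring_degree", str(ring_degree)]
    if use_cfg_parallel:
        cmd.append("--use_cfg_parallel")
    return cmd


if __name__ == "__main__":
    print(shlex.join(build_torchrun_command("THUDM/CogVideoX-5B",
                                            ring_degree=2, use_cfg_parallel=True)))
```

Deriving `--nproc_per_node` this way avoids launching with a process count that does not match the requested degrees.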
