Workflow:Zai org CogVideo Diffusers Text to Video Inference

Knowledge Sources	CogVideo, HuggingFace Diffusers, CogVideoX Models
Domains	Video_Generation, Inference, Text_to_Video
Last Updated	2026-02-10 12:00 GMT

Overview

End-to-end process for generating videos from text prompts using pre-trained CogVideoX models via the HuggingFace Diffusers pipeline.

Description

This workflow covers the complete procedure for text-to-video generation using CogVideoX. It loads a pre-trained CogVideoX model through the Diffusers library, configures the diffusion scheduler, applies memory optimizations (CPU offloading, VAE tiling/slicing), runs the denoising diffusion process conditioned on a text prompt, and exports the generated frames as an MP4 video. The workflow supports all CogVideoX T2V model variants (2B, 5B, 1.5-5B) and optionally loads LoRA adapter weights for fine-tuned generation.

Usage

Execute this workflow when you have a text description and want to generate a corresponding video. This is the primary inference pathway for CogVideoX and requires a pre-trained model (downloaded from HuggingFace Hub) and a GPU with sufficient VRAM (minimum ~6GB with sequential CPU offloading, ~16GB with model-level offloading).

Execution Steps

Step 1: Model Selection and Loading

Select the appropriate CogVideoX model variant based on quality requirements and available hardware. Load the pre-trained CogVideoXPipeline from HuggingFace Hub or a local path. The pipeline loads three components: the CogVideoX transformer (the denoising backbone), the T5-XXL text encoder, and the 3D VAE decoder. All weights are loaded in bfloat16 precision by default.

Key considerations:

CogVideoX-2B: Smallest and fastest variant; supports fp16, which is the precision recommended for this model
CogVideoX-5B: Higher quality, requires more VRAM
CogVideoX1.5-5B: Latest variant, higher resolution (768x1360), 81 frames at 16fps
Default resolution is automatically selected based on model variant
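
The selection logic in this step can be sketched as a small lookup keyed on the last path segment of the model ID. This is a minimal sketch under stated assumptions, not the repository's implementation: `VariantDefaults`, `variant_defaults`, and `load_pipeline` are hypothetical names, and the 480x720 / 49-frame / 8 fps defaults for the 2B and 5B models, as well as the float16 choice for 2B, are assumptions that this page does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantDefaults:
    height: int
    width: int
    num_frames: int
    fps: int
    dtype_name: str  # name of a torch dtype, resolved only when loading


# Keys are lower-cased model names; values are assumed defaults (see lead-in).
VARIANTS = {
    "cogvideox-2b": VariantDefaults(480, 720, 49, 8, "float16"),
    "cogvideox-5b": VariantDefaults(480, 720, 49, 8, "bfloat16"),
    "cogvideox1.5-5b": VariantDefaults(768, 1360, 81, 16, "bfloat16"),
}


def variant_defaults(model_path: str) -> VariantDefaults:
    """Look up defaults from the last segment of a Hub ID or local path."""
    name = model_path.rstrip("/").split("/")[-1].lower()
    try:
        return VARIANTS[name]
    except KeyError:
        raise ValueError(f"Unknown CogVideoX T2V variant: {name!r}") from None


def load_pipeline(model_path: str):
    """Load CogVideoXPipeline in the precision chosen for the variant."""
    # Heavy imports are deferred so the lookup works without torch/diffusers.
    import torch
    from diffusers import CogVideoXPipeline

    dtype = getattr(torch, variant_defaults(model_path).dtype_name)
    return CogVideoXPipeline.from_pretrained(model_path, torch_dtype=dtype)
```

For example, `variant_defaults("THUDM/CogVideoX1.5-5B")` would return the 768x1360, 81-frame, 16 fps entry; the Hub ID shown here is an assumed example and may differ from the ID you actually use.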

Step 2: Optional LoRA Weight Loading

If using a fine-tuned model, load LoRA adapter weights into the pipeline. The LoRA weights are loaded from a safetensors file and fused into the transformer component. This step is skipped for base model inference.

Key considerations:

LoRA weights are loaded via `load_lora_weights` with a specified adapter name
Fusion via `fuse_lora` merges adapters into the base weights for faster inference
LoRA scale can be adjusted to control the strength of the adaptation
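
A minimal sketch of this optional step is shown here. The helper name `apply_lora`, the default adapter name, and the default weight file name are hypothetical; `load_lora_weights` and `fuse_lora` are the pipeline methods named above, and the keyword arguments shown are assumed to match a recent Diffusers release.

```python
def apply_lora(
    pipe,
    lora_path,
    adapter_name="cogvideox-lora",  # hypothetical adapter name
    lora_scale=1.0,
    weight_name="pytorch_lora_weights.safetensors",  # assumed file name
):
    """Load and fuse a LoRA adapter; return False when no path is given."""
    if not lora_path:
        # Base-model inference: skip this step entirely.
        return False
    pipe.load_lora_weights(
        lora_path, weight_name=weight_name, adapter_name=adapter_name
    )
    # Merge the adapter into the transformer weights; lora_scale controls
    # how strongly the adaptation is applied.
    pipe.fuse_lora(components=["transformer"], lora_scale=lora_scale)
    return True
```

Because the helper only calls methods on the object it receives, it can be applied to any loaded `CogVideoXPipeline` before the scheduler and memory settings are configured.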

Step 3: Scheduler Configuration

Configure the diffusion noise scheduler that controls the denoising process. The default scheduler is CogVideoXDPMScheduler for 5B models and CogVideoXDDIMScheduler for 2B models. Both use trailing timestep spacing for improved generation quality.

Key considerations:

DPM scheduler is recommended for CogVideoX-5B and 1.5-5B
DDIM scheduler is recommended for CogVideoX-2B
Timestep spacing should be set to "trailing"
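
The scheduler choice can be sketched as follows. The function names are hypothetical, and deciding by a `-2b` suffix on the model name is an assumption made for illustration; the scheduler classes and the `timestep_spacing="trailing"` override are the ones described in this step.

```python
def scheduler_class_name(model_path: str) -> str:
    """Pick DDIM for the 2B model and DPM for the 5B / 1.5-5B models."""
    name = model_path.rstrip("/").split("/")[-1].lower()
    if name.endswith("-2b"):
        return "CogVideoXDDIMScheduler"
    return "CogVideoXDPMScheduler"


def configure_scheduler(pipe, model_path: str):
    """Replace the pipeline scheduler, keeping its config but forcing
    trailing timestep spacing."""
    import diffusers  # deferred so the name lookup works without diffusers

    scheduler_cls = getattr(diffusers, scheduler_class_name(model_path))
    pipe.scheduler = scheduler_cls.from_config(
        pipe.scheduler.config, timestep_spacing="trailing"
    )
    return pipe
```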

Step 4: Memory Optimization

Apply memory-saving techniques to enable inference on consumer GPUs. Sequential CPU offloading keeps the weights on the CPU and moves individual submodules to the GPU only while they execute; model-level offloading moves one whole component (text encoder, transformer, or VAE) at a time. VAE slicing decodes a batch of latents one sample at a time rather than all at once. VAE tiling splits large latent tensors into overlapping tiles for decoding.

Key considerations:

`enable_sequential_cpu_offload()` minimizes VRAM usage but is slower
`enable_model_cpu_offload()` is faster but uses more VRAM
`vae.enable_slicing()` reduces peak memory during VAE decoding
`vae.enable_tiling()` enables generation at higher resolutions
For multi-GPU setups, use `device_map="balanced"` instead of CPU offloading
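
The options above can be combined in one helper, sketched here under the assumption that `pipe` is an already loaded `CogVideoXPipeline`. The function name and the `offload` mode strings are hypothetical; the four `enable_*` methods are the ones listed in this step.

```python
def apply_memory_optimizations(
    pipe, offload="sequential", vae_slicing=True, vae_tiling=True
):
    """Apply one offloading mode plus optional VAE slicing and tiling."""
    if offload == "sequential":
        # Lowest VRAM use, slowest inference.
        pipe.enable_sequential_cpu_offload()
    elif offload == "model":
        # Faster, but needs more VRAM.
        pipe.enable_model_cpu_offload()
    elif offload == "none":
        # For example, a multi-GPU pipeline loaded with device_map="balanced".
        pass
    else:
        raise ValueError(f"Unknown offload mode: {offload!r}")

    if vae_slicing:
        pipe.vae.enable_slicing()
    if vae_tiling:
        pipe.vae.enable_tiling()
    return pipe
```

Only one offloading mode is applied per call, which reflects the trade-off described above between VRAM use and speed.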

Step 5: Video Generation

Run the diffusion pipeline to generate video frames. The pipeline encodes the text prompt using T5-XXL, initializes random noise in latent space, iteratively denoises the latent representation using classifier-free guidance, and decodes the final latents to video frames through the 3D VAE.

Key considerations:

`num_inference_steps` controls quality vs speed tradeoff (default 50)
`guidance_scale` controls prompt adherence (default 6.0)
`use_dynamic_cfg=True` applies dynamic classifier-free guidance for DPM scheduler
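`num_frames` determines output length (49 frames for about 6 s at 8 fps, 81 frames for about 5 s at 16 fps)
A fixed seed, passed to the pipeline through a seeded `generator`, makes results reproducible on the same hardware and software stack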
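
The parameters above might be assembled as in the following sketch. The helper names are hypothetical; the keyword arguments are assumed to match `CogVideoXPipeline.__call__` in a recent Diffusers release. The duration formula is one reading of the frame counts quoted above, treating the first frame as the starting frame, and is not taken from the source.

```python
def clip_seconds(num_frames: int, fps: int) -> float:
    """Approximate clip length: frames after the first, divided by fps."""
    return (num_frames - 1) / fps


def build_call_kwargs(
    prompt,
    height,
    width,
    num_frames,
    scheduler_name,
    num_inference_steps=50,
    guidance_scale=6.0,
):
    """Collect pipeline call arguments using the defaults listed above."""
    return {
        "prompt": prompt,
        "height": height,
        "width": width,
        "num_frames": num_frames,
        "num_inference_steps": num_inference_steps,
        "guidance_scale": guidance_scale,
        # Dynamic CFG is enabled only together with the DPM scheduler.
        "use_dynamic_cfg": scheduler_name == "CogVideoXDPMScheduler",
        "num_videos_per_prompt": 1,
    }


def generate(pipe, seed=42, **call_kwargs):
    """Run the pipeline with a seeded generator; return one video's frames."""
    import torch  # deferred so the helpers above work without torch

    generator = torch.Generator().manual_seed(seed)
    return pipe(generator=generator, **call_kwargs).frames[0]
```

A call such as `generate(pipe, seed=42, **build_call_kwargs("a panda playing guitar", 480, 720, 49, "CogVideoXDPMScheduler"))` would then return the frames of a single video; the prompt is only an illustration.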

Step 6: Video Export

Convert the generated frames to an MP4 video file. The frames are exported at the configured frame rate (16 fps by default for CogVideoX1.5, 8 fps for CogVideoX).

Key considerations:

Output format is MP4 via the `export_to_video` utility
Frame rate should match the model's training configuration
Generated resolution depends on the model variant used
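
The export step can be sketched with the `export_to_video` utility from `diffusers.utils`. The wrapper name `export_frames` and its injectable `exporter` argument are hypothetical conveniences added for this illustration, so that the logic can be exercised without Diffusers installed.

```python
def export_frames(frames, output_path, fps, exporter=None):
    """Write frames to an MP4 file at the given frame rate."""
    if not output_path.lower().endswith(".mp4"):
        raise ValueError("output_path should end in .mp4")
    if exporter is None:
        # Default to the Diffusers utility named in this step.
        from diffusers.utils import export_to_video

        exporter = export_to_video
    return exporter(frames, output_path, fps=fps)
```

The `fps` value should be the one the model variant was trained with, for example 8 for CogVideoX and 16 for CogVideoX1.5 as noted above.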
