Workflow:Zai org CogVideo Diffusers Text to Video Inference
| Knowledge Sources | |
|---|---|
| Domains | Video_Generation, Inference, Text_to_Video |
| Last Updated | 2026-02-10 12:00 GMT |
Overview
End-to-end process for generating videos from text prompts using pre-trained CogVideoX models via the HuggingFace Diffusers pipeline.
Description
This workflow covers the complete procedure for text-to-video generation using CogVideoX. It loads a pre-trained CogVideoX model through the Diffusers library, configures the diffusion scheduler, applies memory optimizations (CPU offloading, VAE tiling/slicing), runs the denoising diffusion process conditioned on a text prompt, and exports the generated frames as an MP4 video. The workflow supports all CogVideoX T2V model variants (2B, 5B, 1.5-5B) and optionally loads LoRA adapter weights for fine-tuned generation.
Usage
Execute this workflow when you have a text description and want to generate a corresponding video. This is the primary inference pathway for CogVideoX and requires a pre-trained model (downloaded from HuggingFace Hub) and a GPU with sufficient VRAM (minimum ~6GB with sequential CPU offloading, ~16GB with model-level offloading).
Execution Steps
Step 1: Model Selection and Loading
Select the appropriate CogVideoX model variant based on quality requirements and available hardware. Load the pre-trained CogVideoXPipeline from HuggingFace Hub or a local path. The pipeline's three main model components are the CogVideoX transformer (the denoising backbone), the T5-XXL text encoder, and the 3D VAE; a tokenizer and a scheduler are loaded alongside them. This workflow loads all weights in bfloat16 precision by default.
Key considerations:
- CogVideoX-2B: Smallest, fastest, supports fp16
- CogVideoX-5B: Higher quality, requires more VRAM
- CogVideoX1.5-5B: Latest variant, higher resolution (768x1360), 81 frames at 16fps
- Default resolution is automatically selected based on model variant
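A minimal sketch of variant selection and loading, assuming the Diffusers `CogVideoXPipeline.from_pretrained` entry point. The helper names `pick_variant` and `load_pipeline`, the lookup-by-folder-name convention, and the 480x720 resolution for the 2B and 5B variants are assumptions made for illustration; only the 1.5-5B resolution and the frame counts and frame rates are taken from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantDefaults:
    height: int
    width: int
    num_frames: int
    fps: int


# Per-variant defaults. The 768x1360 / 81-frame / 16 fps values and the
# 49-frame / 8 fps values come from this page; 480x720 is an assumption.
VARIANT_DEFAULTS = {
    "cogvideox1.5-5b": VariantDefaults(768, 1360, 81, 16),
    "cogvideox-5b": VariantDefaults(480, 720, 49, 8),
    "cogvideox-2b": VariantDefaults(480, 720, 49, 8),
}


def pick_variant(model_path: str) -> VariantDefaults:
    """Look up defaults from the last component of a Hub ID or local path."""
    name = model_path.rstrip("/").split("/")[-1].lower()
    if name not in VARIANT_DEFAULTS:
        raise ValueError(f"Unknown CogVideoX T2V variant: {name}")
    return VARIANT_DEFAULTS[name]


def load_pipeline(model_path: str, dtype_name: str = "bfloat16"):
    """Load the pipeline in the requested precision.

    torch and diffusers are imported here so that pick_variant stays usable
    on machines without the GPU stack installed.
    """
    import torch
    from diffusers import CogVideoXPipeline

    return CogVideoXPipeline.from_pretrained(
        model_path, torch_dtype=getattr(torch, dtype_name)
    )
```

Passing `dtype_name="float16"` would be one way to load the 2B variant in fp16, if that precision is preferred.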
Step 2: Optional LoRA Weight Loading
If using a fine-tuned model, load LoRA adapter weights into the pipeline. The LoRA weights are loaded from a safetensors file and fused into the transformer component. This step is skipped for base model inference.
Key considerations:
- LoRA weights are loaded via `load_lora_weights` with a specified adapter name
- Fusion via `fuse_lora` merges adapters into the base weights for faster inference
- LoRA scale can be adjusted to control the strength of the adaptation
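The LoRA step might look roughly like the sketch below. `apply_lora` is a hypothetical helper, and the weight file name and adapter name are assumed placeholders; `load_lora_weights` and `fuse_lora` are the pipeline methods named above.

```python
def apply_lora(pipe, lora_dir: str, adapter_name: str = "example_adapter",
               lora_scale: float = 1.0):
    """Load LoRA weights into the pipeline and fuse them into the transformer."""
    pipe.load_lora_weights(
        lora_dir,
        weight_name="pytorch_lora_weights.safetensors",  # assumed file name
        adapter_name=adapter_name,
    )
    # A lora_scale below 1.0 weakens the adaptation; 1.0 applies it fully.
    pipe.fuse_lora(components=["transformer"], lora_scale=lora_scale)
    return pipe
```

For base-model inference this helper is simply not called.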
Step 3: Scheduler Configuration
Configure the diffusion noise scheduler that controls the denoising process. The default scheduler is CogVideoXDPMScheduler for 5B models and CogVideoXDDIMScheduler for 2B models. Both use trailing timestep spacing for improved generation quality.
Key considerations:
- DPM scheduler is recommended for CogVideoX-5B and 1.5-5B
- DDIM scheduler is recommended for CogVideoX-2B
- Timestep spacing should be set to "trailing"
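As an illustration of the scheduler choice, the sketch below picks a scheduler class by variant name and rebuilds it from the pipeline's existing scheduler config with trailing spacing. The helper names and the name-suffix check are assumptions; the two scheduler classes are the ones named above.

```python
def scheduler_class_name(model_path: str) -> str:
    """Return the recommended scheduler class name for a model variant."""
    name = model_path.rstrip("/").split("/")[-1].lower()
    # Assumed convention: 2B checkpoints end in "2b"; all others get DPM.
    return "CogVideoXDDIMScheduler" if name.endswith("2b") else "CogVideoXDPMScheduler"


def configure_scheduler(pipe, model_path: str):
    """Swap in the recommended scheduler, keeping the existing config."""
    import diffusers  # deferred so the name lookup works without diffusers

    scheduler_cls = getattr(diffusers, scheduler_class_name(model_path))
    pipe.scheduler = scheduler_cls.from_config(
        pipe.scheduler.config, timestep_spacing="trailing"
    )
    return pipe
```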
Step 4: Memory Optimization
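Apply memory-saving techniques to enable inference on consumer GPUs. Sequential CPU offloading keeps weights on the CPU and moves individual submodules to the GPU only while they run, which gives the lowest VRAM usage at the cost of speed. Model-level CPU offloading moves one whole component (text encoder, transformer, or VAE) to the GPU at a time. VAE slicing decodes a batch one sample at a time rather than all at once. VAE tiling splits large latent tensors into overlapping tiles for decoding.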
Key considerations:
- `enable_sequential_cpu_offload()` minimizes VRAM usage but is slower
- `enable_model_cpu_offload()` is faster but uses more VRAM
- `vae.enable_slicing()` reduces peak memory during VAE decoding
- `vae.enable_tiling()` enables generation at higher resolutions
- For multi-GPU setups, use `device_map="balanced"` instead of CPU offloading
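The single-GPU options above can be combined as in this sketch. `apply_memory_optimizations` and its `minimize_vram` flag are hypothetical names; the four method calls are the ones listed above.

```python
def apply_memory_optimizations(pipe, minimize_vram: bool = True):
    """Enable CPU offloading plus VAE slicing and tiling on a loaded pipeline."""
    if minimize_vram:
        # Lowest VRAM usage, slowest inference.
        pipe.enable_sequential_cpu_offload()
    else:
        # Faster, but needs more VRAM.
        pipe.enable_model_cpu_offload()

    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()
    return pipe
```

Only one of the two offloading modes is enabled at a time, and the pipeline is not moved to the GPU manually when either is in use. The multi-GPU `device_map="balanced"` route is set at load time instead and is not covered by this helper.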
Step 5: Video Generation
Run the diffusion pipeline to generate video frames. The pipeline encodes the text prompt using T5-XXL, initializes random noise in latent space, iteratively denoises the latent representation using classifier-free guidance, and decodes the final latents to video frames through the 3D VAE.
Key considerations:
- `num_inference_steps` controls quality vs speed tradeoff (default 50)
- `guidance_scale` controls prompt adherence (default 6.0)
- `use_dynamic_cfg=True` applies dynamic classifier-free guidance for DPM scheduler
- `num_frames` determines output length (49 frames for about 6s at 8fps, 81 frames for about 5s at 16fps)
- A fixed seed ensures reproducible results
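A sketch of the generation call with the parameters listed above, plus the frame-count arithmetic. `clip_seconds` and `generate_frames` are hypothetical helpers, and the seed value of 42 is an arbitrary placeholder; the keyword arguments mirror the bullets above.

```python
def clip_seconds(num_frames: int, fps: int) -> float:
    """Duration of the exported clip in seconds."""
    return num_frames / fps


def generate_frames(pipe, prompt: str, num_frames: int = 49, seed: int = 42,
                    num_inference_steps: int = 50, guidance_scale: float = 6.0,
                    use_dynamic_cfg: bool = True):
    """Run the pipeline once and return the frames of the first video."""
    import torch  # deferred so clip_seconds works without torch installed

    # A seeded generator makes the initial latent noise reproducible.
    generator = torch.Generator().manual_seed(seed)
    result = pipe(
        prompt=prompt,
        num_frames=num_frames,
        num_inference_steps=num_inference_steps,
        guidance_scale=guidance_scale,
        use_dynamic_cfg=use_dynamic_cfg,  # intended for the DPM scheduler
        generator=generator,
    )
    return result.frames[0]
```

With these definitions, `clip_seconds(49, 8)` evaluates to 6.125 and `clip_seconds(81, 16)` to 5.0625, which is where the rounded 6s and 5s figures come from.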
Step 6: Video Export
Convert the generated frames to an MP4 video file. The frames are exported at the configured frame rate (16 fps by default for CogVideoX1.5, 8 fps for the original CogVideoX models).
Key considerations:
- Output format is MP4 via the `export_to_video` utility
- Frame rate should match the model's training configuration
- Generated resolution depends on the model variant used
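Export could be sketched as follows, assuming the `export_to_video` helper from `diffusers.utils`. `export_fps` and `save_video` are hypothetical names, and detecting the 1.5 variant by its folder-name prefix is an assumption.

```python
def export_fps(model_path: str) -> int:
    """Pick the export frame rate: 16 fps for CogVideoX1.5, otherwise 8 fps."""
    name = model_path.rstrip("/").split("/")[-1].lower()
    return 16 if name.startswith("cogvideox1.5") else 8


def save_video(frames, output_path: str, model_path: str) -> str:
    """Write the frames to an MP4 file and return the output path."""
    from diffusers.utils import export_to_video  # deferred import

    return export_to_video(frames, output_path, fps=export_fps(model_path))
```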