Principle:Huggingface Diffusers Video Pipeline Selection

From Leeroopedia
Property | Value
Principle Name | Video Pipeline Selection
Overview | Selecting the appropriate video generation pipeline based on model architecture, balancing quality, speed, and resource constraints
Domains | Video Generation, Diffusion Models
Related Implementation | Huggingface_Diffusers_Video_Pipeline_From_Pretrained
Knowledge Sources | Repo (https://github.com/huggingface/diffusers), Source (src/diffusers/pipelines/wan/pipeline_wan.py, src/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video.py, src/diffusers/pipelines/cogvideo/pipeline_cogvideox.py)
Last Updated | 2026-02-13 00:00 GMT

Description

Video generation in Diffusers revolves around selecting a pipeline class matched to the underlying model architecture. Each pipeline encapsulates a complete text-to-video (or image-to-video) generation workflow including text encoding, latent preparation, iterative denoising, and video decoding. The key architectural families are:

  1. Wan (WanPipeline) - Uses a WanTransformer3DModel with 3D patch embeddings, rotary position embeddings, and a UMT5 text encoder. Supports 1.3B and 14B parameter variants, two-stage denoising via optional transformer_2, and FlowMatchEulerDiscreteScheduler.
  2. HunyuanVideo (HunyuanVideoPipeline) - Uses a HunyuanVideoTransformer3DModel with dual text encoders (Llama + CLIP), embedded guidance scale, and FlowMatchEulerDiscreteScheduler. Defaults to 720x1280 output at 129 frames.
  3. CogVideoX (CogVideoXPipeline) - Uses a CogVideoXTransformer3DModel with a T5 text encoder, 3D rotary positional embeddings (applied when the transformer config enables use_rotary_positional_embeddings), and CogVideoX-specific schedulers (DDIM/DPM). Supports 2B and 5B variants.
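
The three families above can be summarized as a small lookup table. The following is a minimal sketch rather than anything shipped with Diffusers: the VideoFamily dataclass, the FAMILIES dictionary, and the helper function are hypothetical names introduced for illustration, and only the class and encoder names are taken from the list above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class VideoFamily:
    """Hypothetical record describing one architecture family."""
    pipeline: str                    # pipeline class name in Diffusers
    transformer: str                 # denoising transformer class name
    text_encoders: Tuple[str, ...]   # text encoder(s) named in the list above


# Values copied from the family list; the dictionary itself is illustrative.
FAMILIES = {
    "Wan": VideoFamily("WanPipeline", "WanTransformer3DModel", ("UMT5",)),
    "HunyuanVideo": VideoFamily(
        "HunyuanVideoPipeline", "HunyuanVideoTransformer3DModel", ("Llama", "CLIP")
    ),
    "CogVideoX": VideoFamily("CogVideoXPipeline", "CogVideoXTransformer3DModel", ("T5",)),
}


def families_with_dual_text_encoders() -> List[str]:
    """Return the families that condition on more than one text encoder."""
    return [name for name, fam in FAMILIES.items() if len(fam.text_encoders) > 1]
```

Keeping this information as data makes it straightforward to filter families by a property before any model weights are downloaded.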

Theoretical Basis

Video Diffusion Architectures

Video diffusion models extend image diffusion by operating on 5D latent tensors with shape (B, C, F, H, W) where F is the number of latent frames. The core differences between architectures lie in:

  • 3D Attention Mechanisms - All video transformers process spatial and temporal dimensions jointly. Wan uses factored rotary embeddings across (t, h, w) dimensions. CogVideoX uses 3D rotary position embeddings with crop-aware grid computation. HunyuanVideo uses full 3D attention with embedded guidance.
  • Text Conditioning Strategy - Wan uses UMT5-XXL for text encoding with cross-attention. HunyuanVideo uses a dual-encoder approach (Llama for rich semantics + CLIP for pooled embeddings). CogVideoX uses T5 with classifier-free guidance via negative prompt concatenation.
  • Temporal Compression - Each VAE compresses the frame axis, and all three families use a temporal compression ratio of 4. Wan uses scale_factor_temporal=4, so 81 frames become (81 - 1) / 4 + 1 = 21 latent frames. HunyuanVideo and CogVideoX apply the same ratio.
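  • Guidance Mechanism - Wan and CogVideoX use classifier-free guidance, which evaluates the transformer on both the conditional and the unconditional input at every step (CogVideoX batches the two by concatenating negative and positive prompt embeddings). HunyuanVideo uses embedded guidance (a single forward pass with a guidance embedding), plus optional true CFG via true_cfg_scale.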
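
The temporal compression and guidance bullets above reduce to two short formulas. The sketch below is illustrative only: the function names are hypothetical, the frame formula is inferred from the 81 -> 21 example under the assumption that the first frame is kept and the remaining frames are compressed by the ratio, and the guidance function is the standard classifier-free guidance combination applied to plain Python lists instead of tensors.

```python
from typing import List


def latent_frame_count(num_frames: int, temporal_ratio: int = 4) -> int:
    """Pixel-space frames -> latent frames.

    Assumes the first frame is kept and the rest are compressed by
    temporal_ratio, which matches the 81 -> 21 example for Wan.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    return (num_frames - 1) // temporal_ratio + 1


def classifier_free_guidance(
    noise_uncond: List[float], noise_cond: List[float], guidance_scale: float
) -> List[float]:
    """Combine unconditional and conditional predictions element-wise."""
    if len(noise_uncond) != len(noise_cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]


print(latent_frame_count(81))  # Wan default frame count
```

With a guidance scale of 1.0 the combination returns the conditional prediction unchanged, which is why classifier-free guidance is usually treated as disabled at that value.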

Pipeline Selection Criteria

Criterion | Wan | HunyuanVideo | CogVideoX
Parameters | 1.3B / 14B | ~13B | 2B / 5B
Default Resolution | 480x832 | 720x1280 | Model-dependent
Default Frames | 81 | 129 | 49
Scheduler | FlowMatch / UniPC | FlowMatch | DDIM / DPM
Text Encoder | UMT5-XXL | Llama + CLIP | T5
Guidance Type | Classifier-free | Embedded + optional true CFG | Classifier-free
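
The default resolution and frame count rows can be turned into keyword arguments for a pipeline call. This is a hedged sketch: DEFAULTS and default_call_kwargs are hypothetical names, the resolutions are read as height x width (an assumption about the table's convention), and the model-dependent CogVideoX resolution is left unset so the pipeline can fall back to its own configuration.

```python
from typing import Dict, Optional

# Values copied from the table above; None marks "Model-dependent".
# Resolutions are interpreted as height x width (assumption).
DEFAULTS: Dict[str, Dict[str, Optional[int]]] = {
    "Wan": {"height": 480, "width": 832, "num_frames": 81},
    "HunyuanVideo": {"height": 720, "width": 1280, "num_frames": 129},
    "CogVideoX": {"height": None, "width": None, "num_frames": 49},
}


def default_call_kwargs(family: str) -> Dict[str, int]:
    """Return height/width/num_frames kwargs, skipping unspecified values."""
    if family not in DEFAULTS:
        raise ValueError(f"Unknown family {family!r}; expected one of {sorted(DEFAULTS)}")
    return {key: value for key, value in DEFAULTS[family].items() if value is not None}
```

A call such as pipe(prompt=..., **default_call_kwargs("Wan")) would then pass the table's defaults explicitly, which keeps the chosen resolution and frame count visible in the calling code.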

Usage

Use this principle when beginning a video generation workflow. The pipeline choice constrains the downstream decisions (text encoder, scheduler, guidance mechanism, default resolution, and frame count):

  1. Identify the target quality and resolution requirements
  2. Select the architecture family (Wan for multilingual, HunyuanVideo for high-resolution, CogVideoX for resource-constrained)
  3. Use from_pretrained on the corresponding pipeline class to load all components (transformer, VAE, text encoder, scheduler)
  4. Configure the scheduler to match the architecture (FlowMatch for Wan/HunyuanVideo, DDIM/DPM for CogVideoX)
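
Steps 2 and 3 can be sketched as a lookup followed by a from_pretrained call. This is one possible arrangement under stated assumptions, not the library's own selection logic: the priority labels, PIPELINE_BY_PRIORITY, select_pipeline_name, and load_pipeline are hypothetical names, the checkpoint identifier is supplied by the caller, and diffusers is imported lazily so that the selection logic works without it installed.

```python
import importlib

# Step 2 as a lookup; the priority labels are hypothetical names.
PIPELINE_BY_PRIORITY = {
    "multilingual": "WanPipeline",
    "high_resolution": "HunyuanVideoPipeline",
    "resource_constrained": "CogVideoXPipeline",
}


def select_pipeline_name(priority: str) -> str:
    """Map a workflow priority to the pipeline class name suggested above."""
    if priority not in PIPELINE_BY_PRIORITY:
        raise ValueError(
            f"Unknown priority {priority!r}; expected one of {sorted(PIPELINE_BY_PRIORITY)}"
        )
    return PIPELINE_BY_PRIORITY[priority]


def load_pipeline(priority: str, model_id: str, **from_pretrained_kwargs):
    """Step 3: load transformer, VAE, text encoder(s), and scheduler together.

    Requires the diffusers package; model_id is a checkpoint chosen by the caller.
    """
    diffusers = importlib.import_module("diffusers")  # lazy import
    pipeline_cls = getattr(diffusers, select_pipeline_name(priority))
    return pipeline_cls.from_pretrained(model_id, **from_pretrained_kwargs)
```

After loading, the scheduler that came with the checkpoint is available as the scheduler attribute of the returned pipeline, which is a natural place to verify step 4 before generating.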

Related Pages

Implementation:Huggingface_Diffusers_Video_Pipeline_From_Pretrained
