Principle:Huggingface Diffusers Video Pipeline Selection
| Property | Value |
|---|---|
| Principle Name | Video Pipeline Selection |
| Overview | Selecting the appropriate video generation pipeline based on model architecture, balancing quality, speed, and resource constraints |
| Domains | Video Generation, Diffusion Models |
| Related Implementation | Huggingface_Diffusers_Video_Pipeline_From_Pretrained |
| Knowledge Sources | Repo (https://github.com/huggingface/diffusers), Source (src/diffusers/pipelines/wan/pipeline_wan.py, src/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video.py, src/diffusers/pipelines/cogvideo/pipeline_cogvideox.py) |
| Last Updated | 2026-02-13 00:00 GMT |
Description
Video generation in Diffusers revolves around selecting a pipeline class matched to the underlying model architecture. Each pipeline encapsulates a complete text-to-video (or image-to-video) generation workflow including text encoding, latent preparation, iterative denoising, and video decoding. The key architectural families are:
- Wan (`WanPipeline`) - Uses a `WanTransformer3DModel` with 3D patch embeddings, rotary position embeddings, and a UMT5 text encoder. Supports 1.3B and 14B parameter variants, two-stage denoising via an optional `transformer_2`, and `FlowMatchEulerDiscreteScheduler`.
- HunyuanVideo (`HunyuanVideoPipeline`) - Uses a `HunyuanVideoTransformer3DModel` with dual text encoders (Llama + CLIP), embedded guidance scale, and `FlowMatchEulerDiscreteScheduler`. Generates 720p video at 129 frames.
- CogVideoX (`CogVideoXPipeline`) - Uses a `CogVideoXTransformer3DModel` with a T5 text encoder, 3D rotary positional embeddings, and CogVideoX-specific schedulers (DDIM/DPM). Supports 2B and 5B variants.
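The one-to-one mapping between architecture family and pipeline class can be restated as data. A minimal sketch: the class names are the real Diffusers classes, but the dictionary and lookup helper are illustrative, not part of the library.

```python
# Family -> Diffusers class names, restating the list above as data.
# The dict itself is illustrative; only the class names come from diffusers.
VIDEO_PIPELINES = {
    "wan": {
        "pipeline": "WanPipeline",
        "transformer": "WanTransformer3DModel",
        "text_encoder": "UMT5",
    },
    "hunyuan_video": {
        "pipeline": "HunyuanVideoPipeline",
        "transformer": "HunyuanVideoTransformer3DModel",
        "text_encoder": "Llama + CLIP",
    },
    "cogvideox": {
        "pipeline": "CogVideoXPipeline",
        "transformer": "CogVideoXTransformer3DModel",
        "text_encoder": "T5",
    },
}

def pipeline_class_name(family: str) -> str:
    """Look up the pipeline class name for an architecture family."""
    return VIDEO_PIPELINES[family]["pipeline"]
```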
Theoretical Basis
Video Diffusion Architectures
Video diffusion models extend image diffusion by operating on 5D latent tensors with shape (B, C, F, H, W) where F is the number of latent frames. The core differences between architectures lie in:
- 3D Attention Mechanisms - All video transformers process spatial and temporal dimensions jointly. Wan uses factored rotary embeddings across (t, h, w) dimensions. CogVideoX uses 3D rotary position embeddings with crop-aware grid computation. HunyuanVideo uses full 3D attention with embedded guidance.
- Text Conditioning Strategy - Wan uses UMT5-XXL for text encoding with cross-attention. HunyuanVideo uses a dual-encoder approach (Llama for rich semantics + CLIP for pooled embeddings). CogVideoX uses T5 with classifier-free guidance via negative prompt concatenation.
- Temporal Compression - Each VAE has a different temporal compression ratio. Wan uses `scale_factor_temporal=4` (81 frames -> 21 latent frames). HunyuanVideo and CogVideoX also use a temporal compression ratio of 4.
- Guidance Mechanism - Wan and CogVideoX use classifier-free guidance (two forward passes per step). HunyuanVideo uses embedded guidance (a single forward pass with a guidance embedding), plus optional true CFG via `true_cfg_scale`.
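The temporal compression above follows the usual causal-video-VAE convention: the first frame is kept uncompressed and the remaining frames are grouped in windows of the compression ratio, which reproduces the 81 -> 21 mapping quoted for Wan. A small sketch of that arithmetic:

```python
def latent_frames(num_frames: int, scale_factor_temporal: int = 4) -> int:
    """Pixel frames -> latent frames for a causal video VAE.

    The first frame stays uncompressed; every subsequent group of
    `scale_factor_temporal` frames collapses to one latent frame.
    """
    return (num_frames - 1) // scale_factor_temporal + 1

# Default frame counts from the table below:
latent_frames(81)   # Wan -> 21
latent_frames(129)  # HunyuanVideo -> 33
latent_frames(49)   # CogVideoX -> 13
```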
Pipeline Selection Criteria
| Criterion | Wan | HunyuanVideo | CogVideoX |
|---|---|---|---|
| Parameters | 1.3B / 14B | ~13B | 2B / 5B |
| Default Resolution | 480x832 | 720x1280 | Model-dependent |
| Default Frames | 81 | 129 | 49 |
| Scheduler | FlowMatch / UniPC | FlowMatch | DDIM / DPM |
| Text Encoder | UMT5-XXL | Llama + CLIP | T5 |
| Guidance Type | Classifier-free | Embedded + optional true CFG | Classifier-free |
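The guidance row of the table has a direct cost implication: classifier-free guidance doubles the number of transformer forward passes per denoising step, while embedded guidance does not. A rough, illustrative cost model (the step count is an example, not a library default):

```python
# Illustrative cost model, not a diffusers API: classifier-free guidance
# runs the transformer twice per step (conditional + unconditional),
# embedded guidance runs it once.
def transformer_calls(num_inference_steps: int, uses_cfg: bool) -> int:
    return num_inference_steps * (2 if uses_cfg else 1)

transformer_calls(50, uses_cfg=True)   # Wan / CogVideoX -> 100
transformer_calls(50, uses_cfg=False)  # HunyuanVideo embedded guidance -> 50
```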
Usage
Use this principle when beginning a video generation workflow. The pipeline choice determines all downstream decisions:
- Identify the target quality and resolution requirements
- Select the architecture family (Wan for multilingual, HunyuanVideo for high-resolution, CogVideoX for resource-constrained)
- Use `from_pretrained` on the corresponding pipeline class to load all components (transformer, VAE, text encoder, scheduler)
- Configure the scheduler to match the architecture (FlowMatch for Wan/HunyuanVideo, DDIM/DPM for CogVideoX)
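The steps above can be sketched for the Wan family as follows. `WanPipeline.from_pretrained` is the real Diffusers entry point; the checkpoint id, dtype, and prompt are assumptions to be checked against the model card.

```python
# Defaults taken from the selection table above; the repo id is an assumed
# Hub checkpoint name -- verify it against the model card before use.
WAN_DEFAULTS = {"height": 480, "width": 832, "num_frames": 81}

def load_wan_pipeline(model_id: str = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
                      device: str = "cuda"):
    # Heavy imports kept local so the module stays importable without them.
    import torch
    from diffusers import WanPipeline

    # from_pretrained loads transformer, VAE, text encoder, and scheduler
    # from the checkpoint's model_index.json in one call.
    pipe = WanPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    return pipe.to(device)

if __name__ == "__main__":
    pipe = load_wan_pipeline()
    video = pipe(prompt="A cat walking through snow", **WAN_DEFAULTS).frames[0]
```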
Related Pages
- Huggingface_Diffusers_Video_Pipeline_From_Pretrained (implements this principle) - Concrete API calls to instantiate video pipelines
- Huggingface_Diffusers_Video_Memory_Management (next step) - Memory optimization after pipeline selection
- Huggingface_Diffusers_Video_Input_Preparation (next step) - Preparing inputs for the selected pipeline
- Huggingface_Diffusers_Video_Denoising (core step) - The denoising process inside the pipeline
- Huggingface_Diffusers_Video_Decoding_Export (final step) - Decoding latents and exporting video
Implementation:Huggingface_Diffusers_Video_Pipeline_From_Pretrained