Principle:Huggingface Diffusers Video Pipeline Selection
| Property | Value |
|---|---|
| Principle Name | Video Pipeline Selection |
| Overview | Selecting the appropriate video generation pipeline based on model architecture, balancing quality, speed, and resource constraints |
| Domains | Video Generation, Diffusion Models |
| Related Implementation | Huggingface_Diffusers_Video_Pipeline_From_Pretrained |
| Knowledge Sources | Repo (https://github.com/huggingface/diffusers), Source (src/diffusers/pipelines/wan/pipeline_wan.py, src/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video.py, src/diffusers/pipelines/cogvideo/pipeline_cogvideox.py) |
| Last Updated | 2026-02-13 00:00 GMT |
Description
Video generation in Diffusers revolves around selecting a pipeline class matched to the underlying model architecture. Each pipeline encapsulates a complete text-to-video (or image-to-video) generation workflow including text encoding, latent preparation, iterative denoising, and video decoding. The key architectural families are:
- Wan (`WanPipeline`) - Uses a `WanTransformer3DModel` with 3D patch embeddings, rotary position embeddings, and a UMT5 text encoder. Supports 1.3B and 14B parameter variants, two-stage denoising via an optional `transformer_2`, and `FlowMatchEulerDiscreteScheduler`.
- HunyuanVideo (`HunyuanVideoPipeline`) - Uses a `HunyuanVideoTransformer3DModel` with dual text encoders (Llama + CLIP), embedded guidance scale, and `FlowMatchEulerDiscreteScheduler`. Generates 720p video at 129 frames.
- CogVideoX (`CogVideoXPipeline`) - Uses a `CogVideoXTransformer3DModel` with a T5 text encoder, 3D rotary positional embeddings, and CogVideoX-specific schedulers (DDIM/DPM). Supports 2B and 5B variants.
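The one-to-one mapping between architecture family and pipeline class can be restated as data. A minimal sketch: the class names are the real Diffusers classes, but the dictionary and lookup helper are illustrative, not part of the library.

```python
# Family -> Diffusers class names, restating the list above as data.
# The dict itself is illustrative; only the class names come from diffusers.
VIDEO_PIPELINES = {
    "wan": {
        "pipeline": "WanPipeline",
        "transformer": "WanTransformer3DModel",
        "text_encoder": "UMT5",
    },
    "hunyuan_video": {
        "pipeline": "HunyuanVideoPipeline",
        "transformer": "HunyuanVideoTransformer3DModel",
        "text_encoder": "Llama + CLIP",
    },
    "cogvideox": {
        "pipeline": "CogVideoXPipeline",
        "transformer": "CogVideoXTransformer3DModel",
        "text_encoder": "T5",
    },
}

def pipeline_class_name(family: str) -> str:
    """Look up the pipeline class name for an architecture family."""
    return VIDEO_PIPELINES[family]["pipeline"]
```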
Theoretical Basis
Video Diffusion Architectures
Video diffusion models extend image diffusion by operating on 5D latent tensors with shape (B, C, F, H, W) where F is the number of latent frames. The core differences between architectures lie in:
- 3D Attention Mechanisms - All video transformers process spatial and temporal dimensions jointly. Wan uses factored rotary embeddings across (t, h, w) dimensions. CogVideoX uses 3D rotary position embeddings with crop-aware grid computation. HunyuanVideo uses full 3D attention with embedded guidance.
- Text Conditioning Strategy - Wan uses UMT5-XXL for text encoding with cross-attention. HunyuanVideo uses a dual-encoder approach (Llama for rich semantics + CLIP for pooled embeddings). CogVideoX uses T5 with classifier-free guidance via negative prompt concatenation.
- Temporal Compression - Each VAE has a different temporal compression ratio. Wan uses `scale_factor_temporal=4` (81 frames -> 21 latent frames). HunyuanVideo and CogVideoX also use a temporal compression ratio of 4.
- Guidance Mechanism - Wan and CogVideoX use classifier-free guidance (two forward passes per step). HunyuanVideo uses embedded guidance (a single forward pass with a guidance embedding), plus optional true CFG via `true_cfg_scale`.
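The temporal compression above follows the usual causal-video-VAE convention: the first frame is kept uncompressed and the remaining frames are grouped in windows of the compression ratio, which reproduces the 81 -> 21 mapping quoted for Wan. A small sketch of that arithmetic:

```python
def latent_frames(num_frames: int, scale_factor_temporal: int = 4) -> int:
    """Pixel frames -> latent frames for a causal video VAE.

    The first frame stays uncompressed; every subsequent group of
    `scale_factor_temporal` frames collapses to one latent frame.
    """
    return (num_frames - 1) // scale_factor_temporal + 1

# Default frame counts from the table below:
latent_frames(81)   # Wan -> 21
latent_frames(129)  # HunyuanVideo -> 33
latent_frames(49)   # CogVideoX -> 13
```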
Pipeline Selection Criteria
| Criterion | Wan | HunyuanVideo | CogVideoX |
|---|---|---|---|
| Parameters | 1.3B / 14B | ~13B | 2B / 5B |
| Default Resolution | 480x832 | 720x1280 | Model-dependent |
| Default Frames | 81 | 129 | 49 |
| Scheduler | FlowMatch / UniPC | FlowMatch | DDIM / DPM |
| Text Encoder | UMT5-XXL | Llama + CLIP | T5 |
| Guidance Type | Classifier-free | Embedded + optional true CFG | Classifier-free |
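The guidance row of the table has a direct cost implication: classifier-free guidance doubles the number of transformer forward passes per denoising step, while embedded guidance does not. A rough, illustrative cost model (the step count is an example, not a library default):

```python
# Illustrative cost model, not a diffusers API: classifier-free guidance
# runs the transformer twice per step (conditional + unconditional),
# embedded guidance runs it once.
def transformer_calls(num_inference_steps: int, uses_cfg: bool) -> int:
    return num_inference_steps * (2 if uses_cfg else 1)

transformer_calls(50, uses_cfg=True)   # Wan / CogVideoX -> 100
transformer_calls(50, uses_cfg=False)  # HunyuanVideo embedded guidance -> 50
```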
Usage
Use this principle when beginning a video generation workflow. The pipeline choice determines all downstream decisions:
- Identify the target quality and resolution requirements
- Select the architecture family (Wan for multilingual, HunyuanVideo for high-resolution, CogVideoX for resource-constrained)
- Use `from_pretrained` on the corresponding pipeline class to load all components (transformer, VAE, text encoder, scheduler)
- Configure the scheduler to match the architecture (FlowMatch for Wan/HunyuanVideo, DDIM/DPM for CogVideoX)
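The steps above can be sketched for the Wan family as follows. `WanPipeline.from_pretrained` is the real Diffusers entry point; the checkpoint id, dtype, and prompt are assumptions to be checked against the model card.

```python
# Defaults taken from the selection table above; the repo id is an assumed
# Hub checkpoint name -- verify it against the model card before use.
WAN_DEFAULTS = {"height": 480, "width": 832, "num_frames": 81}

def load_wan_pipeline(model_id: str = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
                      device: str = "cuda"):
    # Heavy imports kept local so the module stays importable without them.
    import torch
    from diffusers import WanPipeline

    # from_pretrained loads transformer, VAE, text encoder, and scheduler
    # from the checkpoint's model_index.json in one call.
    pipe = WanPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    return pipe.to(device)

if __name__ == "__main__":
    pipe = load_wan_pipeline()
    video = pipe(prompt="A cat walking through snow", **WAN_DEFAULTS).frames[0]
```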
Related Pages
- Huggingface_Diffusers_Video_Pipeline_From_Pretrained (implements this principle) - Concrete API calls to instantiate video pipelines
- Huggingface_Diffusers_Video_Memory_Management (next step) - Memory optimization after pipeline selection
- Huggingface_Diffusers_Video_Input_Preparation (next step) - Preparing inputs for the selected pipeline
- Huggingface_Diffusers_Video_Denoising (core step) - The denoising process inside the pipeline
- Huggingface_Diffusers_Video_Decoding_Export (final step) - Decoding latents and exporting video
Implementation:Huggingface_Diffusers_Video_Pipeline_From_Pretrained