Principle:Huggingface Diffusers Video Denoising
| Property | Value |
|---|---|
| Principle Name | Video Denoising |
| Overview | The 3D denoising process for generating temporally coherent video frames using transformer-based architectures |
| Domains | Video Generation, Diffusion Models, 3D Transformers |
| Related Implementation | Huggingface_Diffusers_WanTransformer3DModel_Forward |
| Knowledge Sources | Repo (https://github.com/huggingface/diffusers), Source (src/diffusers/models/transformers/transformer_wan.py:L289-L430) |
| Last Updated | 2026-02-13 00:00 GMT |
Description
Video denoising is the core iterative process that transforms random noise into coherent video frames. Unlike image diffusion, which operates on 4D tensors (B, C, H, W), video diffusion works with 5D latents (B, C, F, H, W) and uses 3D transformers that jointly attend to spatial and temporal dimensions, ensuring temporal coherence between frames.
The denoising loop executes N steps (typically 30-50); each step, sketched in code below, performs:
- Prepare the noisy latent as model input
- Compute the noise prediction via the transformer's forward pass
- Apply classifier-free guidance (if enabled)
- Step the scheduler to compute the less-noisy latent
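A minimal sketch of this loop follows; the names (`transformer`, `scheduler`, `prompt_embeds`, ...) are placeholders rather than the exact pipeline internals:

```python
# Minimal sketch of the denoising loop; names are placeholders,
# not the exact diffusers pipeline internals.
for t in scheduler.timesteps:
    # 1. Prepare the noisy latent as model input
    latent_model_input = latents
    # 2. Predict noise via the transformer's forward pass
    noise_pred = transformer(latent_model_input, t, prompt_embeds)
    # 3. Apply classifier-free guidance (if enabled; see below)
    if guidance_scale > 1.0:
        noise_uncond = transformer(latent_model_input, t, negative_prompt_embeds)
        noise_pred = noise_uncond + guidance_scale * (noise_pred - noise_uncond)
    # 4. Step the scheduler to compute the less-noisy latent
    latents = scheduler.step(noise_pred, t, latents).prev_sample
```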
Theoretical Basis
3D Transformer Architecture (Wan)
The WanTransformer3DModel processes video latents through four stages:
1. Patch Embedding: The 5D input (B, C, F, H, W) is converted to a sequence of patch tokens via a 3D convolution with kernel and stride equal to patch_size = (1, 2, 2). This produces (B, inner_dim, F, H/2, W/2), which is flattened to (B, seq_len, inner_dim) (see the sketch after this list).
2. Condition Embedding: Three conditioning signals are computed:
- Timestep embedding (`temb`) - Sinusoidal encoding of the diffusion timestep, projected through an MLP to produce adaptive normalization parameters (6 scale/shift/gate values per block)
- Text embedding - UMT5 encoder hidden states projected via `PixArtAlphaTextProjection`
- Image embedding (optional, for I2V) - CLIP image features projected through `WanImageEmbedding`
3. Transformer Blocks: Each of the N blocks (40 for 14B) applies:
- Self-attention with rotary position embeddings (RoPE) factored across (t, h, w) dimensions
- Cross-attention to text (and optional image) embeddings
- Feed-forward network with GELU-approximate activation
- Adaptive normalization (AdaLN) using timestep-derived scale, shift, and gate parameters
4. Output Projection: Final layer norm with adaptive modulation, followed by linear projection and unpatchifying back to (B, C, F, H, W).
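To make stage 1 concrete, here is a self-contained PyTorch sketch of the patch-embedding math; the latent shape and `inner_dim` below are illustrative values, not a fixed Wan configuration:

```python
import torch
import torch.nn as nn

# Illustrative sizes only; this latent shape is an assumption, not a fixed Wan config.
B, C, F, H, W = 1, 16, 21, 60, 104
inner_dim = 1536
patch_size = (1, 2, 2)

# 3D convolution with kernel == stride == patch_size, as described in stage 1
patch_embedding = nn.Conv3d(C, inner_dim, kernel_size=patch_size, stride=patch_size)

latents = torch.randn(B, C, F, H, W)
tokens = patch_embedding(latents)           # (B, inner_dim, F, H/2, W/2) = (1, 1536, 21, 30, 52)
tokens = tokens.flatten(2).transpose(1, 2)  # (B, seq_len, inner_dim); seq_len = 21 * 30 * 52 = 32760
```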
Rotary Position Embeddings
The WanRotaryPosEmbed module creates factored 3D position encodings:
- Time dimension gets `t_dim` channels
- Height dimension gets `h_dim` channels
- Width dimension gets `w_dim` channels
Where `h_dim = w_dim = 2 * (attention_head_dim // 6)` and `t_dim = attention_head_dim - h_dim - w_dim`. These are combined as `cat([freqs_f, freqs_h, freqs_w])` and applied to query and key vectors in self-attention.
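Plugging in a head dimension of 128 (an assumed value for illustration) shows the channel split and how the per-axis frequency tables are built:

```python
import torch

attention_head_dim = 128  # assumed head dimension, for illustration only

# Factored channel split across (t, h, w), per the formulas above
h_dim = w_dim = 2 * (attention_head_dim // 6)  # 42 channels each for height and width
t_dim = attention_head_dim - h_dim - w_dim     # remaining 44 channels for time

def rope_freqs(positions: torch.Tensor, dim: int, theta: float = 10000.0) -> torch.Tensor:
    # Standard 1D RoPE frequency table: one angle per (position, channel pair)
    inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return torch.outer(positions.to(torch.float32), inv_freq)

# One table per axis of the post-patchify token grid (grid sizes illustrative);
# these are broadcast over the grid and concatenated along the channel axis,
# mirroring cat([freqs_f, freqs_h, freqs_w]) above.
freqs_f = rope_freqs(torch.arange(21), t_dim)  # (21, 22)
freqs_h = rope_freqs(torch.arange(30), h_dim)  # (30, 21)
freqs_w = rope_freqs(torch.arange(52), w_dim)  # (52, 21)
```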
Classifier-Free Guidance
For Wan pipelines, guidance requires two forward passes per step:
```python
# Conditional pass
noise_pred = transformer(latent, timestep, prompt_embeds)
# Unconditional pass
noise_uncond = transformer(latent, timestep, negative_prompt_embeds)
# Guided prediction
noise_pred = noise_uncond + guidance_scale * (noise_pred - noise_uncond)
```
HunyuanVideo uses embedded guidance where the guidance scale is passed as an input embedding, requiring only one forward pass (unless true_cfg_scale > 1).
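A sketch of the embedded-guidance variant, reusing the placeholder names from the snippet above; the `guidance` keyword and the 1000x scaling follow the HunyuanVideo pipeline, but the exact signature here is an assumption:

```python
# Embedded guidance: the scale enters as a conditioning embedding, so only one
# forward pass per step is needed. The `guidance` keyword and 1000x scaling
# are assumptions based on the HunyuanVideo pipeline described above.
guidance = torch.full((latents.shape[0],), guidance_scale * 1000.0, device=latents.device)
noise_pred = transformer(latents, timestep, prompt_embeds, guidance=guidance)
```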
Two-Stage Denoising (Wan 2.2)
Wan supports optional two-stage denoising with `transformer_2` and `boundary_ratio` (sketched below):
- High-noise timesteps (>= boundary) use `transformer` (larger model)
- Low-noise timesteps (< boundary) use `transformer_2` (potentially smaller model)
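A sketch of the per-step expert selection, assuming a scheduler with 1000 training timesteps and the placeholder names used earlier:

```python
# Per-step expert selection; num_train_timesteps = 1000 is an assumption.
boundary = boundary_ratio * 1000
for t in scheduler.timesteps:
    # High-noise steps go to the larger expert, low-noise steps to transformer_2
    current_model = transformer if t >= boundary else transformer_2
    noise_pred = current_model(latents, t, prompt_embeds)
    latents = scheduler.step(noise_pred, t, latents).prev_sample
```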
Usage
The denoising process is invoked automatically when calling the pipeline. Key parameters that affect denoising:
- `num_inference_steps` (default 50) - More steps improve quality but increase latency linearly
- `guidance_scale` (default 5.0) - Higher values increase text fidelity but reduce diversity
- `num_frames` - Affects sequence length and thus attention memory quadratically
- `height` / `width` - Affect patch count and attention memory
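For example, a text-to-video call might look like the following; the model id and parameter values are illustrative:

```python
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

# Model id is illustrative; any Wan text-to-video checkpoint in diffusers format works.
pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

video = pipe(
    prompt="A cat walking through a garden at sunset",
    num_frames=81,           # longer videos increase attention memory quadratically
    height=480,
    width=832,
    num_inference_steps=50,  # more steps improve quality, linearly slower
    guidance_scale=5.0,      # trades text fidelity against diversity
).frames[0]

export_to_video(video, "output.mp4", fps=16)
```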
Related Pages
- Huggingface_Diffusers_WanTransformer3DModel_Forward (implements this principle) - Concrete forward pass API
- Huggingface_Diffusers_Video_Pipeline_Selection (determines architecture) - Pipeline selection determines which transformer is used
- Huggingface_Diffusers_Video_Memory_Management (optimizes this) - CPU offloading reduces memory during denoising
- Huggingface_Diffusers_Video_Input_Preparation (provides inputs) - Initial noise tensor preparation
- Huggingface_Diffusers_Video_Decoding_Export (next step) - Decoding the final denoised latents
Implementation:Huggingface_Diffusers_WanTransformer3DModel_Forward