Principle:Huggingface Diffusers Video Denoising

From Leeroopedia
Principle Name: Video Denoising
Overview: The 3D denoising process for generating temporally coherent video frames using transformer-based architectures
Domains: Video Generation, Diffusion Models, 3D Transformers
Related Implementation: Huggingface_Diffusers_WanTransformer3DModel_Forward
Knowledge Sources: Repo (https://github.com/huggingface/diffusers), Source (src/diffusers/models/transformers/transformer_wan.py:L289-L430)
Last Updated: 2026-02-13 00:00 GMT

Description

Video denoising is the core iterative process that transforms random noise into coherent video frames. Unlike image diffusion, which operates on 4D tensors (batch, channels, height, width), video diffusion operates on 5D tensors and uses 3D transformers that attend jointly over spatial and temporal dimensions, ensuring temporal coherence between frames.

The denoising loop executes N steps (typically 30-50), each performing:

  1. Prepare the noisy latent as model input
  2. Compute the noise prediction via the transformer's forward pass
  3. Apply classifier-free guidance (if enabled)
  4. Step the scheduler to compute the less-noisy latent
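
The four steps above can be sketched as a loop. This is a toy illustration: the scalar latent, `toy_model`, and `toy_scheduler_step` are stand-ins invented for this sketch, not the actual diffusers transformer or scheduler objects.

```python
def toy_model(latent, t, cond):
    # Stand-in noise prediction: in a real pipeline this is the
    # transformer forward pass on the noisy latent.
    return latent - cond

def toy_scheduler_step(noise_pred, latent, lr=0.1):
    # Stand-in scheduler step toward the less-noisy latent.
    return latent - lr * noise_pred

def denoise(latent, cond, uncond, guidance_scale=5.0, steps=30):
    for t in range(steps, 0, -1):
        # 1. Prepare the noisy latent as model input (identity here).
        model_input = latent
        # 2. Compute the noise prediction via the forward pass.
        noise_cond = toy_model(model_input, t, cond)
        # 3. Apply classifier-free guidance (second, unconditional pass).
        noise_uncond = toy_model(model_input, t, uncond)
        noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
        # 4. Step the scheduler to compute the less-noisy latent.
        latent = toy_scheduler_step(noise_pred, latent)
    return latent

result = denoise(latent=10.0, cond=0.0, uncond=1.0)
```

Each iteration performs exactly the four listed steps; real pipelines differ only in what the model input preparation and scheduler update compute.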

Theoretical Basis

3D Transformer Architecture (Wan)

The WanTransformer3DModel processes video latents through four stages:

1. Patch Embedding: The 5D input (B, C, F, H, W) is converted to a sequence of patch tokens via a 3D convolution with kernel and stride equal to patch_size = (1, 2, 2). This produces (B, inner_dim, F, H/2, W/2) which is flattened to (B, seq_len, inner_dim).
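
The resulting sequence length follows directly from the patch size. A minimal shape-arithmetic sketch (the latent grid sizes below are illustrative, not fixed by the library):

```python
def patch_token_count(frames, height, width, patch_size=(1, 2, 2)):
    # With patch_size (1, 2, 2), frames are kept and the spatial grid
    # is halved in each direction, so seq_len = F * (H/2) * (W/2).
    pf, ph, pw = patch_size
    assert frames % pf == 0 and height % ph == 0 and width % pw == 0
    return (frames // pf) * (height // ph) * (width // pw)

# e.g. a 21-frame latent on a 60 x 104 latent grid
seq_len = patch_token_count(21, 60, 104)  # 21 * 30 * 52
```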

2. Condition Embedding: Three conditioning signals are computed:

  • Timestep embedding (temb) - Sinusoidal encoding of the diffusion timestep, projected through an MLP to produce adaptive normalization parameters (6 scale/shift/gate values per block)
  • Text embedding - UMT5 encoder hidden states projected via PixArtAlphaTextProjection
  • Image embedding (optional, for I2V) - CLIP image features projected through WanImageEmbedding

3. Transformer Blocks: Each of the N blocks (40 in the 14B model) applies:

  • Self-attention with rotary position embeddings (RoPE) factored across (t, h, w) dimensions
  • Cross-attention to text (and optional image) embeddings
  • Feed-forward network with GELU-approximate activation
  • Adaptive normalization (AdaLN) using timestep-derived scale, shift, and gate parameters

4. Output Projection: Final layer norm with adaptive modulation, followed by linear projection and unpatchifying back to (B, C, F, H, W).
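
The scale/shift/gate modulation used in the blocks and in the output projection can be sketched on a single scalar "activation". This is a toy illustration of the AdaLN pattern, with values chosen for the example, not code from the library:

```python
def modulate(x, scale, shift):
    # AdaLN-style modulation: normalize (omitted here), then scale and shift
    # using parameters derived from the timestep embedding.
    return x * (1 + scale) + shift

def gated_residual(x, sublayer_out, gate):
    # The gate, also timestep-derived, controls how much the sublayer
    # output contributes to the residual stream.
    return x + gate * sublayer_out

x = 2.0
scale, shift, gate = 0.5, 0.1, 0.25
h = modulate(x, scale, shift)     # 2.0 * 1.5 + 0.1 = 3.1
y = gated_residual(x, h, gate)    # 2.0 + 0.25 * 3.1 = 2.775
```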

Rotary Position Embeddings

The WanRotaryPosEmbed module creates factored 3D position encodings:

  • Time dimension gets t_dim channels
  • Height dimension gets h_dim channels
  • Width dimension gets w_dim channels

Here h_dim = w_dim = 2 * (attention_head_dim // 6) and t_dim = attention_head_dim - h_dim - w_dim. The three factored frequency tables are concatenated as cat([freqs_f, freqs_h, freqs_w]) and applied to the query and key vectors in self-attention.
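
The channel split can be verified numerically; assuming attention_head_dim = 128 for the example:

```python
def rope_dim_split(attention_head_dim):
    # Factored 3D RoPE channel split, as described above: height and
    # width each get 2 * (head_dim // 6) channels, time gets the rest.
    h_dim = w_dim = 2 * (attention_head_dim // 6)
    t_dim = attention_head_dim - h_dim - w_dim
    return t_dim, h_dim, w_dim

t_dim, h_dim, w_dim = rope_dim_split(128)  # (44, 42, 42)
```

The three parts always sum back to the full head dimension, so the concatenated frequencies cover every rotary channel exactly once.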

Classifier-Free Guidance

For Wan pipelines, guidance requires two forward passes per step:

# Conditional pass
noise_pred = transformer(latent, timestep, prompt_embeds)
# Unconditional pass
noise_uncond = transformer(latent, timestep, negative_prompt_embeds)
# Guided prediction
noise_pred = noise_uncond + guidance_scale * (noise_pred - noise_uncond)

HunyuanVideo uses embedded guidance where the guidance scale is passed as an input embedding, requiring only one forward pass (unless true_cfg_scale > 1).

Two-Stage Denoising (Wan 2.2)

Wan supports optional two-stage denoising with transformer_2 and boundary_ratio:

  • High-noise timesteps (>= boundary) use transformer (larger model)
  • Low-noise timesteps (< boundary) use transformer_2 (potentially smaller model)
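
A minimal sketch of the routing decision. The strings stand in for the two model objects, and the boundary_ratio and timestep-range values are illustrative assumptions, not the library's defaults:

```python
def select_transformer(timestep, boundary, transformer, transformer_2):
    # High-noise (early) timesteps go to the first, larger model;
    # low-noise (late) timesteps go to transformer_2.
    return transformer if timestep >= boundary else transformer_2

num_train_timesteps = 1000       # illustrative timestep range
boundary_ratio = 0.9             # illustrative ratio
boundary = boundary_ratio * num_train_timesteps

early = select_transformer(950, boundary, "high_noise_model", "low_noise_model")
late = select_transformer(300, boundary, "high_noise_model", "low_noise_model")
```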

Usage

The denoising process is invoked automatically when calling the pipeline. Key parameters that affect denoising:

  1. num_inference_steps (default 50) - More steps improve quality but increase latency linearly
  2. guidance_scale (default 5.0) - Higher values increase text fidelity but reduce diversity
  3. num_frames - Increases the token sequence length, and attention memory grows quadratically with sequence length
  4. height / width - Affect patch count and attention memory
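
The quadratic scaling in items 3 and 4 can be made concrete with the patch arithmetic from earlier; the latent grid sizes here are illustrative assumptions:

```python
def attention_cost(latent_frames, latent_h, latent_w):
    # With patch size (1, 2, 2), seq_len = F * (H/2) * (W/2);
    # self-attention cost scales with seq_len squared.
    seq_len = latent_frames * (latent_h // 2) * (latent_w // 2)
    return seq_len ** 2

base = attention_cost(21, 60, 104)
double_frames = attention_cost(42, 60, 104)
ratio = double_frames / base  # doubling latent frames quadruples attention cost
```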

Related Pages

Implementation:Huggingface_Diffusers_WanTransformer3DModel_Forward
