Principle:Huggingface Diffusers Video Denoising

From Leeroopedia
Principle Name: Video Denoising
Overview: The 3D denoising process for generating temporally coherent video frames using transformer-based architectures
Domains: Video Generation, Diffusion Models, 3D Transformers
Related Implementation: Huggingface_Diffusers_WanTransformer3DModel_Forward
Knowledge Sources: Repo (https://github.com/huggingface/diffusers), Source (src/diffusers/models/transformers/transformer_wan.py:L289-L430)
Last Updated: 2026-02-13 00:00 GMT

Description

Video denoising is the core iterative process that transforms random noise into coherent video frames. Unlike image diffusion, which operates on 4D tensors (batch, channels, height, width), video diffusion operates on 5D tensors and uses 3D transformers that attend jointly over spatial and temporal dimensions, ensuring temporal coherence between frames.

The denoising loop executes N steps (typically 30-50), each performing:

  1. Prepare the noisy latent as model input
  2. Compute the noise prediction via the transformer's forward pass
  3. Apply classifier-free guidance (if enabled)
  4. Step the scheduler to compute the less-noisy latent
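
The four steps above can be sketched as a loop. This is a toy illustration: the scalar latent, `toy_model`, and `toy_scheduler_step` are stand-ins invented for this sketch, not the actual diffusers transformer or scheduler objects.

```python
def toy_model(latent, t, cond):
    # Stand-in noise prediction: in a real pipeline this is the
    # transformer forward pass on the noisy latent.
    return latent - cond

def toy_scheduler_step(noise_pred, latent, lr=0.1):
    # Stand-in scheduler step toward the less-noisy latent.
    return latent - lr * noise_pred

def denoise(latent, cond, uncond, guidance_scale=5.0, steps=30):
    for t in range(steps, 0, -1):
        # 1. Prepare the noisy latent as model input (identity here).
        model_input = latent
        # 2. Compute the noise prediction via the forward pass.
        noise_cond = toy_model(model_input, t, cond)
        # 3. Apply classifier-free guidance (second, unconditional pass).
        noise_uncond = toy_model(model_input, t, uncond)
        noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
        # 4. Step the scheduler to compute the less-noisy latent.
        latent = toy_scheduler_step(noise_pred, latent)
    return latent

result = denoise(latent=10.0, cond=0.0, uncond=1.0)
```

Each iteration performs exactly the four listed steps; real pipelines differ only in what the model input preparation and scheduler update compute.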

Theoretical Basis

3D Transformer Architecture (Wan)

The WanTransformer3DModel processes video latents through four stages:

1. Patch Embedding: The 5D input (B, C, F, H, W) is converted to a sequence of patch tokens via a 3D convolution with kernel and stride equal to patch_size = (1, 2, 2). This produces (B, inner_dim, F, H/2, W/2) which is flattened to (B, seq_len, inner_dim).
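
The resulting sequence length follows directly from the patch size. A minimal shape-arithmetic sketch (the latent grid sizes below are illustrative, not fixed by the library):

```python
def patch_token_count(frames, height, width, patch_size=(1, 2, 2)):
    # With patch_size (1, 2, 2), frames are kept and the spatial grid
    # is halved in each direction, so seq_len = F * (H/2) * (W/2).
    pf, ph, pw = patch_size
    assert frames % pf == 0 and height % ph == 0 and width % pw == 0
    return (frames // pf) * (height // ph) * (width // pw)

# e.g. a 21-frame latent on a 60 x 104 latent grid
seq_len = patch_token_count(21, 60, 104)  # 21 * 30 * 52
```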

2. Condition Embedding: Three conditioning signals are computed:

  • Timestep embedding (temb) - Sinusoidal encoding of the diffusion timestep, projected through an MLP to produce adaptive normalization parameters (6 scale/shift/gate values per block)
  • Text embedding - UMT5 encoder hidden states projected via PixArtAlphaTextProjection
  • Image embedding (optional, for I2V) - CLIP image features projected through WanImageEmbedding

3. Transformer Blocks: Each of the N blocks (40 in the 14B model) applies:

  • Self-attention with rotary position embeddings (RoPE) factored across (t, h, w) dimensions
  • Cross-attention to text (and optional image) embeddings
  • Feed-forward network with GELU-approximate activation
  • Adaptive normalization (AdaLN) using timestep-derived scale, shift, and gate parameters

4. Output Projection: Final layer norm with adaptive modulation, followed by linear projection and unpatchifying back to (B, C, F, H, W).
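
The scale/shift/gate modulation used in the blocks and in the output projection can be sketched on a single scalar "activation". This is a toy illustration of the AdaLN pattern, with values chosen for the example, not code from the library:

```python
def modulate(x, scale, shift):
    # AdaLN-style modulation: normalize (omitted here), then scale and shift
    # using parameters derived from the timestep embedding.
    return x * (1 + scale) + shift

def gated_residual(x, sublayer_out, gate):
    # The gate, also timestep-derived, controls how much the sublayer
    # output contributes to the residual stream.
    return x + gate * sublayer_out

x = 2.0
scale, shift, gate = 0.5, 0.1, 0.25
h = modulate(x, scale, shift)     # 2.0 * 1.5 + 0.1 = 3.1
y = gated_residual(x, h, gate)    # 2.0 + 0.25 * 3.1 = 2.775
```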

Rotary Position Embeddings

The WanRotaryPosEmbed module creates factored 3D position encodings:

  • Time dimension gets t_dim channels
  • Height dimension gets h_dim channels
  • Width dimension gets w_dim channels

Here h_dim = w_dim = 2 * (attention_head_dim // 6) and t_dim = attention_head_dim - h_dim - w_dim. The three factored frequency tables are concatenated as cat([freqs_f, freqs_h, freqs_w]) and applied to the query and key vectors in self-attention.
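
The channel split can be verified numerically; assuming attention_head_dim = 128 for the example:

```python
def rope_dim_split(attention_head_dim):
    # Factored 3D RoPE channel split, as described above: height and
    # width each get 2 * (head_dim // 6) channels, time gets the rest.
    h_dim = w_dim = 2 * (attention_head_dim // 6)
    t_dim = attention_head_dim - h_dim - w_dim
    return t_dim, h_dim, w_dim

t_dim, h_dim, w_dim = rope_dim_split(128)  # (44, 42, 42)
```

The three parts always sum back to the full head dimension, so the concatenated frequencies cover every rotary channel exactly once.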

Classifier-Free Guidance

For Wan pipelines, guidance requires two forward passes per step:

# Conditional pass
noise_pred = transformer(latent, timestep, prompt_embeds)
# Unconditional pass
noise_uncond = transformer(latent, timestep, negative_prompt_embeds)
# Guided prediction
noise_pred = noise_uncond + guidance_scale * (noise_pred - noise_uncond)

HunyuanVideo uses embedded guidance where the guidance scale is passed as an input embedding, requiring only one forward pass (unless true_cfg_scale > 1).

Two-Stage Denoising (Wan 2.2)

Wan supports optional two-stage denoising with transformer_2 and boundary_ratio:

  • High-noise timesteps (>= boundary) use transformer (larger model)
  • Low-noise timesteps (< boundary) use transformer_2 (potentially smaller model)
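
A minimal sketch of the routing decision. The strings stand in for the two model objects, and the boundary_ratio and timestep-range values are illustrative assumptions, not the library's defaults:

```python
def select_transformer(timestep, boundary, transformer, transformer_2):
    # High-noise (early) timesteps go to the first, larger model;
    # low-noise (late) timesteps go to transformer_2.
    return transformer if timestep >= boundary else transformer_2

num_train_timesteps = 1000       # illustrative timestep range
boundary_ratio = 0.9             # illustrative ratio
boundary = boundary_ratio * num_train_timesteps

early = select_transformer(950, boundary, "high_noise_model", "low_noise_model")
late = select_transformer(300, boundary, "high_noise_model", "low_noise_model")
```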

Usage

The denoising process is invoked automatically when calling the pipeline. Key parameters that affect denoising:

  1. num_inference_steps (default 50) - More steps improve quality but increase latency linearly
  2. guidance_scale (default 5.0) - Higher values increase text fidelity but reduce diversity
  3. num_frames - Increases the token sequence length, and attention memory grows quadratically with sequence length
  4. height / width - Affect patch count and attention memory
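
The quadratic scaling in items 3 and 4 can be made concrete with the patch arithmetic from earlier; the latent grid sizes here are illustrative assumptions:

```python
def attention_cost(latent_frames, latent_h, latent_w):
    # With patch size (1, 2, 2), seq_len = F * (H/2) * (W/2);
    # self-attention cost scales with seq_len squared.
    seq_len = latent_frames * (latent_h // 2) * (latent_w // 2)
    return seq_len ** 2

base = attention_cost(21, 60, 104)
double_frames = attention_cost(42, 60, 104)
ratio = double_frames / base  # doubling latent frames quadruples attention cost
```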

Related Pages

Implementation:Huggingface_Diffusers_WanTransformer3DModel_Forward
