Principle:Zai org CogVideo Diffusion Sampling

From Leeroopedia


Attribute Value
Principle Name Diffusion Sampling
Workflow SAT Video Generation
Step 4 of 5
Type Core Algorithm
Repository zai-org/CogVideo
Paper CogVideoX
Last Updated 2026-02-10 00:00 GMT

Overview

Technique for generating video latents from noise through iterative denoising using the EulerEDM sampler with classifier-free guidance. This is the core generation step of the SAT video pipeline, transforming random Gaussian noise into coherent video latent representations conditioned on text (and optionally image) inputs.

Description

Diffusion sampling starts from random Gaussian noise and iteratively denoises it using the trained transformer model. The process consists of several key components:

  1. Noise initialization: A random Gaussian tensor of shape (B, T, C, H//F, W//F) is sampled, where F=8 is the VAE downsampling factor.
  2. EulerEDM sampler: Uses an Euler discretization of the probability flow ODE to step from noise toward clean latents. The sampler follows a predefined noise schedule with decreasing noise levels.
  3. Classifier-free guidance (CFG): At each denoising step, the model runs both conditional (with text embedding) and unconditional (with null embedding) forward passes. The guided prediction amplifies the difference between them.
  4. DynamicCFG (optional): Varies the guidance scale during sampling, typically using higher guidance early in the process and lower guidance later, for improved quality and diversity.
  5. Image-to-video (I2V): For I2V mode, image latents are concatenated with the noise tensor before each denoising step, and an offset embedding ofs is provided to control temporal dynamics.
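The components above can be combined into a minimal sampling loop. The sketch below is illustrative, not the actual repository code: `denoise` stands in for the transformer-based denoiser, `sigmas` is a decreasing noise schedule ending at 0, and all names are hypothetical. Guidance is applied to the denoised predictions, which is equivalent to applying it to noise predictions.

```python
import numpy as np

def euler_edm_sample(denoise, shape, sigmas, cond, uncond, scale=6.0, rng=None):
    """Minimal EulerEDM sampling loop with classifier-free guidance (a sketch).

    denoise(x, sigma, c) is a stand-in for the trained transformer denoiser,
    which predicts the clean latent from a noisy one at noise level sigma.
    """
    rng = rng or np.random.default_rng(0)
    # 1. Noise initialization at the highest noise level
    x = rng.standard_normal(shape) * sigmas[0]
    for i in range(len(sigmas) - 1):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        # 3. CFG: conditional and unconditional forward passes
        d_cond = denoise(x, sigma, cond)
        d_uncond = denoise(x, sigma, uncond)
        denoised = d_uncond + scale * (d_cond - d_uncond)
        # 2. Euler step on the probability flow ODE:
        # (x - denoised) / sigma estimates the derivative dx/dsigma
        d = (x - denoised) / sigma
        x = x + (sigma_next - sigma) * d
    return x
```

In practice the two CFG forward passes are usually batched together into a single model call for efficiency.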

Usage

Use Diffusion Sampling after model loading and prompt encoding, and before VAE decoding. The sampling step is the most computationally intensive part of the pipeline. Shape parameters must be derived from the configured resolution and frame count.
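A small helper, following the (B, T, C, H//F, W//F) shape convention from the description above, shows how the latent shape is derived from the configured resolution. The function name, the default channel count, and the divisibility check are illustrative assumptions, not the pipeline's actual API.

```python
def latent_shape(batch, latent_frames, height, width, channels=16, f=8):
    """Derive the video latent shape (B, T, C, H // F, W // F).

    f=8 is the VAE spatial downsampling factor; latent_frames is the
    frame count in latent space (channels=16 is an assumed default).
    """
    if height % f != 0 or width % f != 0:
        raise ValueError("height and width must be divisible by the VAE factor F")
    return (batch, latent_frames, channels, height // f, width // f)
```

For example, a 480x720 configuration with 13 latent frames yields a (1, 13, 16, 60, 90) noise tensor.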

Theoretical Basis

The theoretical foundation rests on the probability flow ODE formulation of diffusion models:

Probability flow ODE:

dx = [f(x,t) - (1/2) g^2(t) * grad_x log p_t(x)] dt

Euler discretization:

x_{i+1} = x_i + (sigma_{i+1} - sigma_i) * (x_i - denoiser(x_i, sigma_i, c)) / sigma_i

where denoiser(x_i, sigma_i, c) is the trained transformer model that predicts the clean signal given the current noisy latent, noise level, and conditioning; the term (x_i - denoiser(x_i, sigma_i, c)) / sigma_i estimates the ODE derivative dx/dsigma, and the step size sigma_{i+1} - sigma_i is negative since the schedule decreases toward zero noise.
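A single EDM-style Euler step can be written directly from this update rule. This is a sketch with hypothetical names; `denoised` stands for the denoiser's clean-signal prediction at the current noise level.

```python
def euler_step(x, denoised, sigma, sigma_next):
    """One EulerEDM update: step x from noise level sigma to sigma_next."""
    d = (x - denoised) / sigma            # estimated derivative dx/dsigma
    return x + (sigma_next - sigma) * d   # Euler step toward lower noise
```

For instance, with x = 4.0, a clean prediction of 0.0, and a step from sigma = 2.0 to sigma_next = 1.0, the update returns 2.0, halving the distance to the predicted clean signal.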

Classifier-free guidance:

epsilon_guided = epsilon_uncond + scale * (epsilon_cond - epsilon_uncond)

This formulation amplifies the effect of conditioning by a factor of scale (typically 6.0-7.5), steering the generation more strongly toward the text prompt at the cost of reduced diversity.
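The guidance formula is a one-liner; the sketch below mirrors it term by term (function name is illustrative). Note that scale = 1 recovers the purely conditional prediction, and scale = 0 the unconditional one.

```python
def cfg(eps_uncond, eps_cond, scale=6.0):
    """Classifier-free guidance: amplify the conditional direction by `scale`."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```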

DynamicCFG makes the guidance scale a function of the timestep, scale(t). A common schedule uses higher guidance in early (noisy) steps for structural coherence and lower guidance in later (clean) steps for detail quality.
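One such schedule can be sketched as a linear anneal from a high to a low guidance scale over the sampling steps. This is illustrative only; the schedule shape used in the actual pipeline (e.g. cosine-based) and the endpoint values may differ.

```python
def dynamic_cfg_scale(step, num_steps, scale_max=7.5, scale_min=1.5):
    """Anneal guidance linearly from scale_max (noisy steps) to scale_min (clean steps)."""
    progress = step / max(num_steps - 1, 1)
    return scale_max + (scale_min - scale_max) * progress
```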

For I2V generation, the image latents are concatenated along the channel dimension before the transformer forward pass, providing spatial conditioning that anchors the first frame of the generated video.
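The I2V channel concatenation can be sketched as follows, assuming a (B, T, C, H, W) latent layout so the channel axis is 2; the function name and layout assumption are illustrative.

```python
import numpy as np

def i2v_model_input(noisy_latents, image_latents):
    """Concatenate image latents with noisy video latents along the channel axis.

    Assumes a (B, T, C, H, W) layout; the combined tensor is fed to the
    transformer at every denoising step to anchor the first frame.
    """
    return np.concatenate([noisy_latents, image_latents], axis=2)
```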
