Principle:Zai org CogVideo Diffusion Sampling

From Leeroopedia


Attribute Value
Principle Name Diffusion Sampling
Workflow SAT Video Generation
Step 4 of 5
Type Core Algorithm
Repository zai-org/CogVideo
Paper CogVideoX
Last Updated 2026-02-10 00:00 GMT

Overview

Technique for generating video latents from noise through iterative denoising using the EulerEDM sampler with classifier-free guidance. This is the core generation step of the SAT video pipeline, transforming random Gaussian noise into coherent video latent representations conditioned on text (and optionally image) inputs.

Description

Diffusion sampling starts from random Gaussian noise and iteratively denoises it using the trained transformer model. The process consists of several key components:

  1. Noise initialization: A random Gaussian tensor of shape (B, T, C, H//F, W//F) is sampled, where F=8 is the VAE downsampling factor.
  2. EulerEDM sampler: Uses an Euler discretization of the probability flow ODE to step from noise toward clean latents. The sampler follows a predefined noise schedule with decreasing noise levels.
  3. Classifier-free guidance (CFG): At each denoising step, the model runs both conditional (with text embedding) and unconditional (with null embedding) forward passes. The guided prediction amplifies the difference between them.
  4. DynamicCFG (optional): Varies the guidance scale during sampling, typically using higher guidance early in the process and lower guidance later, for improved quality and diversity.
  5. Image-to-video (I2V): For I2V mode, image latents are concatenated with the noise tensor before each denoising step, and an offset embedding ofs is provided to control temporal dynamics.
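The components above can be combined into a minimal sampling loop. The sketch below is illustrative, not the actual repository code: `denoise` stands in for the transformer-based denoiser, `sigmas` is a decreasing noise schedule ending at 0, and all names are hypothetical. Guidance is applied to the denoised predictions, which is equivalent to applying it to noise predictions.

```python
import numpy as np

def euler_edm_sample(denoise, shape, sigmas, cond, uncond, scale=6.0, rng=None):
    """Minimal EulerEDM sampling loop with classifier-free guidance (a sketch).

    denoise(x, sigma, c) is a stand-in for the trained transformer denoiser,
    which predicts the clean latent from a noisy one at noise level sigma.
    """
    rng = rng or np.random.default_rng(0)
    # 1. Noise initialization at the highest noise level
    x = rng.standard_normal(shape) * sigmas[0]
    for i in range(len(sigmas) - 1):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        # 3. CFG: conditional and unconditional forward passes
        d_cond = denoise(x, sigma, cond)
        d_uncond = denoise(x, sigma, uncond)
        denoised = d_uncond + scale * (d_cond - d_uncond)
        # 2. Euler step on the probability flow ODE:
        # (x - denoised) / sigma estimates the derivative dx/dsigma
        d = (x - denoised) / sigma
        x = x + (sigma_next - sigma) * d
    return x
```

In practice the two CFG forward passes are usually batched together into a single model call for efficiency.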

Usage

Use Diffusion Sampling after model loading and prompt encoding, and before VAE decoding. The sampling step is the most computationally intensive part of the pipeline. Shape parameters must be derived from the configured resolution and frame count.
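A small helper, following the (B, T, C, H//F, W//F) shape convention from the description above, shows how the latent shape is derived from the configured resolution. The function name, the default channel count, and the divisibility check are illustrative assumptions, not the pipeline's actual API.

```python
def latent_shape(batch, latent_frames, height, width, channels=16, f=8):
    """Derive the video latent shape (B, T, C, H // F, W // F).

    f=8 is the VAE spatial downsampling factor; latent_frames is the
    frame count in latent space (channels=16 is an assumed default).
    """
    if height % f != 0 or width % f != 0:
        raise ValueError("height and width must be divisible by the VAE factor F")
    return (batch, latent_frames, channels, height // f, width // f)
```

For example, a 480x720 configuration with 13 latent frames yields a (1, 13, 16, 60, 90) noise tensor.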

Theoretical Basis

The theoretical foundation rests on the probability flow ODE formulation of diffusion models:

Probability flow ODE:

dx = [f(x,t) - (1/2) g^2(t) * grad_x log p_t(x)] dt

Euler discretization:

x_{i+1} = x_i + (sigma_{i+1} - sigma_i) * (x_i - denoiser(x_i, sigma_i, c)) / sigma_i

where denoiser(x_i, sigma_i, c) is the trained transformer model that predicts the clean signal given the current noisy latent, noise level, and conditioning; the term (x_i - denoiser(x_i, sigma_i, c)) / sigma_i estimates the ODE derivative dx/dsigma, and the step size sigma_{i+1} - sigma_i is negative since the schedule decreases toward zero noise.
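A single EDM-style Euler step can be written directly from this update rule. This is a sketch with hypothetical names; `denoised` stands for the denoiser's clean-signal prediction at the current noise level.

```python
def euler_step(x, denoised, sigma, sigma_next):
    """One EulerEDM update: step x from noise level sigma to sigma_next."""
    d = (x - denoised) / sigma            # estimated derivative dx/dsigma
    return x + (sigma_next - sigma) * d   # Euler step toward lower noise
```

For instance, with x = 4.0, a clean prediction of 0.0, and a step from sigma = 2.0 to sigma_next = 1.0, the update returns 2.0, halving the distance to the predicted clean signal.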

Classifier-free guidance:

epsilon_guided = epsilon_uncond + scale * (epsilon_cond - epsilon_uncond)

This formulation amplifies the effect of conditioning by a factor of scale (typically 6.0-7.5), steering the generation more strongly toward the text prompt at the cost of reduced diversity.
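The guidance formula is a one-liner; the sketch below mirrors it term by term (function name is illustrative). Note that scale = 1 recovers the purely conditional prediction, and scale = 0 the unconditional one.

```python
def cfg(eps_uncond, eps_cond, scale=6.0):
    """Classifier-free guidance: amplify the conditional direction by `scale`."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```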

DynamicCFG makes the guidance scale a function of the timestep, scale(t). A common schedule uses higher guidance in early (noisy) steps for structural coherence and lower guidance in later (clean) steps for detail quality.
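One such schedule can be sketched as a linear anneal from a high to a low guidance scale over the sampling steps. This is illustrative only; the schedule shape used in the actual pipeline (e.g. cosine-based) and the endpoint values may differ.

```python
def dynamic_cfg_scale(step, num_steps, scale_max=7.5, scale_min=1.5):
    """Anneal guidance linearly from scale_max (noisy steps) to scale_min (clean steps)."""
    progress = step / max(num_steps - 1, 1)
    return scale_max + (scale_min - scale_max) * progress
```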

For I2V generation, the image latents are concatenated along the channel dimension before the transformer forward pass, providing spatial conditioning that anchors the first frame of the generated video.
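The I2V channel concatenation can be sketched as follows, assuming a (B, T, C, H, W) latent layout so the channel axis is 2; the function name and layout assumption are illustrative.

```python
import numpy as np

def i2v_model_input(noisy_latents, image_latents):
    """Concatenate image latents with noisy video latents along the channel axis.

    Assumes a (B, T, C, H, W) layout; the combined tensor is fed to the
    transformer at every denoising step to anchor the first frame.
    """
    return np.concatenate([noisy_latents, image_latents], axis=2)
```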
