Principle:Zai org CogVideo Diffusion Sampling
| Attribute | Value |
|---|---|
| Principle Name | Diffusion Sampling |
| Workflow | SAT Video Generation |
| Step | 4 of 5 |
| Type | Core Algorithm |
| Repository | zai-org/CogVideo |
| Paper | CogVideoX |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Technique for generating video latents from noise through iterative denoising using the EulerEDM sampler with classifier-free guidance. This is the core generation step of the SAT video pipeline, transforming random Gaussian noise into coherent video latent representations conditioned on text (and optionally image) inputs.
Description
Diffusion sampling starts from random Gaussian noise and iteratively denoises it using the trained transformer model. The process consists of several key components:
- Noise initialization: A random Gaussian tensor of shape (B, T, C, H//F, W//F) is sampled, where F=8 is the VAE spatial downsampling factor.
- EulerEDM sampler: Uses an Euler discretization of the probability flow ODE to step from noise toward clean latents. The sampler follows a predefined noise schedule with decreasing noise levels.
- Classifier-free guidance (CFG): At each denoising step, the model runs both conditional (with text embedding) and unconditional (with null embedding) forward passes. The guided prediction amplifies the difference between them.
- DynamicCFG (optional): Varies the guidance scale during sampling, typically using higher guidance early in the process and lower guidance later, for improved quality and diversity.
- Image-to-video (I2V): For I2V mode, image latents are concatenated with the noise tensor before each denoising step, and an offset embedding ofs is provided to control temporal dynamics.
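The components above can be sketched as a minimal sampling loop. This is an illustrative numpy sketch, not the repository's implementation: the real pipeline operates on PyTorch tensors, and the denoiser here is a stand-in for the SAT transformer.

```python
import numpy as np

def sample(denoiser, text_cond, null_cond, sigmas, shape, scale=6.0, seed=0):
    """Euler sampling loop with classifier-free guidance (illustrative sketch).

    denoiser(x, sigma, cond) -> predicted clean latent.
    sigmas: decreasing noise levels, ending at 0.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape) * sigmas[0]  # noise initialization
    for i in range(len(sigmas) - 1):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        # CFG: conditional and unconditional forward passes
        x0_cond = denoiser(x, sigma, text_cond)
        x0_uncond = denoiser(x, sigma, null_cond)
        x0 = x0_uncond + scale * (x0_cond - x0_uncond)
        # Euler step along the ODE direction d = (x - x0) / sigma
        d = (x - x0) / sigma
        x = x + (sigma_next - sigma) * d
    return x
```

With a denoiser that predicts zero, each step shrinks the latent by sigma_next/sigma, so the final step (sigma_next = 0) returns an all-zero latent, which makes the loop easy to sanity-check.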
Usage
Use Diffusion Sampling after model loading and prompt encoding, and before VAE decoding. The sampling step is the most computationally intensive part of the pipeline. Shape parameters must be derived from the configured resolution and frame count.
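The shape derivation can be sketched as follows. The defaults are typical CogVideoX settings and are an assumption here (16 latent channels, 8x spatial and 4x temporal VAE compression); check the model config for the exact values.

```python
def latent_shape(batch, num_frames, height, width,
                 latent_channels=16, spatial_factor=8, temporal_factor=4):
    """Derive the latent tensor shape (B, T, C, H//F, W//F) from the
    pixel-space resolution and frame count. Defaults are assumed
    CogVideoX-style values, not read from the repo config."""
    # The first frame is kept; remaining frames are temporally compressed.
    t = (num_frames - 1) // temporal_factor + 1
    return (batch, t, latent_channels,
            height // spatial_factor, width // spatial_factor)
```

For example, a 49-frame 480x720 generation would yield a (B, 13, 16, 60, 90) latent under these assumptions.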
Theoretical Basis
The theoretical foundation rests on the probability flow ODE formulation of diffusion models:
Probability flow ODE:
dx = [f(x,t) - (1/2) * g^2(t) * nabla_x log p_t(x)] dt
Euler discretization (EDM form):
d_t = (x_t - denoiser(x_t, t, c)) / sigma_t
x_{t-1} = x_t + (sigma_{t-1} - sigma_t) * d_t
where denoiser(x_t, t, c) is the trained transformer model that predicts the clean signal given the current noisy latent x_t, noise level sigma_t, and conditioning c, and d_t is the resulting estimate of the ODE derivative.
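A single Euler step can be worked through with concrete numbers. This scalar example assumes a denoiser that predicts zero, purely for illustration:

```python
# One Euler step of the probability-flow ODE (scalar latent, illustrative).
x_t = 4.0         # current noisy latent value
sigma_t = 2.0     # current noise level
sigma_prev = 1.0  # next (lower) noise level
x0_pred = 0.0     # denoiser output, assumed zero here

d_t = (x_t - x0_pred) / sigma_t               # ODE derivative estimate: 2.0
x_prev = x_t + (sigma_prev - sigma_t) * d_t   # 4.0 + (-1.0) * 2.0 = 2.0
```

Note the latent shrinks by exactly sigma_prev/sigma_t when the predicted clean signal is zero, matching the geometric decay of pure noise along the ODE trajectory.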
Classifier-free guidance:
epsilon_guided = epsilon_uncond + scale * (epsilon_cond - epsilon_uncond)
This formulation amplifies the effect of conditioning by a factor of scale (typically 6.0-7.5), steering the generation more strongly toward the text prompt at the cost of reduced diversity.
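The guidance formula is a one-line extrapolation from the unconditional prediction toward the conditional one:

```python
def cfg_combine(eps_uncond, eps_cond, scale=6.0):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one by the guidance scale."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

With scale = 1 this reduces to the plain conditional prediction; scales above 1 push the output further along the conditional direction.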
DynamicCFG varies the guidance scale scale(t) during sampling. A common schedule uses higher guidance in early (noisy) steps for structural coherence and lower guidance in later (clean) steps for detail quality.
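One simple schedule matching this description is a linear ramp from a high to a low scale. This is an illustrative choice; the actual DynamicCFG schedule used in the repository may have a different functional form.

```python
def dynamic_cfg_scale(step, num_steps, scale_hi=7.5, scale_lo=4.0):
    """Linearly anneal the guidance scale from scale_hi (early, noisy
    steps) to scale_lo (late, clean steps). Illustrative schedule;
    scale_hi and scale_lo are assumed values."""
    frac = step / max(num_steps - 1, 1)
    return scale_hi + frac * (scale_lo - scale_hi)
```

The returned scale would replace the fixed `scale` argument at each iteration of the sampling loop.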
For I2V generation, the image latents are concatenated along the channel dimension before the transformer forward pass, providing spatial conditioning that anchors the first frame of the generated video.
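The channel-wise concatenation can be sketched as below. The latent shape is an assumed CogVideoX-like value, and the zero-padded image latent is a simplification of however the first-frame latent is actually laid out over time.

```python
import numpy as np

# I2V conditioning sketch: concatenate image latents with the noisy
# latents along the channel dimension before the transformer forward pass.
B, T, C, H, W = 1, 13, 16, 60, 90            # assumed latent shape
noisy = np.random.randn(B, T, C, H, W)
image_latents = np.zeros((B, T, C, H, W))    # first-frame latent, zero-padded in time
model_input = np.concatenate([noisy, image_latents], axis=2)  # channels double
```

The transformer's input projection must accept the doubled channel count (here 32) for this conditioning path to work.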
Related Pages
- Implementation:Zai_org_CogVideo_SAT_Diffusion_Sample -- Implementation of the sampling function
- Zai_org_CogVideo_SAT_Prompt_Input -- Previous step: prompt input providing text conditioning
- Zai_org_CogVideo_SAT_Video_Decoding_and_Export -- Next step: decoding latents to pixel-space video
- Zai_org_CogVideo_SAT_Model_Loading_for_Inference -- Model loaded for sampling