Principle:Zai org CogVideo Scheduler Configuration
Overview
Technique for selecting and configuring the noise scheduler that controls the diffusion sampling process during video generation.
Description
The scheduler defines the noise schedule and the step function used during the denoising (sampling) process. CogVideoX supports two schedulers: CogVideoXDPMScheduler, which is based on DPM-Solver (DPM stands for diffusion probabilistic model), and CogVideoXDDIMScheduler. The DPM scheduler is recommended for the 5B models for better quality, while DDIM is recommended for the 2B model. Both are configured with the "trailing" timestep spacing strategy so that the inference timesteps align with the training noise schedule.
The scheduler is responsible for:
- Defining the noise schedule -- The sequence of noise levels from pure noise to clean signal
- Computing the step function -- How to update latents at each denoising step given the model prediction
- Managing timesteps -- Selecting which timesteps to use during the sampling process
Usage
Configure the scheduler after loading the pipeline and before generating videos. The scheduler choice depends on the model variant:
| Model Variant | Recommended Scheduler | Reasoning |
|---|---|---|
| CogVideoX-5b / CogVideoX1.5-5B | CogVideoXDPMScheduler | Better quality with higher-order ODE solver |
| CogVideoX-2b | CogVideoXDDIMScheduler | Better compatibility with 2B model training |
Both schedulers should use timestep_spacing="trailing" for proper alignment with the training noise schedule.
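The selection rule in the table can be sketched as a small helper. This is a minimal sketch, not the project's own code: the function names `recommended_scheduler_name` and `configure_scheduler` are hypothetical, and detecting the 2B variant by the substring "2b" in the model ID is an assumption made for illustration. The scheduler classes and the `from_config(..., timestep_spacing="trailing")` call come from the `diffusers` library.

```python
def recommended_scheduler_name(model_id: str) -> str:
    """Map a model ID to the scheduler class name recommended in the table.

    Assumption: the 2B variant can be recognized by "2b" in its name.
    """
    if "2b" in model_id.lower():
        return "CogVideoXDDIMScheduler"
    return "CogVideoXDPMScheduler"


def configure_scheduler(pipe, model_id: str):
    """Replace the pipeline's scheduler, reusing its config with trailing spacing."""
    import diffusers  # imported lazily so the helper above has no dependencies

    scheduler_cls = getattr(diffusers, recommended_scheduler_name(model_id))
    pipe.scheduler = scheduler_cls.from_config(
        pipe.scheduler.config, timestep_spacing="trailing"
    )
    return pipe


# Example usage (downloads model weights; the model ID is illustrative):
# import torch
# from diffusers import CogVideoXPipeline
# model_id = "THUDM/CogVideoX-5b"
# pipe = CogVideoXPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
# pipe = configure_scheduler(pipe, model_id)
```

Building the new scheduler from `pipe.scheduler.config` keeps the noise-schedule parameters shipped with the checkpoint and overrides only the timestep spacing.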
Theoretical Basis
DPM-Solver
DPM-Solver formulates the reverse diffusion process as an ordinary differential equation (ODE) and solves it with higher-order methods. Its multi-step variants reuse model outputs from previous steps, which gives higher accuracy per step than first-order methods and allows good samples with fewer steps.
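The accuracy benefit of a multi-step method can be seen on a toy problem. The sketch below is not DPM-Solver itself: it compares first-order Euler with the second-order Adams-Bashforth method on the ODE dx/dt = -x, chosen only because its exact solution is known. All names here are illustrative.

```python
import math


def f(x: float) -> float:
    """Right-hand side of the toy ODE dx/dt = -x."""
    return -x


def euler(x0: float, h: float, n: int) -> float:
    """First-order method: uses only the current derivative."""
    x = x0
    for _ in range(n):
        x = x + h * f(x)
    return x


def adams_bashforth2(x0: float, h: float, n: int) -> float:
    """Second-order multi-step method: reuses the previous derivative."""
    x_prev = x0
    x = x0 + h * f(x0)  # bootstrap the first step with Euler
    for _ in range(n - 1):
        x_next = x + h * (1.5 * f(x) - 0.5 * f(x_prev))
        x_prev, x = x, x_next
    return x


# Integrate from t=0 to t=1 in 10 steps and compare with exp(-1).
exact = math.exp(-1.0)
err_euler = abs(euler(1.0, 0.1, 10) - exact)
err_ab2 = abs(adams_bashforth2(1.0, 0.1, 10) - exact)
print(f"Euler error: {err_euler:.5f}, two-step error: {err_ab2:.5f}")
```

With the same number of steps, the two-step method lands much closer to the exact value. In a diffusion sampler the step budget is the number of model forward passes, which is why higher per-step accuracy matters.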
DDIM
DDIM (Denoising Diffusion Implicit Models) uses a deterministic reverse process when no extra noise is injected. Given the model prediction at each step, DDIM first estimates the clean sample and then computes the next, less noisy sample from that estimate. Its non-Markovian formulation allows the sampler to skip timesteps while maintaining sample quality.
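A single deterministic DDIM update can be sketched on scalars. This sketch assumes a noise-prediction (epsilon) parameterization for simplicity, which may differ from the prediction type a given CogVideoX checkpoint uses. The function name and arguments are illustrative, with `alpha_bar_t` and `alpha_bar_prev` denoting the cumulative signal coefficients at the current and target timesteps.

```python
import math


def ddim_step(x_t: float, eps_pred: float,
              alpha_bar_t: float, alpha_bar_prev: float) -> float:
    """One deterministic DDIM step (no added noise), epsilon parameterization."""
    # Estimate the clean sample implied by the current noise prediction.
    x0_pred = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)
    # Re-noise that estimate to the target (lower) noise level.
    return (math.sqrt(alpha_bar_prev) * x0_pred
            + math.sqrt(1.0 - alpha_bar_prev) * eps_pred)
```

Because the target noise level is an argument rather than "the next timestep", the same update can jump across many training timesteps at once, which is what makes step skipping possible.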
Trailing Timestep Spacing
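Trailing timestep spacing aligns the inference schedule with training by selecting timesteps backward from the last training timestep, so sampling starts at the highest noise level seen in training. With other strategies, such as "leading", a short schedule can start below the final training timestep even though the initial latents are pure noise, which creates a mismatch between training and inference. This alignment matters most for generation quality when using far fewer inference steps than training steps.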
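The effect of the spacing strategy is easiest to see by listing the selected timesteps. The sketch below follows the common formulation of "trailing" and "leading" spacing used by diffusers-style schedulers. It is a pure-Python illustration under that assumption, not code copied from the CogVideoX schedulers, and the function names are hypothetical.

```python
def trailing_timesteps(num_train_timesteps: int, num_inference_steps: int) -> list[int]:
    """Count backward from the last training timestep in equal strides."""
    step_ratio = num_train_timesteps / num_inference_steps
    return [
        round(num_train_timesteps - i * step_ratio) - 1
        for i in range(num_inference_steps)
    ]


def leading_timesteps(num_train_timesteps: int, num_inference_steps: int,
                      steps_offset: int = 0) -> list[int]:
    """Count forward from timestep 0 in equal strides, then reverse."""
    step_ratio = num_train_timesteps // num_inference_steps
    return [
        i * step_ratio + steps_offset
        for i in reversed(range(num_inference_steps))
    ]


trailing = trailing_timesteps(1000, 50)
leading = leading_timesteps(1000, 50)
print(trailing[0], trailing[-1])  # first and last trailing timesteps
print(leading[0], leading[-1])    # first and last leading timesteps
```

For 1000 training timesteps and 50 inference steps, the trailing schedule begins at timestep 999, the last training timestep, while the leading schedule begins at 980.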