Principle:Zai org CogVideo DDIM Pipeline Loading
| Attribute | Value |
|---|---|
| Principle Name | DDIM Pipeline Loading |
| Workflow | Video Editing DDIM Inversion |
| Step | 2 of 6 |
| Type | Model Initialization |
| Repository | zai-org/CogVideo |
| Paper | CogVideoX |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Technique for loading the CogVideoX pipeline with DDIM-specific schedulers for video inversion and editing. DDIM inversion requires both a forward and an inverse scheduler, and it is supported only on the CogVideoX-5B variant, which uses rotary positional embeddings.
Description
DDIM inversion requires loading the CogVideoX pipeline with two schedulers:
- CogVideoXDDIMScheduler (forward): Used during reconstruction to denoise from inverted noise back to clean video. Here "forward" means the normal deterministic DDIM sampling direction (noise to video), not the diffusion forward (noising) process.
- DDIMInverseScheduler (inverse): Used during inversion to map clean video latents to their noise-space representations. This scheduler reverses the forward DDIM steps.
The pipeline is moved to the GPU directly, without CPU offloading. The inversion and reconstruction passes both run in the same session, and offloading would add excessive host-to-device transfer overhead to each pass.
Important constraint: Only the CogVideoX-5B variant is supported for DDIM inversion because it uses rotary positional embeddings, which are required for the inversion process to produce faithful reconstructions. The 2B variant does not use rotary positional embeddings and is therefore not supported.
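The loading steps described above can be sketched as follows. This is a minimal sketch rather than the repository's implementation: the helper names `require_rotary` and `load_ddim_pipeline` are hypothetical, the `bfloat16` dtype is an assumption, and building both schedulers with `from_config` from the pipeline's own scheduler config is one reasonable way to keep their noise schedules identical.

```python
def require_rotary(transformer_config) -> None:
    """Raise if the transformer was not built with rotary positional embeddings."""
    if not getattr(transformer_config, "use_rotary_positional_embeddings", False):
        raise NotImplementedError("DDIM inversion is only supported for CogVideoX-5B.")


def load_ddim_pipeline(model_path: str, device: str = "cuda"):
    """Hypothetical helper returning (pipeline, forward_scheduler, inverse_scheduler)."""
    # Imported lazily so this module can be imported without torch/diffusers installed.
    import torch
    from diffusers import (
        CogVideoXDDIMScheduler,
        CogVideoXPipeline,
        DDIMInverseScheduler,
    )

    # Assumption: bfloat16 weights; adjust to whatever the checkpoint expects.
    pipeline = CogVideoXPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16)
    pipeline.to(device)  # whole pipeline on the GPU, no CPU offloading

    # Fail early on checkpoints without rotary embeddings (for example the 2B variant).
    require_rotary(pipeline.transformer.config)

    # Both schedulers share the pipeline's scheduler config, so they use the same
    # noise schedule in the inversion and reconstruction directions.
    forward_scheduler = CogVideoXDDIMScheduler.from_config(pipeline.scheduler.config)
    inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
    return pipeline, forward_scheduler, inverse_scheduler
```

A call such as `load_ddim_pipeline(model_path)` would take the path or hub identifier of a CogVideoX-5B checkpoint. Checking the rotary flag immediately after loading surfaces an unsupported checkpoint before any expensive encoding or inversion work starts.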
Usage
Use DDIM Pipeline Loading at the beginning of the video editing workflow, before video encoding and inversion. The loaded pipeline provides every component the later steps rely on: the VAE (for encoding/decoding), the transformer (for denoising), the text encoder (for prompt conditioning), and the schedulers (for forward/inverse DDIM).
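Because later steps depend on these components, a quick sanity check right after loading can be useful. The sketch below is illustrative only: `missing_components` is a hypothetical helper, and the attribute names are assumed to match the component names a loaded CogVideoX pipeline exposes.

```python
# Components the later workflow steps rely on (attribute names are an assumption).
REQUIRED_COMPONENTS = ("vae", "transformer", "text_encoder", "scheduler")


def missing_components(pipeline) -> list:
    """Return the names of required components that are absent or None."""
    return [
        name
        for name in REQUIRED_COMPONENTS
        if getattr(pipeline, name, None) is None
    ]
```

An empty list means the pipeline object carries everything that encoding, inversion, and reconstruction need.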
Theoretical Basis
DDIM inversion requires a deterministic (non-stochastic) scheduler to ensure invertibility. The standard DDPM scheduler introduces random noise at each step, making the process non-invertible. The DDIM scheduler (with eta = 0) removes this stochasticity by using a deterministic mapping:
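Forward (denoising) DDIM step:
x_{t-1} = sqrt(alpha_{t-1}) * x_0_pred + sqrt(1 - alpha_{t-1}) * epsilon_pred
Inverse DDIM step (reversing the above):
x_{t+1} = sqrt(alpha_{t+1}) * x_0_pred + sqrt(1 - alpha_{t+1}) * epsilon_pred
In both steps, alpha_t is the cumulative product of the noise schedule at step t, epsilon_pred is the model's noise prediction at (x_t, t), and x_0_pred = (x_t - sqrt(1 - alpha_t) * epsilon_pred) / sqrt(alpha_t).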
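The DDIMInverseScheduler reverses the forward DDIM process, mapping clean latents to their noise-space representations. The deterministic nature of DDIM ensures that forward(inverse(x)) approximately equals x, enabling faithful reconstruction. The match is approximate rather than exact because each inverse step evaluates the noise prediction at the current latent x_t instead of at the not-yet-known x_{t+1}.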
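The round trip can be illustrated numerically with scalars. This is a toy sketch, not the scheduler code: `ddim_step`, `run`, `constant_eps`, and `toy_eps` are hypothetical names, the alpha values are made up, and the toy noise predictors stand in for the transformer.

```python
import math


def ddim_step(x, eps, alpha_from, alpha_to):
    """Deterministically move x from noise level alpha_from to alpha_to."""
    x0_pred = (x - math.sqrt(1.0 - alpha_from) * eps) / math.sqrt(alpha_from)
    return math.sqrt(alpha_to) * x0_pred + math.sqrt(1.0 - alpha_to) * eps


def run(x, alphas, eps_fn):
    """Apply ddim_step along consecutive pairs of the given alpha sequence."""
    for alpha_from, alpha_to in zip(alphas[:-1], alphas[1:]):
        x = ddim_step(x, eps_fn(x, alpha_from), alpha_from, alpha_to)
    return x


# Made-up cumulative alphas, ordered from almost clean to very noisy.
ALPHAS = [0.9999, 0.9, 0.7, 0.5, 0.3, 0.1]


def constant_eps(x, alpha):
    return 0.25  # prediction independent of x: the round trip is exact


def toy_eps(x, alpha):
    return 0.1 * x  # prediction depends on x: the round trip is only approximate


def round_trip(x0, eps_fn):
    x_noisy = run(x0, ALPHAS, eps_fn)          # inversion: clean -> noise
    return run(x_noisy, ALPHAS[::-1], eps_fn)  # reconstruction: noise -> clean


exact = round_trip(1.0, constant_eps)
approx = round_trip(1.0, toy_eps)
print(f"state-independent eps: {exact:.6f}, state-dependent eps: {approx:.6f}")
```

With a state-independent prediction, each inverse step is undone exactly by the matching forward step. Once the prediction depends on the latent, the two directions evaluate it at different points, which is why the reconstruction is faithful but not bit-exact.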
Rotary positional embeddings (RoPE) in the 5B model provide position-dependent attention that is critical for maintaining temporal coherence during the inversion-reconstruction cycle.
Related Pages
- Implementation:Zai_org_CogVideo_DDIM_CogVideoXPipeline_From_Pretrained -- Implementation of pipeline loading
- Zai_org_CogVideo_Video_Loading_and_Preprocessing -- Previous step: video preprocessing
- Zai_org_CogVideo_Video_Encoding -- Next step: encoding video frames using the pipeline's VAE
- Zai_org_CogVideo_DDIM_Inversion -- Inversion step that uses the inverse scheduler