Principle:Zai org CogVideo DDIM Inversion
| Attribute | Value |
|---|---|
| Principle Name | DDIM Inversion |
| Workflow | Video Editing DDIM Inversion |
| Step | 4 of 6 |
| Type | Core Algorithm |
| Repository | zai-org/CogVideo |
| Paper | CogVideoX |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Technique for finding the noise-space representation of a video by reversing the DDIM denoising process. DDIM inversion maps clean video latents to their corresponding initial noise tensors, enabling structure-preserving video editing.
Description
DDIM inversion runs the denoising process in reverse: starting from clean video latents and progressively adding noise to find the initial noise tensor that would reconstruct the video. The process consists of:
- Unconditional inversion: The inversion is performed with an empty prompt (unconditional) using the `DDIMInverseScheduler`. This ensures the inversion captures only the structural information of the video, not any text-specific features.
- Trajectory storage: The inversion stores latents at each timestep, creating a trajectory `{x_0, x_1, ..., x_T}`. This trajectory is essential for attention injection during the subsequent reconstruction step.
- Noise-space mapping: The final latent `x_T` represents the noise that, when denoised with the original prompt, should approximately reconstruct the source video.
The trajectory encodes the structural information of the source video at each noise level, which is later used to guide the editing process.
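The inversion loop described above can be sketched as follows. This is a minimal NumPy illustration, not the repository's implementation: `eps_model` is a hypothetical stand-in for the unconditional noise prediction of the video transformer, and the linear beta schedule is only an example.

```python
import numpy as np

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    # Illustrative linear beta schedule; alpha_bar decreases as t grows.
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def eps_model(x_t, t):
    # Hypothetical stand-in for epsilon_theta(x_t, t) with an empty prompt;
    # the real predictor is the CogVideoX video transformer.
    rng = np.random.default_rng(t)
    return rng.standard_normal(x_t.shape).astype(x_t.dtype)

def ddim_invert(x0, alpha_bars):
    """Map clean latents x_0 toward noise x_T, storing the full trajectory."""
    trajectory = [x0]
    x = x0
    a_prev = 1.0  # a clean latent corresponds to alpha_bar = 1
    for t, a_next in enumerate(alpha_bars):
        eps = eps_model(x, t)
        # f_theta: predicted clean signal (x_0 prediction) at the current level.
        f = (x - np.sqrt(1.0 - a_prev) * eps) / np.sqrt(a_prev)
        # Inverse DDIM step: move to the next (noisier) level.
        x = np.sqrt(a_next) * f + np.sqrt(1.0 - a_next) * eps
        trajectory.append(x)
        a_prev = a_next
    return trajectory

# Toy latent with a video-like (B, C, frames, H, W) layout.
latents = np.zeros((1, 4, 2, 8, 8), dtype=np.float32)
alpha_bars = make_alpha_bars(50)
traj = ddim_invert(latents, alpha_bars)  # 51 latents: x_0 through x_T
```

Every intermediate latent is kept, since the reconstruction step reads the trajectory at its matching timestep rather than only the final noise tensor.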
Usage
Use DDIM Inversion after encoding video frames to latent space, and before prompted reconstruction. The inversion trajectory is passed as reference latents to the reconstruction step.
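The handoff between steps can be sketched as glue code. All function names below are hypothetical stand-ins used to show the data flow, not the repository's actual API.

```python
# Hypothetical orchestration: encoding -> inversion -> reconstruction.
def run_editing_workflow(frames, edit_prompt, encode, invert, reconstruct):
    latents = encode(frames)                      # previous step: frames -> latent space
    trajectory = invert(latents)                  # this step: DDIM inversion trajectory
    return reconstruct(trajectory, edit_prompt)   # next step: attention-injected reconstruction

# Toy stand-ins illustrating that the whole trajectory (not just x_T)
# is passed as reference latents to reconstruction.
edited = run_editing_workflow(
    frames="frames",
    edit_prompt="a snowy street",
    encode=lambda frames: "latents",
    invert=lambda latents: ["x_0", "x_1", "x_T"],
    reconstruct=lambda traj, prompt: (traj[-1], prompt),
)
```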
Theoretical Basis
DDIM is deterministic, so the reverse process x_0 -> x_T is well-defined:
Inverse DDIM step:
x_{t+1} = sqrt(alpha_{t+1}) * f_theta(x_t, t) + sqrt(1 - alpha_{t+1}) * epsilon_theta(x_t, t)
where:
- `f_theta(x_t, t)` is the predicted clean signal (x_0 prediction)
- `epsilon_theta(x_t, t)` is the predicted noise
- `alpha_t` is the noise schedule parameter at timestep `t`
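Because the step is deterministic, applying one inverse step and then the corresponding forward DDIM step with the same epsilon prediction recovers the original latent exactly. A small NumPy check, with illustrative alpha values and a random stand-in for the predicted noise:

```python
import numpy as np

rng = np.random.default_rng(0)
a_t, a_next = 0.9, 0.8                      # illustrative alpha_bar values (a_next is noisier)
x_t = rng.standard_normal((4, 4))
eps = rng.standard_normal((4, 4))           # stand-in for epsilon_theta(x_t, t)

# Inverse DDIM step t -> t+1:
f = (x_t - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)         # f_theta: x_0 prediction
x_next = np.sqrt(a_next) * f + np.sqrt(1 - a_next) * eps

# Forward DDIM step t+1 -> t with the same predictions:
f_back = (x_next - np.sqrt(1 - a_next) * eps) / np.sqrt(a_next)
x_t_rec = np.sqrt(a_t) * f_back + np.sqrt(1 - a_t) * eps  # equals x_t exactly
```

In practice the round trip is only approximate, because the model's epsilon prediction at `x_{t+1}` differs slightly from its prediction at `x_t`; the check above isolates the scheduler algebra from that approximation.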
The trajectory {x_0, x_1, ..., x_T} encodes the structural information of the source video at each noise level. During reconstruction with a new prompt:
- At timestep `t`, the reference latent `x_t` from inversion provides spatial layout information
- The new prompt provides semantic guidance for the edited content
- Attention injection blends these two sources of information
The unconditional inversion (empty prompt) is preferred because:
- It captures pure structural information without text bias
- The inversion is more stable without classifier-free guidance
- Text-specific features are only introduced during reconstruction
Related Pages
- Implementation:Zai_org_CogVideo_DDIM_Inversion_Sample -- Implementation of the DDIM inversion sampling function
- Zai_org_CogVideo_Video_Encoding -- Previous step: encoding video frames to latent space
- Zai_org_CogVideo_Prompted_Reconstruction -- Next step: reconstruction with attention injection
- Zai_org_CogVideo_DDIM_Pipeline_Loading -- Pipeline loading that provides the inverse scheduler