
Principle:Zai org CogVideo DDIM Inversion

From Leeroopedia


Principle Name: DDIM Inversion
Workflow: Video Editing DDIM Inversion
Step: 4 of 6
Type: Core Algorithm
Repository: zai-org/CogVideo
Paper: CogVideoX
Last Updated: 2026-02-10 00:00 GMT

Overview

Technique for finding the noise-space representation of a video by reversing the DDIM denoising process. DDIM inversion maps clean video latents to their corresponding initial noise tensors, enabling structure-preserving video editing.

Description

DDIM inversion runs the denoising process in reverse: starting from clean video latents and progressively adding noise to find the initial noise tensor that would reconstruct the video. The process consists of:

  1. Unconditional inversion: The inversion is performed with an empty prompt (unconditional) using the DDIMInverseScheduler. This ensures the inversion captures only the structural information of the video, not any text-specific features.
  2. Trajectory storage: The inversion stores latents at each timestep, creating a trajectory {x_0, x_1, ..., x_T}. This trajectory is essential for attention injection during the subsequent reconstruction step.
  3. Noise-space mapping: The final latent x_T represents the noise that, when denoised with the original prompt, should approximately reconstruct the source video.

The trajectory encodes the structural information of the source video at each noise level, which is later used to guide the editing process.
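The three steps above can be condensed into a toy sketch. `ddim_invert` and its `eps_model` argument are hypothetical stand-ins for the unconditional (empty-prompt) noise predictor and the DDIMInverseScheduler update, not CogVideo APIs; `alphas` holds the cumulative noise-schedule coefficients alpha_bar_t.

```python
import numpy as np

def ddim_invert(x0, alphas, eps_model):
    """Run the DDIM process in reverse: clean latent -> noise.

    Toy sketch. eps_model(x, t) stands in for the unconditional
    (empty-prompt) noise predictor; alphas[t] is alpha_bar_t.
    Returns the stored trajectory [x_0, x_1, ..., x_T].
    """
    trajectory = [x0]
    x = x0
    for t in range(len(alphas) - 1):
        a_t, a_next = alphas[t], alphas[t + 1]
        eps = eps_model(x, t)
        # Predicted clean signal f_theta(x_t, t)
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        # Inverse DDIM step: move one noise level up
        x = np.sqrt(a_next) * x0_pred + np.sqrt(1.0 - a_next) * eps
        trajectory.append(x)  # stored for later attention injection
    return trajectory
```

The returned list is exactly the trajectory {x_0, ..., x_T} described in step 2; the final element plays the role of x_T in step 3.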

Usage

Use DDIM Inversion after encoding video frames to latent space, and before prompted reconstruction. The inversion trajectory is passed as reference latents to the reconstruction step.

Theoretical Basis

Because DDIM sampling is deterministic (no fresh noise is injected at each step), the denoising update can be algebraically reversed, so the mapping x_0 -> x_T is well-defined:

Inverse DDIM step:

x_{t+1} = sqrt(alpha_{t+1}) * f_theta(x_t, t) + sqrt(1 - alpha_{t+1}) * epsilon_theta(x_t, t)

where:

  • f_theta(x_t, t) is the predicted clean signal (x_0 prediction)
  • epsilon_theta(x_t, t) is the predicted noise
  • alpha_t is the noise schedule parameter at timestep t
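This invertibility can be checked numerically: applying the inverse step and then the forward (denoising) DDIM step with the same predicted noise recovers x_t exactly. The schedule values below are illustrative, and the sketch assumes epsilon_theta returns identical noise at both endpoints, which is the approximation that real inversion relies on.

```python
import numpy as np

a_t, a_next = 0.8, 0.6  # illustrative alpha_bar at t and t+1
x_t = np.random.default_rng(0).normal(size=8)
eps = np.random.default_rng(1).normal(size=8)  # fixed "predicted" noise

# Inverse DDIM step: x_t -> x_{t+1}
f = (x_t - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)   # f_theta(x_t, t)
x_next = np.sqrt(a_next) * f + np.sqrt(1 - a_next) * eps

# Forward (denoising) DDIM step with the same eps: x_{t+1} -> x_t
f_back = (x_next - np.sqrt(1 - a_next) * eps) / np.sqrt(a_next)
x_t_rec = np.sqrt(a_t) * f_back + np.sqrt(1 - a_t) * eps

# With identical eps in both directions, the round trip is exact.
assert np.allclose(x_t, x_t_rec)
```

In practice epsilon_theta(x_t, t) and epsilon_theta(x_{t+1}, t+1) differ slightly, which is why inversion reconstructs the source video only approximately.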

The trajectory {x_0, x_1, ..., x_T} encodes the structural information of the source video at each noise level. During reconstruction with a new prompt:

  • At timestep t, the reference latent x_t from inversion provides spatial layout information
  • The new prompt provides semantic guidance for the edited content
  • Attention injection blends these two sources of information
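The blending above can be sketched with single-head attention on flattened tokens (hypothetical shapes; not the CogVideo implementation): queries come from the edit branch driven by the new prompt, while keys and values are injected from the stored trajectory latent at the same timestep, so the output inherits the source video's spatial layout.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(0)
q_edit = rng.normal(size=(4, 8))          # queries from the edit branch
k_ref = rng.normal(size=(4, 8))           # keys injected from inversion
v_ref = rng.normal(size=(4, 8))           # values injected from inversion
out = attention(q_edit, k_ref, v_ref)     # layout follows the reference
```

The design choice is that structure flows through keys/values (where to attend) while semantics flow through queries (what the new prompt asks for).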

The unconditional inversion (empty prompt) is preferred because:

  • It captures pure structural information without text bias
  • The inversion is more stable without classifier-free guidance
  • Text-specific features are only introduced during reconstruction
