
Principle:Zai org CogVideo DDIM Inversion

From Leeroopedia


Principle Name: DDIM Inversion
Workflow: Video Editing DDIM Inversion
Step: 4 of 6
Type: Core Algorithm
Repository: zai-org/CogVideo
Paper: CogVideoX
Last Updated: 2026-02-10 00:00 GMT

Overview

Technique for finding the noise-space representation of a video by reversing the DDIM denoising process. DDIM inversion maps clean video latents to their corresponding initial noise tensors, enabling structure-preserving video editing.

Description

DDIM inversion runs the denoising process in reverse: starting from clean video latents and progressively adding noise to find the initial noise tensor that would reconstruct the video. The process consists of:

  1. Unconditional inversion: The inversion is performed with an empty prompt (unconditional) using the DDIMInverseScheduler. This ensures the inversion captures only the structural information of the video, not any text-specific features.
  2. Trajectory storage: The inversion stores latents at each timestep, creating a trajectory {x_0, x_1, ..., x_T}. This trajectory is essential for attention injection during the subsequent reconstruction step.
  3. Noise-space mapping: The final latent x_T represents the noise that, when denoised with the original prompt, should approximately reconstruct the source video.

The trajectory encodes the structural information of the source video at each noise level, which is later used to guide the editing process.
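The three steps above can be condensed into a toy sketch. `ddim_invert` and its `eps_model` argument are hypothetical stand-ins for the unconditional (empty-prompt) noise predictor and the DDIMInverseScheduler update, not CogVideo APIs; `alphas` holds the cumulative noise-schedule coefficients alpha_bar_t.

```python
import numpy as np

def ddim_invert(x0, alphas, eps_model):
    """Run the DDIM process in reverse: clean latent -> noise.

    Toy sketch. eps_model(x, t) stands in for the unconditional
    (empty-prompt) noise predictor; alphas[t] is alpha_bar_t.
    Returns the stored trajectory [x_0, x_1, ..., x_T].
    """
    trajectory = [x0]
    x = x0
    for t in range(len(alphas) - 1):
        a_t, a_next = alphas[t], alphas[t + 1]
        eps = eps_model(x, t)
        # Predicted clean signal f_theta(x_t, t)
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        # Inverse DDIM step: move one noise level up
        x = np.sqrt(a_next) * x0_pred + np.sqrt(1.0 - a_next) * eps
        trajectory.append(x)  # stored for later attention injection
    return trajectory
```

The returned list is exactly the trajectory {x_0, ..., x_T} described in step 2; the final element plays the role of x_T in step 3.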

Usage

Use DDIM Inversion after encoding video frames to latent space, and before prompted reconstruction. The inversion trajectory is passed as reference latents to the reconstruction step.

Theoretical Basis

Because DDIM sampling is deterministic (no fresh noise is injected at each step), the denoising update can be algebraically reversed, so the mapping x_0 -> x_T is well-defined:

Inverse DDIM step:

x_{t+1} = sqrt(alpha_{t+1}) * f_theta(x_t, t) + sqrt(1 - alpha_{t+1}) * epsilon_theta(x_t, t)

where:

  • f_theta(x_t, t) is the predicted clean signal (x_0 prediction)
  • epsilon_theta(x_t, t) is the predicted noise
  • alpha_t is the noise schedule parameter at timestep t
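This invertibility can be checked numerically: applying the inverse step and then the forward (denoising) DDIM step with the same predicted noise recovers x_t exactly. The schedule values below are illustrative, and the sketch assumes epsilon_theta returns identical noise at both endpoints, which is the approximation that real inversion relies on.

```python
import numpy as np

a_t, a_next = 0.8, 0.6  # illustrative alpha_bar at t and t+1
x_t = np.random.default_rng(0).normal(size=8)
eps = np.random.default_rng(1).normal(size=8)  # fixed "predicted" noise

# Inverse DDIM step: x_t -> x_{t+1}
f = (x_t - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)   # f_theta(x_t, t)
x_next = np.sqrt(a_next) * f + np.sqrt(1 - a_next) * eps

# Forward (denoising) DDIM step with the same eps: x_{t+1} -> x_t
f_back = (x_next - np.sqrt(1 - a_next) * eps) / np.sqrt(a_next)
x_t_rec = np.sqrt(a_t) * f_back + np.sqrt(1 - a_t) * eps

# With identical eps in both directions, the round trip is exact.
assert np.allclose(x_t, x_t_rec)
```

In practice epsilon_theta(x_t, t) and epsilon_theta(x_{t+1}, t+1) differ slightly, which is why inversion reconstructs the source video only approximately.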

The trajectory {x_0, x_1, ..., x_T} encodes the structural information of the source video at each noise level. During reconstruction with a new prompt:

  • At timestep t, the reference latent x_t from inversion provides spatial layout information
  • The new prompt provides semantic guidance for the edited content
  • Attention injection blends these two sources of information
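The blending above can be sketched with single-head attention on flattened tokens (hypothetical shapes; not the CogVideo implementation): queries come from the edit branch driven by the new prompt, while keys and values are injected from the stored trajectory latent at the same timestep, so the output inherits the source video's spatial layout.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(0)
q_edit = rng.normal(size=(4, 8))          # queries from the edit branch
k_ref = rng.normal(size=(4, 8))           # keys injected from inversion
v_ref = rng.normal(size=(4, 8))           # values injected from inversion
out = attention(q_edit, k_ref, v_ref)     # layout follows the reference
```

The design choice is that structure flows through keys/values (where to attend) while semantics flow through queries (what the new prompt asks for).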

The unconditional inversion (empty prompt) is preferred because:

  • It captures pure structural information without text bias
  • The inversion is more stable without classifier-free guidance
  • Text-specific features are only introduced during reconstruction
