Principle: zai-org/CogVideo Prompted Reconstruction

From Leeroopedia


Attribute Value
Principle Name Prompted Reconstruction
Workflow Video Editing DDIM Inversion
Step 5 of 6
Type Core Algorithm
Repository zai-org/CogVideo
Paper CogVideoX
Last Updated 2026-02-10 00:00 GMT

Overview

Technique for reconstructing a video with edited content by combining DDIM denoising with attention injection from the source video's inversion trajectory. Prompted reconstruction enables structure-preserving video editing by blending spatial layout from the source video with semantic content from a new text prompt.

Description

Prompted reconstruction generates a new video from the edit prompt while maintaining structural consistency with the source video. This is achieved through two mechanisms:

  1. Custom attention processor (CogVideoXAttnProcessor2_0ForDDIMInversion): A modified attention processor that injects reference attention features from the source video's inversion trajectory. At each denoising step, the processor blends attention keys and values from the reference trajectory with those from the current generation.
  2. Forward DDIM sampling with edit prompt: The standard DDIM forward sampling is run with the new edit prompt, but with the custom attention processor active. This allows the model to generate content guided by the edit prompt while being structurally constrained by the reference attention.
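The blending in mechanism 1 can be sketched as follows. This is an illustrative assumption, not the exact logic of CogVideoXAttnProcessor2_0ForDDIMInversion: here the reference keys and values are concatenated with the current ones along the sequence axis, one plausible way to let queries attend to both sources.

```python
import torch
import torch.nn.functional as F

def blend_kv_attention(q, k, v, k_ref, v_ref):
    """Attend over a blend of current and reference keys/values.

    k_ref/v_ref come from the source video's inversion trajectory at
    the same timestep. Concatenating them along the sequence axis lets
    the current queries retrieve features from the source layout as
    well as from the ongoing generation. (Sketch only; the repository's
    processor may blend differently, e.g. by replacement or weighting.)
    """
    k_mix = torch.cat([k, k_ref], dim=-2)  # (batch, heads, 2*seq, dim)
    v_mix = torch.cat([v, v_ref], dim=-2)
    return F.scaled_dot_product_attention(q, k_mix, v_mix)
```

Note that when the reference equals the current K/V, concatenation reduces to standard attention, so the injection only perturbs the result to the extent that the source trajectory diverges from the generation.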

The OverrideAttnProcessors context manager temporarily replaces the attention processors in all transformer blocks during reconstruction, restoring the originals when complete.
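A minimal sketch of that pattern, assuming a hypothetical interface in which each attention module exposes a `.processor` attribute (the real OverrideAttnProcessors operates on the transformer's blocks):

```python
from contextlib import contextmanager

@contextmanager
def override_attn_processors(blocks, make_processor):
    """Temporarily swap each block's attention processor.

    `blocks` maps names to attention modules with a `.processor`
    attribute; `make_processor` builds the replacement processor.
    The originals are restored in the `finally` clause, so they come
    back even if reconstruction raises an exception.
    """
    originals = {name: m.processor for name, m in blocks.items()}
    try:
        for m in blocks.values():
            m.processor = make_processor()
        yield
    finally:
        for name, m in blocks.items():
            m.processor = originals[name]
```

The context-manager form keeps the swap scoped: the DDIM-inversion processors exist only for the duration of the `with` block.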

The reconstruction process:

  1. Replace attention processors with the DDIM inversion variant
  2. Initialize from random noise (or from the final inversion latent)
  3. Run forward DDIM sampling with the edit prompt
  4. At each step, inject reference attention from the reversed inversion trajectory
  5. Restore original attention processors
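The denoising update applied repeatedly in step 3 is the deterministic forward DDIM step (eta = 0). A sketch, with the noise prediction `eps_pred` assumed to come from the transformer running under the custom attention processor:

```python
import torch

def ddim_step(x_t, eps_pred, alpha_t, alpha_prev):
    """One deterministic DDIM denoising update (eta = 0).

    x_t:        current latent at timestep t
    eps_pred:   noise predicted by the model at timestep t
    alpha_t,
    alpha_prev: cumulative signal coefficients (alpha-bar) at the
                current and previous timesteps
    """
    # Clean-latent estimate implied by the current noise prediction.
    x0_pred = (x_t - (1 - alpha_t) ** 0.5 * eps_pred) / alpha_t ** 0.5
    # Re-noise the estimate to the previous (less noisy) timestep.
    return alpha_prev ** 0.5 * x0_pred + (1 - alpha_prev) ** 0.5 * eps_pred
```

Because the step is deterministic, initializing from the final inversion latent (step 2) makes reconstruction with the original prompt approximately recover the source video; switching to the edit prompt changes content while the injected attention holds the layout.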

Usage

Use Prompted Reconstruction after DDIM inversion has produced the inversion trajectory. The reversed trajectory is passed as reference latents to the sample function, with the custom attention processors active.
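Concretely, the inversion trajectory runs from clean to noisy, so it must be reversed to align with the denoising schedule before being passed as reference latents. A sketch with illustrative placeholder names:

```python
# Hypothetical trajectory from DDIM inversion: index 0 is the clean
# source latent, the last entry is the fully inverted (noisy) latent.
inversion_trajectory = ["z0", "z1", "z2", "z3"]

# Forward sampling denoises from noisy to clean, so the reference
# latents must be consumed in the opposite order: step i of
# denoising reads the trajectory entry at the matching noise level.
reference_latents = list(reversed(inversion_trajectory))

# The noisiest inversion latent can also serve as the starting
# latent for reconstruction, instead of random noise.
start_latent = inversion_trajectory[-1]
```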

Theoretical Basis

Attention injection preserves spatial structure during editing. During denoising step t, the attention mechanism operates as follows:

Standard attention:

Attention(Q, K, V) = softmax(Q * K^T / sqrt(d)) * V

Attention with injection:

The reference latent at timestep t from the inversion trajectory provides keys K_ref and values V_ref that encode the spatial layout of the source video at that noise level.
The current generation's queries Q attend to a blend of current and reference key-value pairs.
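Written in the notation above, with [· ; ·] denoting concatenation along the sequence axis, one plausible form of the blended attention (the exact blending scheme may differ) is:

Attention_inj(Q) = softmax(Q * [K ; K_ref]^T / sqrt(d)) * [V ; V_ref]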

This constrains the edit to modify only semantic content (guided by the new prompt) while preserving spatial layout (from the reference attention). The mechanism works because:

  • Keys act as addresses: they index what content occupies each spatial position, so reference keys anchor attention to the source layout
  • Values carry the content that attention retrieves from each position
  • Queries from the current generation match the same spatial positions as the source, but carry new semantic meaning from the edit prompt

The OverrideAttnProcessors pattern uses Python context managers for clean resource management, ensuring that the original attention processors are always restored even if an exception occurs during reconstruction.
