Principle:Zai org CogVideo Prompted Reconstruction
| Attribute | Value |
|---|---|
| Principle Name | Prompted Reconstruction |
| Workflow | Video Editing DDIM Inversion |
| Step | 5 of 6 |
| Type | Core Algorithm |
| Repository | zai-org/CogVideo |
| Paper | CogVideoX |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Technique for reconstructing a video with edited content by combining DDIM denoising with attention injection from the source video's inversion trajectory. Prompted reconstruction enables structure-preserving video editing by blending spatial layout from the source video with semantic content from a new text prompt.
Description
Prompted reconstruction generates a new video from the edit prompt while maintaining structural consistency with the source video. This is achieved through two mechanisms:
- Custom attention processor (`CogVideoXAttnProcessor2_0ForDDIMInversion`): a modified attention processor that injects reference attention features from the source video's inversion trajectory. At each denoising step, the processor blends attention keys and values from the reference trajectory with those of the current generation.
- Forward DDIM sampling with the edit prompt: standard DDIM forward sampling is run with the new edit prompt, but with the custom attention processor active. This lets the model generate content guided by the edit prompt while remaining structurally constrained by the reference attention.
The `OverrideAttnProcessors` context manager temporarily replaces the attention processors in all transformer blocks during reconstruction and restores the originals when complete.
The reconstruction process:
- Replace attention processors with the DDIM inversion variant
- Initialize from random noise (or from the final inversion latent)
- Run forward DDIM sampling with the edit prompt
- At each step, inject reference attention from the reversed inversion trajectory
- Restore original attention processors
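The five steps above can be sketched end to end. Every name here (`override_attn_processors`, `denoise_step`, the dict-based stand-in for the transformer) is an illustrative assumption rather than the repository's API:

```python
from contextlib import contextmanager

@contextmanager
def override_attn_processors(transformer):
    """Swap in the DDIM-inversion processors; always restore the originals."""
    originals = dict(transformer["processors"])
    transformer["processors"] = {name: "ddim_inversion" for name in originals}
    try:
        yield
    finally:
        transformer["processors"] = originals  # restored even on exception

def prompted_reconstruction(denoise_step, edit_prompt, inversion_trajectory,
                            transformer):
    """Run forward DDIM sampling with the edit prompt, injecting reference
    attention from the reversed inversion trajectory at each step."""
    reference = list(reversed(inversion_trajectory))  # noisiest latent first
    latent = reference[0]  # initialize from the final inversion latent
    with override_attn_processors(transformer):
        for ref_latent in reference:
            # The reference latent supplies K_ref/V_ref for injection.
            latent = denoise_step(latent, edit_prompt, ref_latent)
    return latent
```

The `try`/`finally` inside the context manager is what guarantees step 5 (restoration) runs even if denoising raises.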
Usage
Use Prompted Reconstruction after DDIM inversion has produced the inversion trajectory. The reversed trajectory is passed as reference latents to the sample function, with the custom attention processors active.
Theoretical Basis
Attention injection preserves spatial structure during editing. During denoising step t, the attention mechanism operates as follows:
Standard attention:
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d)) * V
Attention with injection:
- The reference latent at timestep `t` from the inversion trajectory provides keys `K_ref` and values `V_ref` that encode the spatial layout of the source video at that noise level.
- The current generation's queries `Q` attend to a blend of current and reference key-value pairs.
This constrains the edit to modify only semantic content (guided by the new prompt) while preserving spatial layout (from the reference attention). The mechanism works because:
- Keys encode what content is at each spatial position
- Values encode what should be retrieved from each position
- Queries from the current generation seek the same spatial positions as the source, but with new semantic meaning from the edit prompt
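A toy NumPy check of this reasoning: when the reference queries and keys fully determine the attention map, each position retrieves the *new* values, but routed through the *source's* spatial layout. The shapes, the near-one-hot keys, and the scale factor are illustrative assumptions:

```python
import numpy as np

def _softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Reference queries/keys encode a fixed spatial layout: position i matches i.
scale = 100.0                 # sharpen softmax toward a one-hot attention map
q_ref = scale * np.eye(4)
k_ref = np.eye(4)

# New values carry the edit prompt's semantics (arbitrary new content here).
v_new = np.array([[10.0], [20.0], [30.0], [40.0]])

attn = _softmax(q_ref @ k_ref.T / np.sqrt(4))  # ~identity: layout preserved
out = attn @ v_new                             # new content at old positions
```

Because the attention map is (approximately) the identity fixed by the reference layout, `out` matches `v_new` row for row: the spatial routing comes from the source, the content from the edit.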
The `OverrideAttnProcessors` pattern uses a Python context manager for clean resource management, ensuring that the original attention processors are restored even if an exception occurs during reconstruction.
Related Pages
- Implementation:Zai_org_CogVideo_DDIM_Attention_Injection_Reconstruction -- Implementation of attention injection and reconstruction
- Zai_org_CogVideo_DDIM_Inversion -- Previous step: DDIM inversion that produces the reference trajectory
- Zai_org_CogVideo_DDIM_Video_Export -- Next step: exporting the reconstructed video
- Zai_org_CogVideo_DDIM_Pipeline_Loading -- Pipeline providing the transformer whose attention is modified