Principle:Zai org CogVideo Prompted Reconstruction
| Attribute | Value |
|---|---|
| Principle Name | Prompted Reconstruction |
| Workflow | Video Editing DDIM Inversion |
| Step | 5 of 6 |
| Type | Core Algorithm |
| Repository | zai-org/CogVideo |
| Paper | CogVideoX |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Technique for reconstructing a video with edited content by combining DDIM denoising with attention injection from the source video's inversion trajectory. Prompted reconstruction enables structure-preserving video editing by blending spatial layout from the source video with semantic content from a new text prompt.
Description
Prompted reconstruction generates a new video from the edit prompt while maintaining structural consistency with the source video. This is achieved through two mechanisms:
- Custom attention processor (`CogVideoXAttnProcessor2_0ForDDIMInversion`): a modified attention processor that injects reference attention features from the source video's inversion trajectory. At each denoising step, the processor blends attention keys and values from the reference trajectory with those of the current generation.
- Forward DDIM sampling with the edit prompt: standard DDIM forward sampling is run with the new edit prompt, but with the custom attention processor active. This lets the model generate content guided by the edit prompt while remaining structurally constrained by the reference attention.
The `OverrideAttnProcessors` context manager temporarily replaces the attention processors in all transformer blocks during reconstruction and restores the originals when complete.
The reconstruction process:
- Replace attention processors with the DDIM inversion variant
- Initialize from random noise (or from the final inversion latent)
- Run forward DDIM sampling with the edit prompt
- At each step, inject reference attention from the reversed inversion trajectory
- Restore original attention processors
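The five steps above can be sketched end to end. Every name here (`override_attn_processors`, `denoise_step`, the dict-based stand-in for the transformer) is an illustrative assumption rather than the repository's API:

```python
from contextlib import contextmanager

@contextmanager
def override_attn_processors(transformer):
    """Swap in the DDIM-inversion processors; always restore the originals."""
    originals = dict(transformer["processors"])
    transformer["processors"] = {name: "ddim_inversion" for name in originals}
    try:
        yield
    finally:
        transformer["processors"] = originals  # restored even on exception

def prompted_reconstruction(denoise_step, edit_prompt, inversion_trajectory,
                            transformer):
    """Run forward DDIM sampling with the edit prompt, injecting reference
    attention from the reversed inversion trajectory at each step."""
    reference = list(reversed(inversion_trajectory))  # noisiest latent first
    latent = reference[0]  # initialize from the final inversion latent
    with override_attn_processors(transformer):
        for ref_latent in reference:
            # The reference latent supplies K_ref/V_ref for injection.
            latent = denoise_step(latent, edit_prompt, ref_latent)
    return latent
```

The `try`/`finally` inside the context manager is what guarantees step 5 (restoration) runs even if denoising raises.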
Usage
Use Prompted Reconstruction after DDIM inversion has produced the inversion trajectory. The reversed trajectory is passed as reference latents to the sample function, with the custom attention processors active.
Theoretical Basis
Attention injection preserves spatial structure during editing. During denoising step t, the attention mechanism operates as follows:
Standard attention:
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d)) * V
Attention with injection:
- The reference latent at timestep `t` from the inversion trajectory provides keys `K_ref` and values `V_ref` that encode the spatial layout of the source video at that noise level.
- The current generation's queries `Q` attend to a blend of current and reference key-value pairs.
This constrains the edit to modify only semantic content (guided by the new prompt) while preserving spatial layout (from the reference attention). The mechanism works because:
- Keys encode what content is at each spatial position
- Values encode what should be retrieved from each position
- Queries from the current generation seek the same spatial positions as the source, but with new semantic meaning from the edit prompt
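A toy NumPy check of this reasoning: when the reference queries and keys fully determine the attention map, each position retrieves the *new* values, but routed through the *source's* spatial layout. The shapes, the near-one-hot keys, and the scale factor are illustrative assumptions:

```python
import numpy as np

def _softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Reference queries/keys encode a fixed spatial layout: position i matches i.
scale = 100.0                 # sharpen softmax toward a one-hot attention map
q_ref = scale * np.eye(4)
k_ref = np.eye(4)

# New values carry the edit prompt's semantics (arbitrary new content here).
v_new = np.array([[10.0], [20.0], [30.0], [40.0]])

attn = _softmax(q_ref @ k_ref.T / np.sqrt(4))  # ~identity: layout preserved
out = attn @ v_new                             # new content at old positions
```

Because the attention map is (approximately) the identity fixed by the reference layout, `out` matches `v_new` row for row: the spatial routing comes from the source, the content from the edit.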
The `OverrideAttnProcessors` pattern uses a Python context manager for clean resource management, ensuring that the original attention processors are restored even if an exception occurs during reconstruction.
Related Pages
- Implementation:Zai_org_CogVideo_DDIM_Attention_Injection_Reconstruction -- Implementation of attention injection and reconstruction
- Zai_org_CogVideo_DDIM_Inversion -- Previous step: DDIM inversion that produces the reference trajectory
- Zai_org_CogVideo_DDIM_Video_Export -- Next step: exporting the reconstructed video
- Zai_org_CogVideo_DDIM_Pipeline_Loading -- Pipeline providing the transformer whose attention is modified