Implementation:Zai org CogVideo DDIM Attention Injection Reconstruction
| Attribute | Value |
|---|---|
| Implementation Name | DDIM Attention Injection Reconstruction |
| Workflow | Video Editing DDIM Inversion |
| Step | 5 of 6 |
| Type | API Doc |
| Source File | inference/ddim_inversion.py:L118-243, inference/ddim_inversion.py:L246-260, inference/ddim_inversion.py:L499-509 |
| Repository | zai-org/CogVideo |
| External Dependencies | diffusers (CogVideoXAttnProcessor2_0, CogVideoXBlock, CogVideoXTransformer3DModel), torch.nn.functional |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Implementation of the prompted reconstruction step in the DDIM inversion video editing pipeline. This includes the custom attention processor (CogVideoXAttnProcessor2_0ForDDIMInversion), the context manager for attention processor replacement (OverrideAttnProcessors), and the reconstruction call that combines these components.
Description
Three components work together for prompted reconstruction:
- CogVideoXAttnProcessor2_0ForDDIMInversion (L118-243): Extends the standard CogVideoXAttnProcessor2_0 to inject reference attention features from the source video's inversion trajectory. During each attention computation, it blends reference keys/values with the current keys/values.
- OverrideAttnProcessors (L246-260): A Python context manager that temporarily replaces all attention processors in the transformer with the DDIM inversion variant. On entry, it swaps in the new processors; on exit, it restores the originals.
- Reconstruction call (L499-509): Uses the context manager and calls the sample function with the forward scheduler, the edit prompt, and the reversed inversion trajectory as reference latents.
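The key/value blending described above can be illustrated with a framework-free sketch. Note the blending strategy here (concatenating reference keys/values along the sequence axis so queries can attend to source-trajectory features) is an assumption for illustration; the actual processor may combine them differently. All names and shapes below are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attn_with_reference(q, k, v, k_ref, v_ref):
    # Assumed blending: concatenate reference keys/values with the current
    # ones so the query can also attend to source-trajectory features.
    k_all = np.concatenate([k, k_ref], axis=0)
    v_all = np.concatenate([v, v_ref], axis=0)
    scores = q @ k_all.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v_all

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
k_ref, v_ref = (rng.standard_normal((4, 8)) for _ in range(2))
out = attn_with_reference(q, k, v, k_ref, v_ref)
print(out.shape)  # (4, 8) -- same shape as plain attention over q, k, v
```

The output shape matches ordinary attention; only the set of keys/values the query attends over is enlarged.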
Usage
```python
from inference.ddim_inversion import OverrideAttnProcessors, sample

with OverrideAttnProcessors(pipe.transformer):
    reconstruction_trajectory = sample(
        pipeline=pipe,
        latents=torch.randn_like(latents),
        scheduler=pipe.scheduler,
        prompt=edit_prompt,
        reference_latents=reversed_inversion_trajectory,
    )
```
Code Reference
Source Location
| File | Lines | Description |
|---|---|---|
| inference/ddim_inversion.py | L118-243 | CogVideoXAttnProcessor2_0ForDDIMInversion class |
| inference/ddim_inversion.py | L246-260 | OverrideAttnProcessors context manager |
| inference/ddim_inversion.py | L499-509 | Reconstruction call site |
Signature
```python
class CogVideoXAttnProcessor2_0ForDDIMInversion(CogVideoXAttnProcessor2_0):
    """Custom attention processor that injects reference attention features."""

class OverrideAttnProcessors:
    """Context manager to temporarily replace attention processors."""
    def __init__(self, transformer: CogVideoXTransformer3DModel): ...

# Usage:
with OverrideAttnProcessors(pipe.transformer):
    reconstruction_trajectory = sample(
        pipeline=pipe,
        latents=torch.randn_like(latents),
        scheduler=pipe.scheduler,  # CogVideoXDDIMScheduler
        prompt=edit_prompt,
        reference_latents=reversed_inversion_trajectory,
    )
```
Import
```python
from inference.ddim_inversion import (
    CogVideoXAttnProcessor2_0ForDDIMInversion,
    OverrideAttnProcessors,
    sample,
)
```
I/O Contract
Inputs
CogVideoXAttnProcessor2_0ForDDIMInversion
| Parameter | Type | Default | Description |
|---|---|---|---|
| Inherits from CogVideoXAttnProcessor2_0 | -- | -- | All standard attention processor inputs (hidden_states, encoder_hidden_states, attention_mask, etc.) |
| Reference features | torch.FloatTensor | Via reference_latents | Attention keys/values from the inversion trajectory at the current timestep |
OverrideAttnProcessors
| Parameter | Type | Default | Description |
|---|---|---|---|
| transformer | CogVideoXTransformer3DModel | Required | The pipeline's transformer model whose attention processors will be replaced |
Reconstruction call
| Parameter | Type | Default | Description |
|---|---|---|---|
| pipeline | CogVideoXPipeline | Required | Loaded CogVideoX pipeline |
| latents | torch.FloatTensor | Required | Random noise tensor (same shape as the encoded video latents) |
| scheduler | CogVideoXDDIMScheduler | Required | Forward DDIM scheduler |
| prompt | str | Required | Edit prompt describing the desired output |
| reference_latents | torch.FloatTensor | Required | Reversed inversion trajectory of shape [num_steps, B, T, C, H', W'] |
Outputs
| Output | Type | Description |
|---|---|---|
| reconstruction_trajectory | torch.FloatTensor | Reconstruction trajectory of shape [num_steps, B, T, C, H', W']; the final step contains the edited video latents |
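The trajectory layout can be made concrete with a small sketch. The sizes below are hypothetical and chosen only to keep the example tiny; real CogVideoX latents are far larger.

```python
import numpy as np

# Hypothetical, tiny sizes purely to illustrate the trajectory layout.
num_steps, B, T, C, Hp, Wp = 5, 1, 2, 4, 6, 9
trajectory = np.zeros((num_steps, B, T, C, Hp, Wp), dtype=np.float32)

final_latents = trajectory[-1]    # last step: the edited video latents
reversed_traj = trajectory[::-1]  # step order reversed, as passed to reference_latents

print(final_latents.shape)  # (1, 2, 4, 6, 9)
print(reversed_traj.shape)  # (5, 1, 2, 4, 6, 9)
```

Indexing the final step (`trajectory[-1]`) mirrors the `reconstruction[-1]` used when exporting the edited video.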
Usage Examples
Example 1: Full video editing pipeline
```python
import torch
from diffusers import CogVideoXPipeline, CogVideoXDDIMScheduler, DDIMInverseScheduler
from inference.ddim_inversion import (
    get_video_frames, encode_video_frames, sample,
    OverrideAttnProcessors, export_latents_to_video,
)

# Load pipeline
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
).to("cuda")

# Prepare schedulers
inverse_scheduler = DDIMInverseScheduler.from_config(pipe.scheduler.config)
forward_scheduler = CogVideoXDDIMScheduler.from_config(pipe.scheduler.config)

# Load and encode the source video
video_frames = get_video_frames("input.mp4")
latents = encode_video_frames(pipe.vae, video_frames)

# Step 1: DDIM inversion of the source video
inversion_trajectory = sample(
    pipe, latents, inverse_scheduler, prompt="", num_inference_steps=50
)

# Step 2: prompted reconstruction with the edit
reversed_trajectory = inversion_trajectory.flip(0)
with OverrideAttnProcessors(pipe.transformer):
    reconstruction = sample(
        pipe,
        torch.randn_like(latents),
        forward_scheduler,
        prompt="A dog playing in snow",
        reference_latents=reversed_trajectory,
        num_inference_steps=50,
    )

# Export the edited video
export_latents_to_video(pipe, reconstruction[-1], "edited_output.mp4")
```
Example 2: Using the context manager pattern
```python
# The OverrideAttnProcessors context manager ensures the
# original processors are restored after reconstruction.
with OverrideAttnProcessors(pipe.transformer):
    # Inside: attention processors are replaced with the DDIM inversion variants
    result = sample(pipe, noise, forward_scheduler, "new prompt",
                    reference_latents=ref)
# Outside: original processors are restored;
# the pipeline can be used normally for other tasks.
```
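The swap-and-restore pattern the context manager relies on can be sketched without any diffusers dependency. Every name below (DummyBlock, InjectingProcessor, OverrideProcessors) is a stand-in invented for illustration; the real implementation operates on the transformer's attention modules.

```python
class InjectingProcessor:
    """Stand-in for the DDIM inversion attention processor variant."""

class DummyBlock:
    """Stand-in for a transformer block holding an attention processor."""
    def __init__(self):
        self.processor = "default"  # stand-in for the stock processor

class OverrideProcessors:
    """Minimal sketch of the save / swap / restore pattern."""
    def __init__(self, blocks):
        self.blocks = blocks
        self._saved = {}

    def __enter__(self):
        # On entry: remember each original processor, then swap in the variant.
        for i, block in enumerate(self.blocks):
            self._saved[i] = block.processor
            block.processor = InjectingProcessor()
        return self

    def __exit__(self, *exc):
        # On exit: restore the originals, even if an exception was raised.
        for i, block in enumerate(self.blocks):
            block.processor = self._saved[i]

blocks = [DummyBlock(), DummyBlock()]
with OverrideProcessors(blocks):
    inside = [type(b.processor).__name__ for b in blocks]
outside = [b.processor for b in blocks]
print(inside, outside)  # processors swapped inside, restored outside
```

Because restoration happens in `__exit__`, the originals come back even when the body raises, which is why the pipeline remains usable for other tasks afterwards.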
Related Pages
- Principle:Zai_org_CogVideo_Prompted_Reconstruction -- Principle governing attention injection and prompted reconstruction
- Environment:Zai_org_CogVideo_Diffusers_Inference_Environment
- Zai_org_CogVideo_DDIM_Inversion_Sample -- Previous step: DDIM inversion producing the reference trajectory
- Zai_org_CogVideo_DDIM_Export_Latents_To_Video -- Next step: exporting the edited video
- Zai_org_CogVideo_DDIM_CogVideoXPipeline_From_Pretrained -- Pipeline loading