Implementation:Zai org CogVideo DDIM Attention Injection Reconstruction
| Attribute | Value |
|---|---|
| Implementation Name | DDIM Attention Injection Reconstruction |
| Workflow | Video Editing DDIM Inversion |
| Step | 5 of 6 |
| Type | API Doc |
| Source File | inference/ddim_inversion.py:L118-243, inference/ddim_inversion.py:L246-260, inference/ddim_inversion.py:L499-509 |
| Repository | zai-org/CogVideo |
| External Dependencies | diffusers (CogVideoXAttnProcessor2_0, CogVideoXBlock, CogVideoXTransformer3DModel), torch.nn.functional |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Implementation of the prompted reconstruction step in the DDIM inversion video editing pipeline. This includes the custom attention processor (CogVideoXAttnProcessor2_0ForDDIMInversion), the context manager for attention processor replacement (OverrideAttnProcessors), and the reconstruction call that combines these components.
Description
Three components work together for prompted reconstruction:
- CogVideoXAttnProcessor2_0ForDDIMInversion (L118-243): Extends the standard CogVideoXAttnProcessor2_0 to inject reference attention features from the source video's inversion trajectory. During each attention computation, it blends reference keys/values with the current keys/values.
- OverrideAttnProcessors (L246-260): A Python context manager that temporarily replaces all attention processors in the transformer with the DDIM inversion variant. On entry, it swaps in the new processors; on exit, it restores the originals.
- Reconstruction call (L499-509): Uses the context manager and calls the sample function with the forward scheduler, the edit prompt, and the reversed inversion trajectory as reference latents.
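The key/value blending described above can be illustrated with a framework-free sketch. Note the blending strategy here (concatenating reference keys/values along the sequence axis so queries can attend to source-trajectory features) is an assumption for illustration; the actual processor may combine them differently. All names and shapes below are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attn_with_reference(q, k, v, k_ref, v_ref):
    # Assumed blending: concatenate reference keys/values with the current
    # ones so the query can also attend to source-trajectory features.
    k_all = np.concatenate([k, k_ref], axis=0)
    v_all = np.concatenate([v, v_ref], axis=0)
    scores = q @ k_all.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v_all

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
k_ref, v_ref = (rng.standard_normal((4, 8)) for _ in range(2))
out = attn_with_reference(q, k, v, k_ref, v_ref)
print(out.shape)  # (4, 8) -- same shape as plain attention over q, k, v
```

The output shape matches ordinary attention; only the set of keys/values the query attends over is enlarged.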
Usage
```python
from inference.ddim_inversion import OverrideAttnProcessors, sample

with OverrideAttnProcessors(pipe.transformer):
    reconstruction_trajectory = sample(
        pipeline=pipe,
        latents=torch.randn_like(latents),
        scheduler=pipe.scheduler,
        prompt=edit_prompt,
        reference_latents=reversed_inversion_trajectory,
    )
```
Code Reference
Source Location
| File | Lines | Description |
|---|---|---|
| inference/ddim_inversion.py | L118-243 | CogVideoXAttnProcessor2_0ForDDIMInversion class |
| inference/ddim_inversion.py | L246-260 | OverrideAttnProcessors context manager |
| inference/ddim_inversion.py | L499-509 | Reconstruction call site |
Signature
```python
class CogVideoXAttnProcessor2_0ForDDIMInversion(CogVideoXAttnProcessor2_0):
    """Custom attention processor that injects reference attention features."""

class OverrideAttnProcessors:
    """Context manager to temporarily replace attention processors."""
    def __init__(self, transformer: CogVideoXTransformer3DModel): ...

# Usage:
with OverrideAttnProcessors(pipe.transformer):
    reconstruction_trajectory = sample(
        pipeline=pipe,
        latents=torch.randn_like(latents),
        scheduler=pipe.scheduler,  # CogVideoXDDIMScheduler
        prompt=edit_prompt,
        reference_latents=reversed_inversion_trajectory,
    )
```
Import
```python
from inference.ddim_inversion import (
    CogVideoXAttnProcessor2_0ForDDIMInversion,
    OverrideAttnProcessors,
    sample,
)
```
I/O Contract
Inputs
CogVideoXAttnProcessor2_0ForDDIMInversion
| Parameter | Type | Default | Description |
|---|---|---|---|
| Inherits from CogVideoXAttnProcessor2_0 | -- | -- | All standard attention processor inputs (hidden_states, encoder_hidden_states, attention_mask, etc.) |
| Reference features | torch.FloatTensor | Via reference_latents | Attention keys/values from the inversion trajectory at the current timestep |
OverrideAttnProcessors
| Parameter | Type | Default | Description |
|---|---|---|---|
| transformer | CogVideoXTransformer3DModel | Required | The pipeline's transformer model whose attention processors will be replaced |
Reconstruction call
| Parameter | Type | Default | Description |
|---|---|---|---|
| pipeline | CogVideoXPipeline | Required | Loaded CogVideoX pipeline |
| latents | torch.FloatTensor | Required | Random noise tensor (same shape as the encoded video latents) |
| scheduler | CogVideoXDDIMScheduler | Required | Forward DDIM scheduler |
| prompt | str | Required | Edit prompt describing the desired output |
| reference_latents | torch.FloatTensor | Required | Reversed inversion trajectory of shape [num_steps, B, T, C, H', W'] |
Outputs
| Output | Type | Description |
|---|---|---|
| reconstruction_trajectory | torch.FloatTensor | Reconstruction trajectory of shape [num_steps, B, T, C, H', W']; the final step contains the edited video latents |
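The trajectory layout can be made concrete with a small sketch. The sizes below are hypothetical and chosen only to keep the example tiny; real CogVideoX latents are far larger.

```python
import numpy as np

# Hypothetical, tiny sizes purely to illustrate the trajectory layout.
num_steps, B, T, C, Hp, Wp = 5, 1, 2, 4, 6, 9
trajectory = np.zeros((num_steps, B, T, C, Hp, Wp), dtype=np.float32)

final_latents = trajectory[-1]    # last step: the edited video latents
reversed_traj = trajectory[::-1]  # step order reversed, as passed to reference_latents

print(final_latents.shape)  # (1, 2, 4, 6, 9)
print(reversed_traj.shape)  # (5, 1, 2, 4, 6, 9)
```

Indexing the final step (`trajectory[-1]`) mirrors the `reconstruction[-1]` used when exporting the edited video.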
Usage Examples
Example 1: Full video editing pipeline
```python
import torch
from diffusers import CogVideoXPipeline, CogVideoXDDIMScheduler, DDIMInverseScheduler
from inference.ddim_inversion import (
    get_video_frames, encode_video_frames, sample,
    OverrideAttnProcessors, export_latents_to_video,
)

# Load pipeline
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
).to("cuda")

# Prepare schedulers
inverse_scheduler = DDIMInverseScheduler.from_config(pipe.scheduler.config)
forward_scheduler = CogVideoXDDIMScheduler.from_config(pipe.scheduler.config)

# Load and encode the source video
video_frames = get_video_frames("input.mp4")
latents = encode_video_frames(pipe.vae, video_frames)

# Step 1: DDIM inversion of the source video
inversion_trajectory = sample(
    pipe, latents, inverse_scheduler, prompt="", num_inference_steps=50
)

# Step 2: prompted reconstruction with the edit
reversed_trajectory = inversion_trajectory.flip(0)
with OverrideAttnProcessors(pipe.transformer):
    reconstruction = sample(
        pipe,
        torch.randn_like(latents),
        forward_scheduler,
        prompt="A dog playing in snow",
        reference_latents=reversed_trajectory,
        num_inference_steps=50,
    )

# Export the edited video
export_latents_to_video(pipe, reconstruction[-1], "edited_output.mp4")
```
Example 2: Using the context manager pattern
```python
# The OverrideAttnProcessors context manager ensures the
# original processors are restored after reconstruction.
with OverrideAttnProcessors(pipe.transformer):
    # Inside: attention processors are replaced with the DDIM inversion variants
    result = sample(pipe, noise, forward_scheduler, "new prompt",
                    reference_latents=ref)
# Outside: original processors are restored;
# the pipeline can be used normally for other tasks.
```
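The swap-and-restore pattern the context manager relies on can be sketched without any diffusers dependency. Every name below (DummyBlock, InjectingProcessor, OverrideProcessors) is a stand-in invented for illustration; the real implementation operates on the transformer's attention modules.

```python
class InjectingProcessor:
    """Stand-in for the DDIM inversion attention processor variant."""

class DummyBlock:
    """Stand-in for a transformer block holding an attention processor."""
    def __init__(self):
        self.processor = "default"  # stand-in for the stock processor

class OverrideProcessors:
    """Minimal sketch of the save / swap / restore pattern."""
    def __init__(self, blocks):
        self.blocks = blocks
        self._saved = {}

    def __enter__(self):
        # On entry: remember each original processor, then swap in the variant.
        for i, block in enumerate(self.blocks):
            self._saved[i] = block.processor
            block.processor = InjectingProcessor()
        return self

    def __exit__(self, *exc):
        # On exit: restore the originals, even if an exception was raised.
        for i, block in enumerate(self.blocks):
            block.processor = self._saved[i]

blocks = [DummyBlock(), DummyBlock()]
with OverrideProcessors(blocks):
    inside = [type(b.processor).__name__ for b in blocks]
outside = [b.processor for b in blocks]
print(inside, outside)  # processors swapped inside, restored outside
```

Because restoration happens in `__exit__`, the originals come back even when the body raises, which is why the pipeline remains usable for other tasks afterwards.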
Related Pages
- Principle:Zai_org_CogVideo_Prompted_Reconstruction -- Principle governing attention injection and prompted reconstruction
- Environment:Zai_org_CogVideo_Diffusers_Inference_Environment
- Zai_org_CogVideo_DDIM_Inversion_Sample -- Previous step: DDIM inversion producing the reference trajectory
- Zai_org_CogVideo_DDIM_Export_Latents_To_Video -- Next step: exporting the edited video
- Zai_org_CogVideo_DDIM_CogVideoXPipeline_From_Pretrained -- Pipeline loading