Workflow:Zai org CogVideo Video Editing DDIM Inversion

Knowledge Sources	CogVideo, HuggingFace Diffusers
Domains	Video_Generation, Video_Editing, DDIM_Inversion
Last Updated	2026-02-10 12:00 GMT

Overview

End-to-end process for editing an existing video: DDIM inversion maps the video into the CogVideoX noise space, and denoising with a modified text prompt reconstructs an edited version.

Description

This workflow implements video editing through DDIM (Denoising Diffusion Implicit Models) inversion. The process maps an existing video into the noise space of the CogVideoX diffusion model, then reconstructs the video conditioned on a new text prompt. Because the inversion preserves structural information from the original video while the new prompt supplies semantic guidance, the result is a controlled edit of the video content, such as style transfer, object replacement, or scene modification. The implementation includes attention injection for better structural preservation. The workflow is designed specifically for CogVideoX-5B models.

Usage

Use this workflow when you have an existing video that you want to edit semantically through a text description, for example to change its style ("make it look like a watercolor painting"), modify objects ("replace the car with a bicycle"), or alter the environment ("change the background to a forest"). It requires CogVideoX-5B model weights and enough GPU memory to load the full model.

Execution Steps

Step 1: Video Loading and Preprocessing

Load the source video file and preprocess frames for the inversion pipeline. Video frames are loaded using decord, optionally trimmed (skip frames from start/end), subsampled to the target frame count, resized to the model's expected resolution (480x720 for CogVideoX-5B), and normalized to the [-1, 1] range expected by the VAE encoder.

Key considerations:

Frame count should follow the 8N+1 rule (typically 49 frames)
Source video is resized to 480x720 (CogVideoX-5B resolution)
`skip_frames_start` and `skip_frames_end` allow trimming
`frame_sample_step` controls temporal subsampling stride
Maximum frame count is configurable (default 81)
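
The frame selection described above can be sketched in plain Python. This is a minimal illustration rather than the repository's implementation: the helper names `select_frame_indices` and `normalize_pixel` are hypothetical, and the order in which trimming, striding, and capping are applied is an assumption.

```python
def select_frame_indices(total_frames, skip_frames_start=0, skip_frames_end=0,
                         frame_sample_step=1, max_num_frames=81):
    """Return the source-frame indices to keep (hypothetical helper)."""
    if frame_sample_step < 1:
        raise ValueError("frame_sample_step must be at least 1")
    # Trim frames from the start and end of the clip.
    start = min(skip_frames_start, total_frames)
    end = max(start, total_frames - skip_frames_end)
    # Subsample with a fixed stride, then cap at the maximum frame count.
    indices = list(range(start, end, frame_sample_step))[:max_num_frames]
    if not indices:
        return []
    # Keep the largest prefix whose length has the form 8N+1.
    keep = ((len(indices) - 1) // 8) * 8 + 1
    return indices[:keep]


def normalize_pixel(value):
    """Map an 8-bit pixel value in [0, 255] to the [-1, 1] range."""
    return value / 127.5 - 1.0


indices = select_frame_indices(120, skip_frames_start=10, skip_frames_end=10,
                               frame_sample_step=2)
print(len(indices), indices[0], indices[-1])
```

For a 120-frame clip with 10 frames trimmed from each end and a stride of 2, this keeps 49 indices, from 10 to 106.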

Step 2: Model and Pipeline Loading

Load the CogVideoXPipeline and set up the DDIM inverse scheduler alongside the standard DDIM scheduler. The pipeline components include the transformer, T5 text encoder, and 3D VAE. A custom attention processor is injected to store and replay attention maps between the inversion and reconstruction passes, enabling structural consistency.

Key considerations:

Uses CogVideoXDDIMScheduler for the forward (denoising) pass and DDIMInverseScheduler for the inverse pass
DDIMInverseScheduler is initialized from the same scheduler config
Custom attention injection replaces the default CogVideoXAttnProcessor2_0
Only compatible with CogVideoX-5B (not 2B variants)
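
A loading routine along these lines could look like the sketch below. It assumes the `diffusers` classes `CogVideoXPipeline`, `CogVideoXDDIMScheduler`, and `DDIMInverseScheduler`; the function names, the bfloat16 dtype, and the idea of recognising 5B checkpoints by their rotary positional embeddings are assumptions for illustration, not confirmed details of the workflow. Attention injection is left out here.

```python
from types import SimpleNamespace


def supports_inversion(transformer_config):
    """Assumption: 5B checkpoints enable rotary positional embeddings, 2B ones do not."""
    return bool(getattr(transformer_config, "use_rotary_positional_embeddings", False))


def load_pipeline_and_schedulers(model_path, device="cuda"):
    """Load the pipeline and build both schedulers from one config (hypothetical helper)."""
    # Imported lazily so the check above can be used without the heavy dependencies.
    import torch
    from diffusers import CogVideoXDDIMScheduler, CogVideoXPipeline, DDIMInverseScheduler

    pipeline = CogVideoXPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16)
    pipeline = pipeline.to(device)
    if not supports_inversion(pipeline.transformer.config):
        raise NotImplementedError("This sketch only handles CogVideoX-5B checkpoints.")
    # Forward (denoising) scheduler used for reconstruction.
    pipeline.scheduler = CogVideoXDDIMScheduler.from_config(pipeline.scheduler.config)
    # Inverse scheduler created from the same configuration.
    inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
    return pipeline, inverse_scheduler


# The compatibility check can be exercised with a stand-in config object.
print(supports_inversion(SimpleNamespace(use_rotary_positional_embeddings=True)))
```

Building both schedulers from one config keeps the noise schedule identical across the inversion and reconstruction passes.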

Step 3: Video Encoding

Encode the preprocessed video frames into the latent space using the 3D VAE encoder. The video tensor is passed through the encoder to produce a compressed latent representation that captures the spatial and temporal structure of the video.

Key considerations:

VAE encoding compresses the video by a factor of 8x along each spatial dimension and 4x temporally (49 frames at 480x720 become 13 latent frames at 60x90)
Latents are scaled by the VAE's scaling factor
The encoded latents serve as the starting point for DDIM inversion
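
To make the compression concrete, the latent shape can be worked out with a few lines of arithmetic. The sketch below assumes the CogVideoX VAE's 8x spatial and 4x temporal factors and 16 latent channels; `latent_shape` is a hypothetical helper, and the first frame is assumed to be encoded on its own, which is why frame counts of the form 4k+1 divide cleanly.

```python
def latent_shape(num_frames, height, width, latent_channels=16,
                 spatial_factor=8, temporal_factor=4):
    """Return (channels, frames, height, width) of the encoded latent."""
    if (num_frames - 1) % temporal_factor != 0:
        raise ValueError("num_frames - 1 must be divisible by the temporal factor")
    if height % spatial_factor or width % spatial_factor:
        raise ValueError("height and width must be divisible by the spatial factor")
    # The first frame maps to one latent frame; the rest are grouped in fours.
    latent_frames = (num_frames - 1) // temporal_factor + 1
    return (latent_channels, latent_frames,
            height // spatial_factor, width // spatial_factor)


print(latent_shape(49, 480, 720))
```

With the default 49 frames at 480x720, this gives a latent of 16 channels, 13 frames, and 60x90 spatial size, before the scaling factor is applied.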

Step 4: DDIM Inversion

Run the deterministic DDIM sampling update in reverse to map the encoded video latents to a noise representation. Starting from the clean latent, the DDIM inverse scheduler adds noise step by step in a deterministic, approximately invertible way. The result is a noise trajectory that, when denoised with the same prompt, approximately reconstructs the original video.

Key considerations:

The inversion process runs for the configured number of inference steps
Text conditioning during inversion uses the source video's description
The inverted noise captures both content and structure of the original video
Classifier-free guidance is applied during inversion
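
The idea behind the inversion can be illustrated with a scalar toy model. The sketch below is not the scheduler used by the workflow: it replaces the transformer with a hypothetical noise predictor `toy_eps`, uses an invented linear schedule `ALPHAS`, and omits text conditioning and classifier-free guidance. It only shows that the same deterministic DDIM update, run toward higher noise and then back, approximately recovers the starting value.

```python
import math


def ddim_step(x, eps, alpha_from, alpha_to):
    """One deterministic DDIM update between two cumulative-alpha levels."""
    # Estimate the clean sample implied by the current value and predicted noise.
    x0_pred = (x - math.sqrt(1.0 - alpha_from) * eps) / math.sqrt(alpha_from)
    # Re-noise that estimate to the target level with the same predicted noise.
    return math.sqrt(alpha_to) * x0_pred + math.sqrt(1.0 - alpha_to) * eps


def toy_eps(x, k):
    """Hypothetical stand-in for the model's noise prediction."""
    return 0.2 * x


def invert(x0, alphas, eps_fn):
    """Walk from the clean sample toward noise (inversion)."""
    x = x0
    for k in range(len(alphas) - 1):
        x = ddim_step(x, eps_fn(x, k), alphas[k], alphas[k + 1])
    return x


def denoise(x_noisy, alphas, eps_fn):
    """Walk from noise back to a clean sample (reconstruction)."""
    x = x_noisy
    for k in range(len(alphas) - 1, 0, -1):
        x = ddim_step(x, eps_fn(x, k), alphas[k], alphas[k - 1])
    return x


NUM_STEPS = 200
# Invented schedule: cumulative alpha falls linearly from 1.0 to 0.2.
ALPHAS = [1.0 - 0.8 * k / NUM_STEPS for k in range(NUM_STEPS + 1)]

x0 = 1.5
noisy = invert(x0, ALPHAS, toy_eps)
recon = denoise(noisy, ALPHAS, toy_eps)
print(noisy, recon)
```

The round trip is only approximate because each inversion step evaluates the noise prediction at the current point rather than at the point it is moving to; the gap shrinks as the number of steps grows.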

Step 5: Prompted Reconstruction

Denoise the inverted noise representation using a new text prompt to generate the edited video. The standard DDIM scheduler reconstructs the video from the inverted noise, conditioned on the new prompt that describes the desired edit. At specified steps, attention injection substitutes the attention maps stored during inversion for the freshly computed ones, which preserves structural coherence.

Key considerations:

The new prompt guides what changes should appear in the edited video
Attention injection strength controls the balance between editing and preservation
Guidance scale affects how strongly the new prompt influences the result
The reconstruction follows the same number of steps as the inversion
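
The two controls mentioned above, attention replay and guidance scale, can be sketched independently of the model. Everything in this example is hypothetical: the `AttentionStore` class, its `record` and `resolve` methods, and the notion of a set of injection steps are illustrative stand-ins for the custom attention processor, and plain Python lists stand in for tensors.

```python
class AttentionStore:
    """Keep attention maps from inversion and replay them at chosen steps."""

    def __init__(self, inject_steps):
        self.inject_steps = set(inject_steps)
        self.saved = {}

    def record(self, step, layer, attn_map):
        # Called during inversion: remember the map for this step and layer.
        self.saved[(step, layer)] = attn_map

    def resolve(self, step, layer, fresh_map):
        # Called during reconstruction: replay the stored map inside the
        # injection window, otherwise keep the freshly computed one.
        if step in self.inject_steps and (step, layer) in self.saved:
            return self.saved[(step, layer)]
        return fresh_map


def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Classifier-free guidance: push the prediction toward the prompt."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


store = AttentionStore(inject_steps=range(0, 2))
store.record(0, "block0", [[1.0, 0.0]])
print(store.resolve(0, "block0", [[0.5, 0.5]]))  # inside the window
print(store.resolve(3, "block0", [[0.5, 0.5]]))  # outside the window
print(apply_guidance([0.0, 1.0], [1.0, 3.0], 6.0))
```

In this sketch, widening the injection window favors preservation of the source structure, while a larger guidance scale favors the new prompt.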

Step 6: Video Export

Decode the edited latent representation through the 3D VAE decoder and export the resulting frames as an MP4 video file. The reconstructed video is always saved, and the original video can optionally be saved alongside it for comparison.

Key considerations:

Output is saved to the specified output directory
Both original and edited videos can be exported for comparison
Frame rate matches the source video configuration (default 16 fps)
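
A possible export step is sketched below, assuming the `export_to_video` utility from `diffusers.utils`, which is understood to accept PIL images or NumPy arrays with values in [0, 1]. The helper names and the output file names `original.mp4` and `edited.mp4` are hypothetical.

```python
import os


def to_unit_range(value):
    """Map a decoded value in [-1, 1] to [0, 1], clamping any overshoot."""
    clamped = max(-1.0, min(1.0, value))
    return (clamped + 1.0) / 2.0


def output_paths(output_dir):
    """Hypothetical file layout for the exported videos."""
    return {
        "original": os.path.join(output_dir, "original.mp4"),
        "edited": os.path.join(output_dir, "edited.mp4"),
    }


def save_videos(edited_frames, output_dir, fps=16, original_frames=None):
    """Write the edited video and, optionally, the original for comparison."""
    # Imported lazily so the helpers above work without diffusers installed.
    from diffusers.utils import export_to_video

    os.makedirs(output_dir, exist_ok=True)
    paths = output_paths(output_dir)
    export_to_video(edited_frames, paths["edited"], fps=fps)
    if original_frames is not None:
        export_to_video(original_frames, paths["original"], fps=fps)
    return paths


print(to_unit_range(-1.0), to_unit_range(0.0), to_unit_range(1.0))
```

Writing both files with the same frame rate makes a side-by-side comparison straightforward.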

GitHub URL

Workflow Repository