Implementation:Zai org CogVideo DDIM Inversion Sample
| Attribute | Value |
|---|---|
| Implementation Name | DDIM Inversion Sample |
| Workflow | Video Editing DDIM Inversion |
| Step | 4 of 6 |
| Type | API Doc |
| Source File | inference/ddim_inversion.py:L321-452, inference/ddim_inversion.py:L489-498 |
| Repository | zai-org/CogVideo |
| External Dependencies | diffusers, torch |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Implementation of the DDIM inversion sampling function. The sample function serves a dual purpose: it performs DDIM inversion (when called with the inverse scheduler and an empty prompt) and forward DDIM reconstruction (when called with the forward scheduler and the edit prompt). The function stores the full latent trajectory for later use in attention injection.
Description
The sample function implements the core DDIM loop:
- Sets up the scheduler timesteps for the specified number of inference steps
- Encodes the prompt (or an empty string for inversion) using the pipeline's text encoder
- Iterates over the timesteps; at each step it:
  - Concatenates the latent with itself for classifier-free guidance (if guidance_scale > 1)
  - Runs the transformer forward pass to predict noise
  - Applies CFG to combine the conditional and unconditional predictions
  - Steps the scheduler (forward for reconstruction, inverse for inversion)
  - Stores the latent in the trajectory
- Returns the complete trajectory tensor

For inversion specifically (lines L489-498): the function is called with DDIMInverseScheduler, an empty prompt, and reference_latents=None. The resulting trajectory is then reversed and passed as reference_latents to the reconstruction call.
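The loop above can be sketched structurally as follows. This is a minimal sketch with NumPy stand-ins: predict_noise and step are toy placeholders, not the real CogVideoX transformer or scheduler, and only the control flow (CFG combination, trajectory accumulation) mirrors the documented function.

```python
import numpy as np

def toy_sample(latent, timesteps, predict_noise, step, guidance_scale=6.0):
    """Toy version of the DDIM loop: predict noise, apply CFG, step, record."""
    trajectory = []
    for t in timesteps:
        if guidance_scale > 1:
            # Classifier-free guidance: unconditional and conditional predictions
            noise_uncond = predict_noise(latent, t, cond=False)
            noise_cond = predict_noise(latent, t, cond=True)
            noise = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
        else:
            noise = predict_noise(latent, t, cond=True)
        latent = step(latent, t, noise)  # forward or inverse scheduler step
        trajectory.append(latent)        # store latent at this timestep
    return np.stack(trajectory)          # [num_steps, *latent.shape]

# Dummy stand-ins so the sketch runs end to end
rng = np.random.default_rng(0)
predict_noise = lambda x, t, cond: 0.1 * x + (0.01 if cond else 0.0)
step = lambda x, t, eps: x - 0.1 * eps   # placeholder update, not real DDIM math
traj = toy_sample(rng.standard_normal((1, 2, 4, 4)), range(50), predict_noise, step)
print(traj.shape)  # (50, 1, 2, 4, 4)
```

Note that during inversion the function is typically run with guidance_scale=1.0, so only the conditional (empty-prompt) branch executes.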
Usage
from inference.ddim_inversion import sample

# Inversion
inversion_trajectory = sample(
    pipeline=pipe,
    latents=encoded_latents,
    scheduler=inverse_scheduler,
    prompt="",
    num_inference_steps=50,
)
Code Reference
Source Location
| File | Lines | Description |
|---|---|---|
| inference/ddim_inversion.py | L321-452 | sample function (main DDIM loop) |
| inference/ddim_inversion.py | L489-498 | Inversion call site |
Signature
def sample(
    pipeline: CogVideoXPipeline,
    latents: torch.FloatTensor,
    scheduler: Union[DDIMInverseScheduler, CogVideoXDDIMScheduler],
    prompt: Optional[str] = None,
    num_inference_steps: int = 50,
    guidance_scale: float = 6.0,
    generator: Optional[torch.Generator] = None,
    reference_latents: Optional[torch.FloatTensor] = None,
) -> torch.FloatTensor:  # trajectory [num_steps, B, T, C, H', W']
Import
from inference.ddim_inversion import sample
I/O Contract
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| pipeline | CogVideoXPipeline | Required | Loaded CogVideoX pipeline with transformer, text encoder, and VAE |
| latents | torch.FloatTensor | Required | Starting latents: encoded video for inversion, or random noise for reconstruction |
| scheduler | Union[DDIMInverseScheduler, CogVideoXDDIMScheduler] | Required | Inverse scheduler for inversion, forward scheduler for reconstruction |
| prompt | Optional[str] | None | Text prompt; empty string for inversion, edit prompt for reconstruction |
| num_inference_steps | int | 50 | Number of DDIM steps |
| guidance_scale | float | 6.0 | Classifier-free guidance scale |
| generator | Optional[torch.Generator] | None | Random number generator for reproducibility |
| reference_latents | Optional[torch.FloatTensor] | None | Inversion trajectory for attention injection during reconstruction; None for inversion |
Outputs
| Output | Type | Description |
|---|---|---|
| Return value | torch.FloatTensor | Latent trajectory tensor of shape [num_steps, B, T, C, H', W'] containing the latents at each timestep |
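How the returned trajectory is typically consumed can be illustrated with a NumPy stand-in. The array here is random and its shape is only illustrative; the point is the indexing: the last entry is the fully noised latent, and reversing the step axis (the equivalent of torch's `.flip(0)`) produces the reference_latents ordering used by reconstruction.

```python
import numpy as np

# Hypothetical trajectory: 50 steps of a [B=1, T=2, C=16, H'=4, W'=4] latent
inversion_trajectory = np.random.default_rng(0).standard_normal((50, 1, 2, 16, 4, 4))

# The final entry is the fully noised latent, used to start reconstruction
noise_latent = inversion_trajectory[-1]

# Reversing the step axis yields reference_latents, so reconstruction
# step i lines up with inversion step num_steps - 1 - i
reversed_trajectory = inversion_trajectory[::-1]
assert np.array_equal(reversed_trajectory[0], inversion_trajectory[-1])
```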
Usage Examples
Example 1: DDIM inversion (finding noise representation)
from diffusers import DDIMInverseScheduler
from inference.ddim_inversion import sample, encode_video_frames, get_video_frames

# Prepare inverse scheduler
inverse_scheduler = DDIMInverseScheduler.from_config(pipe.scheduler.config)

# Load and encode video
video_frames = get_video_frames("input.mp4")
latents = encode_video_frames(pipe.vae, video_frames)

# Run inversion
inversion_trajectory = sample(
    pipeline=pipe,
    latents=latents,
    scheduler=inverse_scheduler,
    prompt="",  # Empty prompt for unconditional inversion
    num_inference_steps=50,
    guidance_scale=1.0,  # No CFG during inversion
)
# inversion_trajectory.shape: [50, 1, T, 16, H', W']
Example 2: Forward reconstruction (verifying inversion quality)
from diffusers import CogVideoXDDIMScheduler

forward_scheduler = CogVideoXDDIMScheduler.from_config(pipe.scheduler.config)

# Reverse the trajectory for reconstruction
reversed_trajectory = inversion_trajectory.flip(0)

reconstruction = sample(
    pipeline=pipe,
    latents=inversion_trajectory[-1],  # Start from the inverted noise
    scheduler=forward_scheduler,
    prompt="original prompt",
    num_inference_steps=50,
    reference_latents=reversed_trajectory,
)
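A simple way to quantify reconstruction quality after Example 2 is a per-element error between the original encoded latents and the reconstruction. The sketch below uses random NumPy stand-ins for both tensors; recon_error and the 0.1 threshold are illustrative choices, not values from the repository.

```python
import numpy as np

# Stand-ins for the original encoded latents and the reconstruction output
rng = np.random.default_rng(0)
original_latents = rng.standard_normal((1, 2, 16, 4, 4))
reconstruction = original_latents + 0.01 * rng.standard_normal(original_latents.shape)

# Mean squared error between the round-tripped latents; a small value
# indicates the inversion trajectory reproduces the input faithfully
recon_error = np.mean((reconstruction - original_latents) ** 2)
print(f"reconstruction MSE: {recon_error:.6f}")
assert recon_error < 0.1  # illustrative threshold
```

In practice one would decode both latent tensors back to pixel space before comparing, since small latent errors can be amplified or suppressed by the VAE decoder.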
Related Pages
- Principle:Zai_org_CogVideo_DDIM_Inversion -- Principle governing DDIM inversion
- Environment:Zai_org_CogVideo_Diffusers_Inference_Environment
- Zai_org_CogVideo_Encode_Video_Frames -- Previous step: video encoding that produces input latents
- Zai_org_CogVideo_DDIM_Attention_Injection_Reconstruction -- Next step: prompted reconstruction with attention injection
- Zai_org_CogVideo_DDIM_Export_Latents_To_Video -- Export step for trajectory endpoints