Implementation:Zai org CogVideo DDIM Inversion Sample
| Attribute | Value |
|---|---|
| Implementation Name | DDIM Inversion Sample |
| Workflow | Video Editing DDIM Inversion |
| Step | 4 of 6 |
| Type | API Doc |
| Source File | inference/ddim_inversion.py:L321-452, inference/ddim_inversion.py:L489-498 |
| Repository | zai-org/CogVideo |
| External Dependencies | diffusers, torch |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Implementation of the DDIM inversion sampling function. The sample function serves a dual purpose: it performs DDIM inversion (when called with the inverse scheduler and an empty prompt) and forward DDIM reconstruction (when called with the forward scheduler and the edit prompt). The function stores the full latent trajectory for later use in attention injection.
Description
The sample function implements the core DDIM loop:
- Sets up the scheduler timesteps for the specified number of inference steps
- Encodes the prompt (or an empty string for inversion) using the pipeline's text encoder
- Iterates over the timesteps; at each step it:
  - Concatenates the latent with itself for classifier-free guidance (if guidance_scale > 1)
  - Runs the transformer forward pass to predict noise
  - Applies CFG to combine the conditional and unconditional predictions
  - Steps the scheduler (forward for reconstruction, inverse for inversion)
  - Stores the latent in the trajectory
- Returns the complete trajectory tensor

For inversion specifically (lines L489-498): the function is called with DDIMInverseScheduler, an empty prompt, and reference_latents=None. The resulting trajectory is then reversed and passed as reference_latents to the reconstruction call.
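The loop above can be sketched structurally as follows. This is a minimal sketch with NumPy stand-ins: predict_noise and step are toy placeholders, not the real CogVideoX transformer or scheduler, and only the control flow (CFG combination, trajectory accumulation) mirrors the documented function.

```python
import numpy as np

def toy_sample(latent, timesteps, predict_noise, step, guidance_scale=6.0):
    """Toy version of the DDIM loop: predict noise, apply CFG, step, record."""
    trajectory = []
    for t in timesteps:
        if guidance_scale > 1:
            # Classifier-free guidance: unconditional and conditional predictions
            noise_uncond = predict_noise(latent, t, cond=False)
            noise_cond = predict_noise(latent, t, cond=True)
            noise = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
        else:
            noise = predict_noise(latent, t, cond=True)
        latent = step(latent, t, noise)  # forward or inverse scheduler step
        trajectory.append(latent)        # store latent at this timestep
    return np.stack(trajectory)          # [num_steps, *latent.shape]

# Dummy stand-ins so the sketch runs end to end
rng = np.random.default_rng(0)
predict_noise = lambda x, t, cond: 0.1 * x + (0.01 if cond else 0.0)
step = lambda x, t, eps: x - 0.1 * eps   # placeholder update, not real DDIM math
traj = toy_sample(rng.standard_normal((1, 2, 4, 4)), range(50), predict_noise, step)
print(traj.shape)  # (50, 1, 2, 4, 4)
```

Note that during inversion the function is typically run with guidance_scale=1.0, so only the conditional (empty-prompt) branch executes.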
Usage
from inference.ddim_inversion import sample

# Inversion
inversion_trajectory = sample(
    pipeline=pipe,
    latents=encoded_latents,
    scheduler=inverse_scheduler,
    prompt="",
    num_inference_steps=50,
)
Code Reference
Source Location
| File | Lines | Description |
|---|---|---|
| inference/ddim_inversion.py | L321-452 | sample function (main DDIM loop) |
| inference/ddim_inversion.py | L489-498 | Inversion call site |
Signature
def sample(
    pipeline: CogVideoXPipeline,
    latents: torch.FloatTensor,
    scheduler: Union[DDIMInverseScheduler, CogVideoXDDIMScheduler],
    prompt: Optional[str] = None,
    num_inference_steps: int = 50,
    guidance_scale: float = 6.0,
    generator: Optional[torch.Generator] = None,
    reference_latents: Optional[torch.FloatTensor] = None,
) -> torch.FloatTensor:  # trajectory [num_steps, B, T, C, H', W']
Import
from inference.ddim_inversion import sample
I/O Contract
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| pipeline | CogVideoXPipeline | Required | Loaded CogVideoX pipeline with transformer, text encoder, and VAE |
| latents | torch.FloatTensor | Required | Starting latents: encoded video for inversion, or random noise for reconstruction |
| scheduler | Union[DDIMInverseScheduler, CogVideoXDDIMScheduler] | Required | Inverse scheduler for inversion, forward scheduler for reconstruction |
| prompt | Optional[str] | None | Text prompt; empty string for inversion, edit prompt for reconstruction |
| num_inference_steps | int | 50 | Number of DDIM steps |
| guidance_scale | float | 6.0 | Classifier-free guidance scale |
| generator | Optional[torch.Generator] | None | Random number generator for reproducibility |
| reference_latents | Optional[torch.FloatTensor] | None | Inversion trajectory for attention injection during reconstruction; None for inversion |
Outputs
| Output | Type | Description |
|---|---|---|
| Return value | torch.FloatTensor | Latent trajectory tensor of shape [num_steps, B, T, C, H', W'] containing the latents at each timestep |
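How the returned trajectory is typically consumed can be illustrated with a NumPy stand-in. The array here is random and its shape is only illustrative; the point is the indexing: the last entry is the fully noised latent, and reversing the step axis (the equivalent of torch's `.flip(0)`) produces the reference_latents ordering used by reconstruction.

```python
import numpy as np

# Hypothetical trajectory: 50 steps of a [B=1, T=2, C=16, H'=4, W'=4] latent
inversion_trajectory = np.random.default_rng(0).standard_normal((50, 1, 2, 16, 4, 4))

# The final entry is the fully noised latent, used to start reconstruction
noise_latent = inversion_trajectory[-1]

# Reversing the step axis yields reference_latents, so reconstruction
# step i lines up with inversion step num_steps - 1 - i
reversed_trajectory = inversion_trajectory[::-1]
assert np.array_equal(reversed_trajectory[0], inversion_trajectory[-1])
```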
Usage Examples
Example 1: DDIM inversion (finding noise representation)
from diffusers import DDIMInverseScheduler
from inference.ddim_inversion import sample, encode_video_frames, get_video_frames

# Prepare inverse scheduler
inverse_scheduler = DDIMInverseScheduler.from_config(pipe.scheduler.config)

# Load and encode video
video_frames = get_video_frames("input.mp4")
latents = encode_video_frames(pipe.vae, video_frames)

# Run inversion
inversion_trajectory = sample(
    pipeline=pipe,
    latents=latents,
    scheduler=inverse_scheduler,
    prompt="",  # Empty prompt for unconditional inversion
    num_inference_steps=50,
    guidance_scale=1.0,  # No CFG during inversion
)
# inversion_trajectory.shape: [50, 1, T, 16, H', W']
Example 2: Forward reconstruction (verifying inversion quality)
from diffusers import CogVideoXDDIMScheduler

forward_scheduler = CogVideoXDDIMScheduler.from_config(pipe.scheduler.config)

# Reverse the trajectory for reconstruction
reversed_trajectory = inversion_trajectory.flip(0)

reconstruction = sample(
    pipeline=pipe,
    latents=inversion_trajectory[-1],  # Start from the inverted noise
    scheduler=forward_scheduler,
    prompt="original prompt",
    num_inference_steps=50,
    reference_latents=reversed_trajectory,
)
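A simple way to quantify reconstruction quality after Example 2 is a per-element error between the original encoded latents and the reconstruction. The sketch below uses random NumPy stand-ins for both tensors; recon_error and the 0.1 threshold are illustrative choices, not values from the repository.

```python
import numpy as np

# Stand-ins for the original encoded latents and the reconstruction output
rng = np.random.default_rng(0)
original_latents = rng.standard_normal((1, 2, 16, 4, 4))
reconstruction = original_latents + 0.01 * rng.standard_normal(original_latents.shape)

# Mean squared error between the round-tripped latents; a small value
# indicates the inversion trajectory reproduces the input faithfully
recon_error = np.mean((reconstruction - original_latents) ** 2)
print(f"reconstruction MSE: {recon_error:.6f}")
assert recon_error < 0.1  # illustrative threshold
```

In practice one would decode both latent tensors back to pixel space before comparing, since small latent errors can be amplified or suppressed by the VAE decoder.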
Related Pages
- Principle:Zai_org_CogVideo_DDIM_Inversion -- Principle governing DDIM inversion
- Environment:Zai_org_CogVideo_Diffusers_Inference_Environment
- Zai_org_CogVideo_Encode_Video_Frames -- Previous step: video encoding that produces input latents
- Zai_org_CogVideo_DDIM_Attention_Injection_Reconstruction -- Next step: prompted reconstruction with attention injection
- Zai_org_CogVideo_DDIM_Export_Latents_To_Video -- Export step for trajectory endpoints