
Implementation:Zai org CogVideo Encode Video Frames

From Leeroopedia


| Attribute | Value |
|---|---|
| Implementation Name | Encode Video Frames |
| Workflow | Video Editing DDIM Inversion |
| Step | 3 of 6 |
| Type | API Doc |
| Source File | inference/ddim_inversion.py:L303-309 |
| Repository | zai-org/CogVideo |
| External Dependencies | diffusers (AutoencoderKLCogVideoX) |
| Last Updated | 2026-02-10 00:00 GMT |

Overview

Implementation of video frame encoding using the CogVideoX 3D VAE. The encode_video_frames function converts preprocessed video frames from pixel space into the latent representation used by the diffusion model.

Description

The encode_video_frames function:

  1. Accepts preprocessed video frames as a [F, C, H, W] tensor
  2. Rearranges the tensor to match the VAE's expected input format
  3. Passes through the VAE encoder to obtain latent distribution parameters
  4. Samples from the distribution and applies the scaling factor
  5. Returns the latent tensor in [B, T, C, H', W'] format

The function wraps the VAE's encode method and applies the scaling factor in a single call, so callers do not need to handle the latent distribution themselves.
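The five steps above can be sketched as follows. This is a minimal illustration rather than the repository's code: the helper name `encode_video_frames_sketch` is ours, and the batch/permute layout is inferred from the documented shapes, while the `encode(...).latent_dist.sample()` and `config.scaling_factor` calls follow diffusers' standard AutoencoderKL API.

```python
import torch


def encode_video_frames_sketch(vae, video_frames: torch.Tensor) -> torch.Tensor:
    """Sketch of the encoding steps (assumed layout, not the repo's exact code).

    video_frames: [F, C, H, W] tensor with values in [-1, 1]
    returns:      [B, T, C', H', W'] latent tensor
    """
    # Steps 1-2: add a batch dim and move channels before frames,
    # [F, C, H, W] -> [1, C, F, H, W], the layout a 3D VAE expects.
    x = video_frames.unsqueeze(0).permute(0, 2, 1, 3, 4)
    x = x.to(vae.device, dtype=vae.dtype)

    # Steps 3-4: encode, sample from the latent distribution,
    # and apply the VAE's scaling factor.
    with torch.no_grad():
        latents = vae.encode(x).latent_dist.sample() * vae.config.scaling_factor

    # Step 5: return in [B, T, C', H', W'] order, matching the documented output.
    return latents.permute(0, 2, 1, 3, 4)
```

The sketch assumes the VAE compresses spatially and temporally inside `encode`; only the shape bookkeeping around that call is shown here.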

Usage

from inference.ddim_inversion import encode_video_frames

latents = encode_video_frames(pipe.vae, video_frames)
# latents shape: [B, T, C, H', W']

Code Reference

Source Location

| File | Lines | Description |
|---|---|---|
| inference/ddim_inversion.py | L303-309 | encode_video_frames function |

Signature

def encode_video_frames(
    vae: AutoencoderKLCogVideoX,
    video_frames: torch.FloatTensor  # [F, C, H, W]
) -> torch.FloatTensor:  # [B, T, C, H', W']

Import

from inference.ddim_inversion import encode_video_frames

I/O Contract

Inputs

| Parameter | Type | Default | Description |
|---|---|---|---|
| vae | AutoencoderKLCogVideoX | Required | The 3D VAE from the loaded CogVideoX pipeline (pipe.vae) |
| video_frames | torch.FloatTensor | Required | Preprocessed video frames of shape [F, C, H, W] with values in [-1, 1] |

Outputs

| Output | Type | Description |
|---|---|---|
| Return value | torch.FloatTensor | Latent tensor of shape [B, T, C, H', W'] where H' = H // 8, W' = W // 8, and T is the temporally compressed frame count |
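The output dimensions in the table follow directly from the VAE's compression ratios. The sketch below computes the expected latent shape; the 8x spatial and 4x temporal factors, the first frame being encoded on its own, and the 16 latent channels are CogVideoX defaults, and the helper name is ours.

```python
def latent_shape(num_frames: int, height: int, width: int,
                 spatial_ratio: int = 8, temporal_ratio: int = 4,
                 latent_channels: int = 16) -> tuple:
    """Expected [B, T, C, H', W'] latent shape for CogVideoX's 3D VAE.

    The first frame is kept as-is and the remaining frames are
    compressed by `temporal_ratio` (assumed CogVideoX behavior).
    """
    t = (num_frames - 1) // temporal_ratio + 1
    return (1, t, latent_channels, height // spatial_ratio, width // spatial_ratio)


print(latent_shape(49, 480, 720))  # -> (1, 13, 16, 60, 90)
```

For the 49-frame, 480x720 input used in the examples below, this matches the [1, 13, 16, 60, 90] latent tensor the encoder produces.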

Usage Examples

Example 1: Basic video encoding

from inference.ddim_inversion import get_video_frames, encode_video_frames
from diffusers import CogVideoXPipeline
import torch

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
).to("cuda")

# Load and preprocess video
video_frames = get_video_frames("input.mp4", width=720, height=480)

# Encode to latent space
latents = encode_video_frames(pipe.vae, video_frames)
# latents.shape: [1, T, 16, 60, 90] for 480x720 input

Example 2: Encoding as part of the inversion pipeline

# Encode video frames
video_frames = get_video_frames(video_path, width=720, height=480, max_num_frames=49)
latents = encode_video_frames(pipe.vae, video_frames)

# latents are now ready for DDIM inversion
inverted = sample(pipe, latents, inverse_scheduler, prompt="")
