Implementation:Zai org CogVideo Encode Video Frames
| Attribute | Value |
|---|---|
| Implementation Name | Encode Video Frames |
| Workflow | Video Editing DDIM Inversion |
| Step | 3 of 6 |
| Type | API Doc |
| Source File | inference/ddim_inversion.py:L303-309 |
| Repository | zai-org/CogVideo |
| External Dependencies | diffusers (AutoencoderKLCogVideoX) |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Implementation of video frame encoding using the CogVideoX 3D VAE. The encode_video_frames function converts preprocessed video frames from pixel space into the latent representation used by the diffusion model.
Description
The encode_video_frames function:
- Accepts preprocessed video frames as a [F, C, H, W] tensor
- Rearranges the tensor to match the VAE's expected input format
- Passes the frames through the VAE encoder to obtain latent distribution parameters
- Samples from the distribution and applies the scaling factor
- Returns the latent tensor in [B, T, C, H', W'] format
The function wraps the VAE's encode method and handles the scaling factor application in a single call.
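The steps above can be sketched as follows. This is a minimal reconstruction based on the description in this section, not a verbatim copy of inference/ddim_inversion.py; the exact permute order and dtype handling are assumptions.

```python
import torch

def encode_video_frames(vae, video_frames: torch.FloatTensor) -> torch.FloatTensor:
    """Sketch: encode [F, C, H, W] pixel frames into [B, T, C, H', W'] latents.

    `vae` is expected to be the AutoencoderKLCogVideoX from the loaded pipeline.
    """
    # Add a batch dim and move channels ahead of frames: [F, C, H, W] -> [1, C, F, H, W],
    # the layout a 3D video VAE encoder expects (assumption based on the description above)
    x = video_frames.unsqueeze(0).permute(0, 2, 1, 3, 4)
    x = x.to(device=vae.device, dtype=vae.dtype)
    # Encode, sample from the posterior distribution, then swap back to [B, T, C, H', W']
    latents = vae.encode(x).latent_dist.sample().permute(0, 2, 1, 3, 4)
    # Apply the VAE scaling factor so the latents match the diffusion model's scale
    return latents * vae.config.scaling_factor
```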
Usage
```python
from inference.ddim_inversion import encode_video_frames

latents = encode_video_frames(pipe.vae, video_frames)
# latents shape: [B, T, C, H', W']
```
Code Reference
Source Location
| File | Lines | Description |
|---|---|---|
| inference/ddim_inversion.py | L303-309 | encode_video_frames function |
Signature
```python
def encode_video_frames(
    vae: AutoencoderKLCogVideoX,
    video_frames: torch.FloatTensor,  # [F, C, H, W]
) -> torch.FloatTensor:  # [B, T, C, H', W']
    ...
```
Import
```python
from inference.ddim_inversion import encode_video_frames
```
I/O Contract
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| vae | AutoencoderKLCogVideoX | Required | The 3D VAE from the loaded CogVideoX pipeline (pipe.vae) |
| video_frames | torch.FloatTensor | Required | Preprocessed video frames of shape [F, C, H, W] with values in [-1, 1] |
Outputs
| Output | Type | Description |
|---|---|---|
| Return value | torch.FloatTensor | Latent tensor of shape [B, T, C, H', W'], where H' = H // 8, W' = W // 8, and T is the temporally compressed frame count |
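For concreteness, the output shape for the common 49-frame, 480x720 case works out as follows. The 4x temporal compression with the first frame kept uncompressed is an assumption about the CogVideoX 3D VAE; the 8x spatial factor comes from the table above.

```python
# Latent shape arithmetic for a 49-frame, 480x720 input (16 latent channels)
F, H, W = 49, 480, 720
H_lat, W_lat = H // 8, W // 8   # 8x spatial downsampling
T = (F - 1) // 4 + 1            # assumed 4x temporal compression, first frame kept
print([1, T, 16, H_lat, W_lat])
```

This matches the latent shape quoted in Example 1 below for the same input.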
Usage Examples
Example 1: Basic video encoding
```python
from inference.ddim_inversion import get_video_frames, encode_video_frames
from diffusers import CogVideoXPipeline
import torch

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
).to("cuda")

# Load and preprocess video
video_frames = get_video_frames("input.mp4", width=720, height=480)

# Encode to latent space
latents = encode_video_frames(pipe.vae, video_frames)
# latents.shape: [1, T, 16, 60, 90] for 480x720 input
```
Example 2: Encoding as part of the inversion pipeline
```python
from inference.ddim_inversion import get_video_frames, encode_video_frames, sample

# Encode video frames
video_frames = get_video_frames(video_path, width=720, height=480, max_num_frames=49)
latents = encode_video_frames(pipe.vae, video_frames)

# latents are now ready for DDIM inversion (step 4 of this workflow)
inverted = sample(pipe, latents, inverse_scheduler, prompt="")
```
Related Pages
- Principle:Zai_org_CogVideo_Video_Encoding -- Principle governing video encoding with the 3D VAE
- Environment:Zai_org_CogVideo_Diffusers_Inference_Environment
- Zai_org_CogVideo_Get_Video_Frames -- Previous step: video loading and preprocessing
- Zai_org_CogVideo_DDIM_Inversion_Sample -- Next step: DDIM inversion of the encoded latents
- Zai_org_CogVideo_DDIM_Export_Latents_To_Video -- Decoding that inverts this encoding step