Implementation:Zai org CogVideo Encode Video Frames
| Attribute | Value |
|---|---|
| Implementation Name | Encode Video Frames |
| Workflow | Video Editing DDIM Inversion |
| Step | 3 of 6 |
| Type | API Doc |
| Source File | inference/ddim_inversion.py:L303-309 |
| Repository | zai-org/CogVideo |
| External Dependencies | diffusers (AutoencoderKLCogVideoX) |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Implementation of video frame encoding using the CogVideoX 3D VAE. The encode_video_frames function converts preprocessed video frames from pixel space into the latent representation used by the diffusion model.
Description
The encode_video_frames function:
- Accepts preprocessed video frames as a [F, C, H, W] tensor
- Rearranges the tensor to match the VAE's expected input format
- Passes the frames through the VAE encoder to obtain latent distribution parameters
- Samples from the distribution and applies the scaling factor
- Returns the latent tensor in [B, T, C, H', W'] format
The function wraps the VAE's encode method and handles the scaling factor application in a single call.
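The steps above can be sketched as follows. This is a minimal reconstruction based on the description in this section, not a verbatim copy of inference/ddim_inversion.py; the exact permute order and dtype handling are assumptions.

```python
import torch

def encode_video_frames(vae, video_frames: torch.FloatTensor) -> torch.FloatTensor:
    """Sketch: encode [F, C, H, W] pixel frames into [B, T, C, H', W'] latents.

    `vae` is expected to be the AutoencoderKLCogVideoX from the loaded pipeline.
    """
    # Add a batch dim and move channels ahead of frames: [F, C, H, W] -> [1, C, F, H, W],
    # the layout a 3D video VAE encoder expects (assumption based on the description above)
    x = video_frames.unsqueeze(0).permute(0, 2, 1, 3, 4)
    x = x.to(device=vae.device, dtype=vae.dtype)
    # Encode, sample from the posterior distribution, then swap back to [B, T, C, H', W']
    latents = vae.encode(x).latent_dist.sample().permute(0, 2, 1, 3, 4)
    # Apply the VAE scaling factor so the latents match the diffusion model's scale
    return latents * vae.config.scaling_factor
```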
Usage
```python
from inference.ddim_inversion import encode_video_frames

latents = encode_video_frames(pipe.vae, video_frames)
# latents shape: [B, T, C, H', W']
```
Code Reference
Source Location
| File | Lines | Description |
|---|---|---|
| inference/ddim_inversion.py | L303-309 | encode_video_frames function |
Signature
```python
def encode_video_frames(
    vae: AutoencoderKLCogVideoX,
    video_frames: torch.FloatTensor,  # [F, C, H, W]
) -> torch.FloatTensor:  # [B, T, C, H', W']
    ...
```
Import
```python
from inference.ddim_inversion import encode_video_frames
```
I/O Contract
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| vae | AutoencoderKLCogVideoX | Required | The 3D VAE from the loaded CogVideoX pipeline (pipe.vae) |
| video_frames | torch.FloatTensor | Required | Preprocessed video frames of shape [F, C, H, W] with values in [-1, 1] |
Outputs
| Output | Type | Description |
|---|---|---|
| Return value | torch.FloatTensor | Latent tensor of shape [B, T, C, H', W'], where H' = H // 8, W' = W // 8, and T is the temporally compressed frame count |
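For concreteness, the output shape for the common 49-frame, 480x720 case works out as follows. The 4x temporal compression with the first frame kept uncompressed is an assumption about the CogVideoX 3D VAE; the 8x spatial factor comes from the table above.

```python
# Latent shape arithmetic for a 49-frame, 480x720 input (16 latent channels)
F, H, W = 49, 480, 720
H_lat, W_lat = H // 8, W // 8   # 8x spatial downsampling
T = (F - 1) // 4 + 1            # assumed 4x temporal compression, first frame kept
print([1, T, 16, H_lat, W_lat])
```

This matches the latent shape quoted in Example 1 below for the same input.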
Usage Examples
Example 1: Basic video encoding
```python
from inference.ddim_inversion import get_video_frames, encode_video_frames
from diffusers import CogVideoXPipeline
import torch

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
).to("cuda")

# Load and preprocess video
video_frames = get_video_frames("input.mp4", width=720, height=480)

# Encode to latent space
latents = encode_video_frames(pipe.vae, video_frames)
# latents.shape: [1, T, 16, 60, 90] for 480x720 input
```
Example 2: Encoding as part of the inversion pipeline
```python
from inference.ddim_inversion import get_video_frames, encode_video_frames, sample

# Encode video frames
video_frames = get_video_frames(video_path, width=720, height=480, max_num_frames=49)
latents = encode_video_frames(pipe.vae, video_frames)

# latents are now ready for DDIM inversion (step 4 of this workflow)
inverted = sample(pipe, latents, inverse_scheduler, prompt="")
```
Related Pages
- Principle:Zai_org_CogVideo_Video_Encoding -- Principle governing video encoding with the 3D VAE
- Environment:Zai_org_CogVideo_Diffusers_Inference_Environment
- Zai_org_CogVideo_Get_Video_Frames -- Previous step: video loading and preprocessing
- Zai_org_CogVideo_DDIM_Inversion_Sample -- Next step: DDIM inversion of the encoded latents
- Zai_org_CogVideo_DDIM_Export_Latents_To_Video -- Decoding that inverts this encoding step