Principle:Zai org CogVideo Video Loading and Preprocessing
| Attribute | Value |
|---|---|
| Principle Name | Video Loading and Preprocessing |
| Workflow | Video Editing DDIM Inversion |
| Step | 1 of 6 |
| Type | Data Input |
| Repository | zai-org/CogVideo |
| Paper | CogVideoX |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Technique for loading, resizing, and normalizing video frames for input to a diffusion model's VAE encoder. Video preprocessing transforms raw video files into a standardized tensor format compatible with the CogVideoX pipeline.
Description
Video preprocessing operates through a sequence of transformations:
- Frame loading: Video frames are loaded from file using the decord library, which provides efficient GPU-accelerated video decoding.
- Frame sampling: A subset of frames is selected to match the target frame count. Supports uniform stepping via `frame_sample_step` and start/end frame skipping via `skip_frames_start` and `skip_frames_end`.
- Resizing: Frames are resized to the target resolution (`width` x `height`) using bicubic interpolation.
- Normalization: Pixel values are converted from the `[0, 255]` integer range to the `[-1, 1]` floating-point range, matching the distribution expected by the VAE encoder.
To satisfy the VAE temporal compression constraint, the number of frames F must obey (F mod 4) == 1 (e.g., 1, 5, 9, 13, ..., 81).
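The steps above can be sketched as follows, assuming frames have already been decoded into a `(T, H, W, C)` uint8 array (e.g., by decord). The `preprocess_frames` helper and its defaults are illustrative, not the repository's actual function; bicubic resizing is omitted and noted in a comment.

```python
import numpy as np

def preprocess_frames(
    frames: np.ndarray,          # (T, H, W, C) uint8 array of decoded frames
    frame_sample_step: int = 1,  # uniform stepping: take every k-th frame
    skip_frames_start: int = 0,  # frames to drop at the start
    skip_frames_end: int = 0,    # frames to drop at the end
) -> np.ndarray:
    """Sample frames, trim to a valid length (F mod 4 == 1), normalize to [-1, 1]."""
    # 1. Skip start/end frames, then take every frame_sample_step-th frame.
    end = frames.shape[0] - skip_frames_end
    sampled = frames[skip_frames_start:end:frame_sample_step]

    # 2. Trim to the largest F' <= F with F' mod 4 == 1 (VAE temporal constraint).
    f = sampled.shape[0]
    valid = max(1, ((f - 1) // 4) * 4 + 1)
    sampled = sampled[:valid]

    # 3. Normalize uint8 [0, 255] -> float32 [-1, 1].
    #    (In the real pipeline, frames are also resized to width x height with
    #    bicubic interpolation before normalization; omitted in this sketch.)
    return sampled.astype(np.float32) / 255.0 * 2.0 - 1.0

# Example: 50 synthetic frames, skip the first 2, sample every 2nd frame.
video = np.random.randint(0, 256, size=(50, 32, 32, 3), dtype=np.uint8)
out = preprocess_frames(video, frame_sample_step=2, skip_frames_start=2)
print(out.shape)  # (21, 32, 32, 3): 48 remaining frames, step 2 -> 24, trimmed to 21
```

The trimming in step 2 means a caller can pass any sampling parameters and still get a tensor the VAE encoder will accept.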
Usage
Use Video Loading and Preprocessing as the first step of the DDIM inversion pipeline, before encoding frames to latent space. The output tensor is passed directly to the VAE encoder.
Theoretical Basis
The VAE temporal compression requires the input frame count F to satisfy (F mod 4) == 1. This constraint arises from the 3D VAE architecture, which performs 4x temporal downsampling via strided convolutions; the +1 accounts for the temporal boundary handling in the encoder.
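A minimal helper for clamping an arbitrary frame count to the nearest valid value below it (the function name is illustrative, not from the repository):

```python
def nearest_valid_frame_count(f: int) -> int:
    # Largest F' <= f with F' mod 4 == 1 (valid lengths: 1, 5, 9, 13, ..., 81).
    return max(1, ((f - 1) // 4) * 4 + 1)

print(nearest_valid_frame_count(100))  # 97
print(nearest_valid_frame_count(5))    # 5 (already valid)
```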
Normalization to [-1, 1] matches the training distribution expected by the VAE encoder:
x_normalized = (x / 255.0) * 2.0 - 1.0
This centering around zero is standard for diffusion model VAEs, as it aligns with the Gaussian prior assumption in the latent space.
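A worked example of this normalization and its inverse (used when decoding latents back to pixels); the variable names are illustrative:

```python
import numpy as np

pixels = np.array([0.0, 64.0, 127.5, 255.0], dtype=np.float32)

# Forward: [0, 255] -> [-1, 1], centered at zero for the VAE.
normalized = pixels / 255.0 * 2.0 - 1.0
print(normalized)  # [-1.  ~-0.498  0.  1.]

# Inverse: [-1, 1] -> [0, 255], applied after VAE decoding.
recovered = (normalized + 1.0) / 2.0 * 255.0
```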
Frame sampling strategies trade temporal coverage against computational cost:
- Uniform stepping: Samples every `k`-th frame, providing even temporal coverage.
- Start/end skipping: Removes potentially uninformative frames (e.g., title cards, fade-outs).
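Both strategies reduce to simple index arithmetic over the decoded frame range; a sketch using the parameter names from the description above (the helper itself is illustrative):

```python
def sample_indices(total_frames: int,
                   frame_sample_step: int,
                   skip_frames_start: int,
                   skip_frames_end: int) -> list[int]:
    # Drop skipped frames at both ends, then take every frame_sample_step-th index.
    return list(range(skip_frames_start,
                      total_frames - skip_frames_end,
                      frame_sample_step))

print(sample_indices(20, 3, 2, 2))  # [2, 5, 8, 11, 14, 17]
```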
Related Pages
- Implementation:Zai_org_CogVideo_Get_Video_Frames -- Implementation of video loading and preprocessing
- Zai_org_CogVideo_Video_Encoding -- Next step: encoding preprocessed frames to latent space
- Zai_org_CogVideo_DDIM_Pipeline_Loading -- Pipeline loading that provides the VAE for encoding