Principle:Zai org CogVideo Video Loading and Preprocessing
| Attribute | Value |
|---|---|
| Principle Name | Video Loading and Preprocessing |
| Workflow | Video Editing DDIM Inversion |
| Step | 1 of 6 |
| Type | Data Input |
| Repository | zai-org/CogVideo |
| Paper | CogVideoX |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Technique for loading, resizing, and normalizing video frames for input to a diffusion model's VAE encoder. Video preprocessing transforms raw video files into a standardized tensor format compatible with the CogVideoX pipeline.
Description
Video preprocessing operates through a sequence of transformations:
- Frame loading: Video frames are loaded from file using the decord library, which provides efficient GPU-accelerated video decoding.
- Frame sampling: A subset of frames is selected to match the target frame count. Supports uniform stepping via `frame_sample_step` and start/end frame skipping via `skip_frames_start` and `skip_frames_end`.
- Resizing: Frames are resized to the target resolution (`width` x `height`) using bicubic interpolation.
- Normalization: Pixel values are converted from the `[0, 255]` integer range to the `[-1, 1]` floating-point range, matching the distribution expected by the VAE encoder.
To satisfy the VAE temporal compression constraint, the number of frames F must obey (F mod 4) == 1 (e.g., 1, 5, 9, 13, ..., 81).
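The steps above can be sketched as follows, assuming frames have already been decoded into a `(T, H, W, C)` uint8 array (e.g., by decord). The `preprocess_frames` helper and its defaults are illustrative, not the repository's actual function; bicubic resizing is omitted and noted in a comment.

```python
import numpy as np

def preprocess_frames(
    frames: np.ndarray,          # (T, H, W, C) uint8 array of decoded frames
    frame_sample_step: int = 1,  # uniform stepping: take every k-th frame
    skip_frames_start: int = 0,  # frames to drop at the start
    skip_frames_end: int = 0,    # frames to drop at the end
) -> np.ndarray:
    """Sample frames, trim to a valid length (F mod 4 == 1), normalize to [-1, 1]."""
    # 1. Skip start/end frames, then take every frame_sample_step-th frame.
    end = frames.shape[0] - skip_frames_end
    sampled = frames[skip_frames_start:end:frame_sample_step]

    # 2. Trim to the largest F' <= F with F' mod 4 == 1 (VAE temporal constraint).
    f = sampled.shape[0]
    valid = max(1, ((f - 1) // 4) * 4 + 1)
    sampled = sampled[:valid]

    # 3. Normalize uint8 [0, 255] -> float32 [-1, 1].
    #    (In the real pipeline, frames are also resized to width x height with
    #    bicubic interpolation before normalization; omitted in this sketch.)
    return sampled.astype(np.float32) / 255.0 * 2.0 - 1.0

# Example: 50 synthetic frames, skip the first 2, sample every 2nd frame.
video = np.random.randint(0, 256, size=(50, 32, 32, 3), dtype=np.uint8)
out = preprocess_frames(video, frame_sample_step=2, skip_frames_start=2)
print(out.shape)  # (21, 32, 32, 3): 48 remaining frames, step 2 -> 24, trimmed to 21
```

The trimming in step 2 means a caller can pass any sampling parameters and still get a tensor the VAE encoder will accept.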
Usage
Use Video Loading and Preprocessing as the first step of the DDIM inversion pipeline, before encoding frames to latent space. The output tensor is passed directly to the VAE encoder.
Theoretical Basis
The VAE temporal compression requires the input frame count F to satisfy (F mod 4) == 1. This constraint arises from the 3D VAE architecture, which performs 4x temporal downsampling via strided convolutions; the +1 accounts for the temporal boundary handling in the encoder.
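A minimal helper for clamping an arbitrary frame count to the nearest valid value below it (the function name is illustrative, not from the repository):

```python
def nearest_valid_frame_count(f: int) -> int:
    # Largest F' <= f with F' mod 4 == 1 (valid lengths: 1, 5, 9, 13, ..., 81).
    return max(1, ((f - 1) // 4) * 4 + 1)

print(nearest_valid_frame_count(100))  # 97
print(nearest_valid_frame_count(5))    # 5 (already valid)
```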
Normalization to [-1, 1] matches the training distribution expected by the VAE encoder:
x_normalized = (x / 255.0) * 2.0 - 1.0
This centering around zero is standard for diffusion model VAEs, as it aligns with the Gaussian prior assumption in the latent space.
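A worked example of this normalization and its inverse (used when decoding latents back to pixels); the variable names are illustrative:

```python
import numpy as np

pixels = np.array([0.0, 64.0, 127.5, 255.0], dtype=np.float32)

# Forward: [0, 255] -> [-1, 1], centered at zero for the VAE.
normalized = pixels / 255.0 * 2.0 - 1.0
print(normalized)  # [-1.  ~-0.498  0.  1.]

# Inverse: [-1, 1] -> [0, 255], applied after VAE decoding.
recovered = (normalized + 1.0) / 2.0 * 255.0
```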
Frame sampling strategies trade temporal coverage against computational cost:
- Uniform stepping: Samples every `k`-th frame, providing even temporal coverage.
- Start/end skipping: Removes potentially uninformative frames (e.g., title cards, fade-outs).
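Both strategies reduce to simple index arithmetic over the decoded frame range; a sketch using the parameter names from the description above (the helper itself is illustrative):

```python
def sample_indices(total_frames: int,
                   frame_sample_step: int,
                   skip_frames_start: int,
                   skip_frames_end: int) -> list[int]:
    # Drop skipped frames at both ends, then take every frame_sample_step-th index.
    return list(range(skip_frames_start,
                      total_frames - skip_frames_end,
                      frame_sample_step))

print(sample_indices(20, 3, 2, 2))  # [2, 5, 8, 11, 14, 17]
```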
Related Pages
- Implementation:Zai_org_CogVideo_Get_Video_Frames -- Implementation of video loading and preprocessing
- Zai_org_CogVideo_Video_Encoding -- Next step: encoding preprocessed frames to latent space
- Zai_org_CogVideo_DDIM_Pipeline_Loading -- Pipeline loading that provides the VAE for encoding