
Principle:Zai org CogVideo Video Loading and Preprocessing

From Leeroopedia


Principle Name: Video Loading and Preprocessing
Workflow: Video Editing DDIM Inversion
Step: 1 of 6
Type: Data Input
Repository: zai-org/CogVideo
Paper: CogVideoX
Last Updated: 2026-02-10 00:00 GMT

Overview

Technique for loading, resizing, and normalizing video frames for input to a diffusion model's VAE encoder. Video preprocessing transforms raw video files into a standardized tensor format compatible with the CogVideoX pipeline.

Description

Video preprocessing operates through a sequence of transformations:

  1. Frame loading: Video frames are loaded from file using the decord library, which provides efficient GPU-accelerated video decoding.
  2. Frame sampling: A subset of frames is selected to match the target frame count. Supports uniform stepping via frame_sample_step and start/end frame skipping via skip_frames_start and skip_frames_end.
  3. Resizing: Frames are resized to the target resolution (width x height) using bicubic interpolation.
  4. Normalization: Pixel values are converted from [0, 255] integer range to [-1, 1] floating-point range, matching the distribution expected by the VAE encoder.

The frame count is constrained by the VAE's temporal compression: the number of frames F must satisfy (F mod 4) == 1 (e.g., 1, 5, 9, 13, ..., 81).
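The four steps above can be sketched end to end. This is a minimal illustration, not the repository's actual code: the function name preprocess_frames, the default max_frames, and the trimming behavior are assumptions, and decord-based loading is reduced to a comment since the sketch operates on an already-decoded uint8 tensor.

```python
import torch
import torch.nn.functional as F

def preprocess_frames(frames, height=480, width=720,
                      frame_sample_step=None, skip_frames_start=0,
                      skip_frames_end=0, max_frames=49):
    """frames: uint8 tensor [T, H, W, C], e.g. from decord's VideoReader.get_batch().

    Returns a float tensor [T', C, height, width] in [-1, 1] (modulo bicubic
    overshoot), with T' satisfying (T' mod 4) == 1.
    """
    # Step 2: frame sampling -- drop head/tail frames, then step uniformly.
    end = frames.shape[0] - skip_frames_end
    frames = frames[skip_frames_start:end]
    if frame_sample_step is None:
        frame_sample_step = max(1, frames.shape[0] // max_frames)
    frames = frames[::frame_sample_step][:max_frames]
    # Enforce the VAE temporal constraint: trim to the largest F with F mod 4 == 1.
    num = frames.shape[0]
    if num % 4 != 1:
        frames = frames[:(num - 1) // 4 * 4 + 1]
    # Step 3: resize with bicubic interpolation (expects float [T, C, H, W]).
    frames = frames.permute(0, 3, 1, 2).float()
    frames = F.interpolate(frames, size=(height, width),
                           mode="bicubic", align_corners=False)
    # Step 4: normalize from [0, 255] to [-1, 1].
    return frames / 255.0 * 2.0 - 1.0
```

Note that bicubic interpolation can slightly overshoot the input range, so values marginally outside [-1, 1] are possible after resizing.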

Usage

Use Video Loading and Preprocessing as the first step of the DDIM inversion pipeline, before encoding frames to latent space. The output tensor is passed directly to the VAE encoder.

Theoretical Basis

The VAE's temporal compression requires the input frame count F to satisfy (F mod 4) == 1. This constraint arises from the 3D VAE architecture, which performs 4x temporal downsampling via strided convolutions; the +1 accounts for the temporal boundary handling in the encoder.
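An arbitrary frame count can be rounded down to the nearest valid value with a one-line helper (the function name is illustrative, not from the repository):

```python
def nearest_valid_frame_count(num_frames: int) -> int:
    # Largest F <= num_frames with (F mod 4) == 1: an integer number of
    # 4x-downsampled temporal groups, plus the one boundary frame.
    return (num_frames - 1) // 4 * 4 + 1
```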

Normalization to [-1, 1] matches the training distribution expected by the VAE encoder:

x_normalized = (x / 255.0) * 2.0 - 1.0

This centering around zero is standard for diffusion model VAEs, as it aligns with the Gaussian prior assumption in the latent space.
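As a quick sanity check of the mapping (plain Python; denormalize is the inverse applied when converting decoded frames back to pixels, and both helper names are illustrative rather than repository code):

```python
def normalize(x: float) -> float:
    # [0, 255] -> [-1, 1]
    return x / 255.0 * 2.0 - 1.0

def denormalize(x: float) -> float:
    # [-1, 1] -> [0, 255], the exact inverse of normalize
    return (x + 1.0) / 2.0 * 255.0
```

Black (0) maps to -1.0, mid-gray (127.5) to 0.0, and white (255) to 1.0.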

Frame sampling strategies trade off between temporal coverage and computational cost:

  • Uniform stepping: Samples every k-th frame, providing even temporal coverage.
  • Start/end skipping: Removes potentially uninformative frames (e.g., title cards, fade-outs).
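Both strategies reduce to computing a list of source-frame indices. A minimal combined sketch, using the parameter names from the Description above (the function name itself is assumed):

```python
def sample_frame_indices(total_frames: int, frame_sample_step: int = 1,
                         skip_frames_start: int = 0,
                         skip_frames_end: int = 0) -> list:
    # Trim the head and tail, then take every k-th remaining frame.
    end = total_frames - skip_frames_end
    return list(range(skip_frames_start, end, frame_sample_step))
```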
