
Principle:Zai org CogVideo Video Encoding

From Leeroopedia


Principle Name: Video Encoding
Workflow: Video Editing DDIM Inversion
Step: 3 of 6
Type: Feature Extraction
Repository: zai-org/CogVideo
Paper: CogVideoX
Last Updated: 2026-02-10 00:00 GMT

Overview

A technique for encoding video frames into a compact latent representation using a 3D Variational Autoencoder (VAE). Video encoding compresses pixel-space frames into a lower-dimensional latent space suitable for diffusion processing.

Description

The VAE encoder compresses video frames from pixel space [F, C, H, W] into latent space [B, T, C', H', W'], where T < F and C' is the latent channel dimension, with:

  • Spatial downsampling: Factor of 8 in both height and width dimensions
  • Temporal compression: Reduces the number of temporal frames
  • Channel expansion: Increases channel count to the latent dimension (typically 16)

The latent representation captures the essential structure of the video in a lower-dimensional space. The scaling factor normalizes latent magnitudes to ensure compatibility with the diffusion model's noise schedule.
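The compression factors above can be checked with simple arithmetic. This is a sketch under assumed CogVideoX-style factors (spatial 8x, temporal 4x with the first frame kept, 16 latent channels); the exact temporal formula is an assumption, not taken from the repository:

```python
# Compression arithmetic for one video (assumed CogVideoX-style factors).
F, C, H, W = 49, 3, 480, 720      # input: frames, channels, height, width
T = (F - 1) // 4 + 1              # temporal compression (first frame kept) -> 13
Hp, Wp = H // 8, W // 8           # spatial downsampling by 8 -> 60 x 90
Cp = 16                           # channel expansion to the latent dimension
print((T, Cp, Hp, Wp))            # (13, 16, 60, 90)
```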

The encoding process:

  1. Rearrange input frames to the batch format expected by the VAE
  2. Pass through the VAE encoder to obtain the latent distribution
  3. Sample from the latent distribution (using the mean for deterministic encoding)
  4. Apply the scaling factor to normalize latent magnitudes
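The four steps above can be sketched end to end. `ToyLatentDist` and `toy_vae_encode` are stand-ins for the real 3D VAE (they only model the shape change), and the hard-coded compression factors are assumptions, not the repository's implementation:

```python
import numpy as np

class ToyLatentDist:
    """Stand-in for a VAE latent distribution (diagonal Gaussian)."""
    def __init__(self, mean, logvar):
        self.mean, self.logvar = mean, logvar

    def sample(self, rng=None):
        rng = rng or np.random.default_rng()
        return self.mean + np.exp(0.5 * self.logvar) * rng.standard_normal(self.mean.shape)

def toy_vae_encode(x):
    """Dummy encoder: models only the shape change (spatial /8, temporal /4, 16 ch)."""
    B, C, F, H, W = x.shape
    T, Cp = (F - 1) // 4 + 1, 16
    mean = np.zeros((B, Cp, T, H // 8, W // 8), dtype=np.float32)
    return ToyLatentDist(mean, np.zeros_like(mean))

def encode_video(frames, scale_factor=1.0, deterministic=True):
    # 1. Rearrange input frames [F, C, H, W] -> [B, C, F, H, W] batch format
    x = frames[None].transpose(0, 2, 1, 3, 4)
    # 2. Pass through the (toy) VAE encoder to obtain the latent distribution
    dist = toy_vae_encode(x)
    # 3. Sample from the distribution (use the mean for deterministic encoding)
    z = dist.mean if deterministic else dist.sample()
    # 4. Apply the scaling factor to normalize latent magnitudes
    return scale_factor * z

frames = np.zeros((49, 3, 480, 720), dtype=np.float32)
latents = encode_video(frames)
print(latents.shape)  # (1, 16, 13, 60, 90)
```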

Usage

Use Video Encoding after video preprocessing and before DDIM inversion. The encoded latents serve as the starting point for the inversion process, which maps them to noise space.

Theoretical Basis

VAE encoding follows the variational inference framework:

Encoding:

z = scale_factor * VAE.encode(x).latent_dist.sample()

The latent space is regularized during VAE training to be approximately Gaussian, which enables diffusion model training in this space. The KL divergence regularization term ensures that the encoder outputs lie in a well-structured latent space:

L_VAE = L_reconstruction + beta * KL(q(z|x) || p(z))

where p(z) = N(0, I) is the standard Gaussian prior.
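For a diagonal Gaussian encoder q(z|x) = N(mu, sigma^2) and the standard Gaussian prior, the KL term has a closed form; a small numeric check (the function name is illustrative):

```python
import numpy as np

def kl_to_standard_normal(mean, logvar):
    # KL(N(mean, exp(logvar)) || N(0, I)), summed over latent dimensions
    return -0.5 * np.sum(1.0 + logvar - mean**2 - np.exp(logvar))

# A standard-normal posterior matches the prior exactly, so the KL is zero;
# shifting the mean away from zero increases the divergence.
kl_zero = kl_to_standard_normal(np.zeros(16), np.zeros(16))
kl_shifted = kl_to_standard_normal(np.full(16, 2.0), np.zeros(16))
print(kl_zero, kl_shifted)  # 0.0 32.0
```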

The scale_factor is computed during VAE training as the inverse of the standard deviation of latent activations on the training set. This normalization ensures that latent values have approximately unit variance, which is important for the diffusion model's noise schedule to work correctly.
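The normalization described above can be sketched with toy data; the randomly generated `latents` stand in for activations collected over a real training set:

```python
import numpy as np

# Estimate scale_factor as the inverse standard deviation of latent
# activations over a (toy, randomly generated) training set.
rng = np.random.default_rng(0)
latents = rng.normal(loc=0.0, scale=2.5, size=(1000, 16))
scale_factor = 1.0 / latents.std()
normalized = scale_factor * latents
print(round(normalized.std(), 6))  # 1.0 -- approximately unit variance
```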
