Principle:Zai org CogVideo Video Encoding
| Attribute | Value |
|---|---|
| Principle Name | Video Encoding |
| Workflow | Video Editing DDIM Inversion |
| Step | 3 of 6 |
| Type | Feature Extraction |
| Repository | zai-org/CogVideo |
| Paper | CogVideoX |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Technique for encoding video frames into a compact latent representation using a 3D Variational Autoencoder. Video encoding compresses pixel-space frames into a lower-dimensional latent space suitable for diffusion processing.
Description
The 3D VAE encoder compresses video frames from pixel space [F, C, H, W] (first rearranged into the batched layout the VAE expects) into latent space [B, T, C', H', W'] with:
- Spatial downsampling: Factor of 8 in both height and width dimensions
- Temporal compression: Reduces the number of frames along the time axis (a factor of 4 in CogVideoX, with the first frame handled separately)
- Channel expansion: Increases channel count to the latent dimension (typically 16)
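The shape arithmetic implied by these factors can be sketched as follows. The values 8 (spatial), 4 (temporal), and 16 (latent channels) are the typical CogVideoX settings, and the `1 + (F - 1) // factor` rounding for the temporal axis is an assumption about how the first frame is handled; confirm both against the specific VAE config in use:

```python
def latent_shape(num_frames, height, width,
                 spatial_factor=8, temporal_factor=4, latent_channels=16):
    """Compute the latent shape [B, T, C', H', W'] produced by the 3D VAE
    for a pixel-space clip of shape [F, 3, H, W] (batch size 1).

    CogVideoX-style VAEs keep the first frame and compress the remaining
    frames temporally, hence the 1 + (F - 1) // factor form (an assumption
    about the exact rounding; check the VAE in use).
    """
    t = 1 + (num_frames - 1) // temporal_factor
    return (1, t, latent_channels,
            height // spatial_factor, width // spatial_factor)

# e.g. a 49-frame 480x720 clip
print(latent_shape(49, 480, 720))  # (1, 13, 16, 60, 90)
```

Note how a 49-frame clip compresses to 13 latent frames rather than a clean 49/4: the first frame occupies one latent frame on its own.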
The latent representation captures the essential structure of the video in a lower-dimensional space. The scaling factor normalizes latent magnitudes to ensure compatibility with the diffusion model's noise schedule.
The encoding process:
- Rearrange input frames to the batch format expected by the VAE
- Pass through the VAE encoder to obtain the latent distribution
- Sample from the latent distribution (using the mean for deterministic encoding)
- Apply the scaling factor to normalize latent magnitudes
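The four steps above can be sketched with numpy stand-ins. `ToyEncoder`, its diagonal-Gaussian output, and the scale factor value are illustrative assumptions for showing the data flow, not the repository's implementation (a real encoder is a learned network that also compresses time and expands channels):

```python
import numpy as np

class LatentDist:
    """Diagonal Gaussian returned by the encoder; mimics a `latent_dist`
    object with a mean and a .sample() method (purely illustrative)."""
    def __init__(self, mean, logvar):
        self.mean, self.logvar = mean, logvar

    def sample(self, rng):
        std = np.exp(0.5 * self.logvar)
        return self.mean + std * rng.standard_normal(self.mean.shape)

class ToyEncoder:
    """Stand-in for the 3D VAE encoder: average-pools 8x spatially and
    returns a distribution with zero log-variance."""
    def encode(self, x):  # x: [B, C, F, H, W]
        b, c, f, h, w = x.shape
        pooled = x.reshape(b, c, f, h // 8, 8, w // 8, 8).mean(axis=(4, 6))
        return LatentDist(mean=pooled, logvar=np.zeros_like(pooled))

def encode_video(frames, vae, scale_factor, rng, deterministic=True):
    # 1. Rearrange [F, C, H, W] -> [B, C, F, H, W] batch layout
    x = frames.transpose(1, 0, 2, 3)[None]
    # 2. Pass through the encoder to obtain the latent distribution
    dist = vae.encode(x)
    # 3. Sample (or take the mean for deterministic encoding)
    z = dist.mean if deterministic else dist.sample(rng)
    # 4. Apply the scaling factor to normalize latent magnitudes
    return scale_factor * z

rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 3, 64, 64)).astype(np.float32)
z = encode_video(frames, ToyEncoder(), scale_factor=0.7, rng=rng)
print(z.shape)  # (1, 3, 8, 8, 8)
```

Using the distribution mean (`deterministic=True`) matters for DDIM inversion: a stochastic sample would add noise that the inversion cannot account for.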
Usage
Use Video Encoding after video preprocessing and before DDIM inversion. The encoded latents serve as the starting point for the inversion process, which maps them to noise space.
Theoretical Basis
VAE encoding follows the variational inference framework:
Encoding:
z = scale_factor * VAE.encode(x).latent_dist.sample()
The latent space is regularized during VAE training to be approximately Gaussian, which enables diffusion model training in this space. The KL divergence regularization term ensures that the encoder outputs lie in a well-structured latent space:
L_VAE = L_reconstruction + beta * KL(q(z|x) || p(z))
where p(z) = N(0, I) is the standard Gaussian prior.
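For a diagonal Gaussian posterior q(z|x) = N(mu, diag(sigma^2)) against the standard normal prior, the KL term has a closed form that can be checked numerically. This is a generic sketch of the standard formula, not repository code:

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims:
    0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1)."""
    return 0.5 * np.sum(mu**2 + np.exp(logvar) - logvar - 1.0)

# KL is zero exactly when q already matches the prior...
print(kl_to_standard_normal(np.zeros(16), np.zeros(16)))  # 0.0
# ...and grows as the posterior drifts away from N(0, I)
print(kl_to_standard_normal(np.ones(16), np.zeros(16)))   # 8.0
```

During VAE training this term pulls the encoder outputs toward the prior, which is what makes the latent space well-structured enough for diffusion.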
The scale_factor is computed during VAE training as the inverse of the standard deviation of latent activations on the training set. This normalization ensures that latent values have approximately unit variance, which is important for the diffusion model's noise schedule to work correctly.
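That computation amounts to a single statistic over encoded training latents, sketched below with numpy (the exact estimation procedure in the repository may differ):

```python
import numpy as np

def compute_scale_factor(latents):
    """scale_factor = 1 / std of latent activations, so that the scaled
    latents (scale_factor * z) have approximately unit variance."""
    return 1.0 / np.std(latents)

rng = np.random.default_rng(0)
latents = 5.0 * rng.standard_normal((1000, 16))  # latents with std ~5
sf = compute_scale_factor(latents)
print(round(float(np.std(sf * latents)), 3))  # 1.0
```

With unit-variance latents, adding noise per the diffusion schedule mixes signal and noise in the proportions the schedule was designed for; an unscaled latent with std 5 would be far too dominant at early timesteps.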
Related Pages
- Implementation:Zai_org_CogVideo_Encode_Video_Frames -- Implementation of video encoding
- Zai_org_CogVideo_Video_Loading_and_Preprocessing -- Previous step: video preprocessing that produces input frames
- Zai_org_CogVideo_DDIM_Inversion -- Next step: inverting the encoded latents to noise space
- Zai_org_CogVideo_DDIM_Video_Export -- Decoding step that inverts the encoding