Principle:Zai org CogVideo Video Encoding
| Attribute | Value |
|---|---|
| Principle Name | Video Encoding |
| Workflow | Video Editing DDIM Inversion |
| Step | 3 of 6 |
| Type | Feature Extraction |
| Repository | zai-org/CogVideo |
| Paper | CogVideoX |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Technique for encoding video frames into a compact latent representation using a 3D Variational Autoencoder. Video encoding compresses pixel-space frames into a lower-dimensional latent space suitable for diffusion processing.
Description
The 3D VAE encoder compresses video frames from pixel space [F, C, H, W] (first rearranged into the batched layout the VAE expects) into latent space [B, T, C', H', W'] with:
- Spatial downsampling: Factor of 8 in both height and width dimensions
- Temporal compression: Reduces the number of frames along the time axis (a factor of 4 in CogVideoX, with the first frame handled separately)
- Channel expansion: Increases channel count to the latent dimension (typically 16)
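The shape arithmetic implied by these factors can be sketched as follows. The values 8 (spatial), 4 (temporal), and 16 (latent channels) are the typical CogVideoX settings, and the `1 + (F - 1) // factor` rounding for the temporal axis is an assumption about how the first frame is handled; confirm both against the specific VAE config in use:

```python
def latent_shape(num_frames, height, width,
                 spatial_factor=8, temporal_factor=4, latent_channels=16):
    """Compute the latent shape [B, T, C', H', W'] produced by the 3D VAE
    for a pixel-space clip of shape [F, 3, H, W] (batch size 1).

    CogVideoX-style VAEs keep the first frame and compress the remaining
    frames temporally, hence the 1 + (F - 1) // factor form (an assumption
    about the exact rounding; check the VAE in use).
    """
    t = 1 + (num_frames - 1) // temporal_factor
    return (1, t, latent_channels,
            height // spatial_factor, width // spatial_factor)

# e.g. a 49-frame 480x720 clip
print(latent_shape(49, 480, 720))  # (1, 13, 16, 60, 90)
```

Note how a 49-frame clip compresses to 13 latent frames rather than a clean 49/4: the first frame occupies one latent frame on its own.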
The latent representation captures the essential structure of the video in a lower-dimensional space. The scaling factor normalizes latent magnitudes to ensure compatibility with the diffusion model's noise schedule.
The encoding process:
- Rearrange input frames to the batch format expected by the VAE
- Pass through the VAE encoder to obtain the latent distribution
- Sample from the latent distribution (using the mean for deterministic encoding)
- Apply the scaling factor to normalize latent magnitudes
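The four steps above can be sketched with numpy stand-ins. `ToyEncoder`, its diagonal-Gaussian output, and the scale factor value are illustrative assumptions for showing the data flow, not the repository's implementation (a real encoder is a learned network that also compresses time and expands channels):

```python
import numpy as np

class LatentDist:
    """Diagonal Gaussian returned by the encoder; mimics a `latent_dist`
    object with a mean and a .sample() method (purely illustrative)."""
    def __init__(self, mean, logvar):
        self.mean, self.logvar = mean, logvar

    def sample(self, rng):
        std = np.exp(0.5 * self.logvar)
        return self.mean + std * rng.standard_normal(self.mean.shape)

class ToyEncoder:
    """Stand-in for the 3D VAE encoder: average-pools 8x spatially and
    returns a distribution with zero log-variance."""
    def encode(self, x):  # x: [B, C, F, H, W]
        b, c, f, h, w = x.shape
        pooled = x.reshape(b, c, f, h // 8, 8, w // 8, 8).mean(axis=(4, 6))
        return LatentDist(mean=pooled, logvar=np.zeros_like(pooled))

def encode_video(frames, vae, scale_factor, rng, deterministic=True):
    # 1. Rearrange [F, C, H, W] -> [B, C, F, H, W] batch layout
    x = frames.transpose(1, 0, 2, 3)[None]
    # 2. Pass through the encoder to obtain the latent distribution
    dist = vae.encode(x)
    # 3. Sample (or take the mean for deterministic encoding)
    z = dist.mean if deterministic else dist.sample(rng)
    # 4. Apply the scaling factor to normalize latent magnitudes
    return scale_factor * z

rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 3, 64, 64)).astype(np.float32)
z = encode_video(frames, ToyEncoder(), scale_factor=0.7, rng=rng)
print(z.shape)  # (1, 3, 8, 8, 8)
```

Using the distribution mean (`deterministic=True`) matters for DDIM inversion: a stochastic sample would add noise that the inversion cannot account for.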
Usage
Use Video Encoding after video preprocessing and before DDIM inversion. The encoded latents serve as the starting point for the inversion process, which maps them to noise space.
Theoretical Basis
VAE encoding follows the variational inference framework:
Encoding:
z = scale_factor * VAE.encode(x).latent_dist.sample()
The latent space is regularized during VAE training to be approximately Gaussian, which enables diffusion model training in this space. The KL divergence regularization term ensures that the encoder outputs lie in a well-structured latent space:
L_VAE = L_reconstruction + beta * KL(q(z|x) || p(z))
where p(z) = N(0, I) is the standard Gaussian prior.
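For a diagonal Gaussian posterior q(z|x) = N(mu, diag(sigma^2)) against the standard normal prior, the KL term has a closed form that can be checked numerically. This is a generic sketch of the standard formula, not repository code:

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims:
    0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1)."""
    return 0.5 * np.sum(mu**2 + np.exp(logvar) - logvar - 1.0)

# KL is zero exactly when q already matches the prior...
print(kl_to_standard_normal(np.zeros(16), np.zeros(16)))  # 0.0
# ...and grows as the posterior drifts away from N(0, I)
print(kl_to_standard_normal(np.ones(16), np.zeros(16)))   # 8.0
```

During VAE training this term pulls the encoder outputs toward the prior, which is what makes the latent space well-structured enough for diffusion.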
The scale_factor is computed during VAE training as the inverse of the standard deviation of latent activations on the training set. This normalization ensures that latent values have approximately unit variance, which is important for the diffusion model's noise schedule to work correctly.
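That computation amounts to a single statistic over encoded training latents, sketched below with numpy (the exact estimation procedure in the repository may differ):

```python
import numpy as np

def compute_scale_factor(latents):
    """scale_factor = 1 / std of latent activations, so that the scaled
    latents (scale_factor * z) have approximately unit variance."""
    return 1.0 / np.std(latents)

rng = np.random.default_rng(0)
latents = 5.0 * rng.standard_normal((1000, 16))  # latents with std ~5
sf = compute_scale_factor(latents)
print(round(float(np.std(sf * latents)), 3))  # 1.0
```

With unit-variance latents, adding noise per the diffusion schedule mixes signal and noise in the proportions the schedule was designed for; an unscaled latent with std 5 would be far too dominant at early timesteps.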
Related Pages
- Implementation:Zai_org_CogVideo_Encode_Video_Frames -- Implementation of video encoding
- Zai_org_CogVideo_Video_Loading_and_Preprocessing -- Previous step: video preprocessing that produces input frames
- Zai_org_CogVideo_DDIM_Inversion -- Next step: inverting the encoded latents to noise space
- Zai_org_CogVideo_DDIM_Video_Export -- Decoding step that inverts the encoding