Principle:Zai org CogVideo 3D Video Encoding

Knowledge Sources	Neural Discrete Representation Learning; Denoising Diffusion Probabilistic Models
Domains	Video_Generation, Autoencoding, Representation_Learning
Last Updated	2026-02-10 00:00 GMT

Overview

3D video encoding compresses spatiotemporal video data into a compact latent representation by jointly downsampling spatial and temporal dimensions through hierarchical convolutional networks with causal temporal constraints.

Description

Video encoding for generative models must address two distinct axes of redundancy: spatial redundancy within each frame and temporal redundancy across frames. A 3D video encoder tackles both simultaneously by operating on 5D tensors (batch, channels, time, height, width) using volumetric convolutions.

The causal constraint is a critical design principle in video encoding for generation tasks. Unlike bidirectional models that can look at future frames, a causal encoder ensures that the latent representation at any temporal position depends only on the current and preceding frames. This is achieved through asymmetric temporal padding: padding is applied only on the past side of the time axis, so convolution kernels never access future information. This property aligns the encoder with autoregressive generation paradigms where frames are produced sequentially.

Temporal compression is typically applied at the earliest resolution levels of the encoder hierarchy, where feature maps are at their largest spatial size. The first frame is treated specially and never temporally pooled, preserving a clean reference for the start of the video sequence. Subsequent frames are downsampled using average pooling along the time axis.

Spatial compression follows the same multi-resolution design used in image autoencoders: each resolution level applies residual blocks followed by strided convolutions or pooling to halve the spatial dimensions. Self-attention layers may be inserted at specific resolutions to capture long-range spatial dependencies within each frame.

Usage

Apply 3D video encoding when building latent-space generative models for video. It is essential when the downstream diffusion or autoregressive model operates in a compressed latent space rather than directly on pixel values, reducing computational cost while preserving spatiotemporal structure.

Theoretical Basis

The encoding process can be described as a mapping:

E: R^{B x C x T x H x W} -> R^{B x D x T' x H' x W'}

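where H' = H / s_compress, W' = W / s_compress, and D is the latent channel dimensionality. To a first approximation T' = T / t_compress. Because the first frame is kept out of temporal pooling (see the temporal downsampling rule below), an input with T = 1 + k * t_compress frames yields T' = 1 + k latent frames.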
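
The shape arithmetic of this mapping can be sketched in a few lines of Python. This is an illustrative helper, not code from the CogVideo repository: the function name latent_shape, the keep_first_frame flag, and the example sizes (49 frames at 480 x 720, t_compress = 4, s_compress = 8, D = 16) are assumptions chosen for the example. With keep_first_frame=True it applies the first-frame rule from the temporal downsampling step; with False it uses the plain ratio T / t_compress.

```python
def latent_shape(shape, latent_channels, t_compress, s_compress, keep_first_frame=True):
    """Return the latent shape (B, D, T', H', W') for an input (B, C, T, H, W)."""
    b, _c, t, h, w = shape
    if h % s_compress or w % s_compress:
        raise ValueError("H and W must be divisible by s_compress")
    if keep_first_frame:
        # The first frame is never pooled; only the remaining T - 1 frames shrink.
        if (t - 1) % t_compress:
            raise ValueError("T - 1 must be divisible by t_compress")
        t_out = 1 + (t - 1) // t_compress
    else:
        # Plain ratio T' = T / t_compress.
        if t % t_compress:
            raise ValueError("T must be divisible by t_compress")
        t_out = t // t_compress
    return (b, latent_channels, t_out, h // s_compress, w // s_compress)


# Hypothetical sizes, chosen only to make the arithmetic concrete.
example = latent_shape((1, 3, 49, 480, 720), latent_channels=16,
                       t_compress=4, s_compress=8)
print(example)  # (1, 16, 13, 60, 90)
```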

The causal convolution for a 3D kernel of temporal size k_t pads the input by (k_t - 1) on the past side and 0 on the future side:

CausalPad(x, k_t) = Pad(x, left_t=(k_t - 1), right_t=0)
Conv3D(CausalPad(x))
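
To make the causality property concrete, the sketch below reduces the problem to the time axis only: a 1D series stands in for the video and a list of weights stands in for the temporal slice of the 3D kernel. Zero padding is an assumption made for simplicity (an implementation might pad with a copy of the first frame instead), and causal_conv1d is a hypothetical name.

```python
def causal_conv1d(x, kernel):
    """Convolve along time so that y[t] depends only on x[t - k_t + 1 .. t]."""
    k_t = len(kernel)
    # Pad (k_t - 1) zeros on the past side and nothing on the future side.
    padded = [0.0] * (k_t - 1) + list(x)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k_t))
        for t in range(len(x))
    ]


x = [1.0, 2.0, 3.0, 4.0]
kernel = [0.25, 0.25, 0.5]  # hypothetical weights; the last tap sees the current frame
y = causal_conv1d(x, kernel)

# Changing only the last (future) frame must leave all earlier outputs unchanged.
x_changed = [1.0, 2.0, 3.0, 100.0]
y_changed = causal_conv1d(x_changed, kernel)
print(y[:3] == y_changed[:3])  # True
```

The output has the same length as the input, which is why the padding amount is exactly k_t - 1 for a stride of 1.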

Temporal downsampling preserves the first frame:

x_first = x[:, :, 0:1, :, :]
x_rest  = AvgPool1D(x[:, :, 1:, :, :], kernel=2, stride=2)
x_down  = Concat([x_first, x_rest], dim=time)
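
The same rule can be written as a small runnable sketch over plain Python lists, where each frame is a flat list of floats. The name temporal_downsample is hypothetical, and dropping an unpaired trailing frame is an assumption that mirrors the usual floor behaviour of stride-2 pooling.

```python
def temporal_downsample(frames):
    """Keep frame 0 untouched and average the remaining frames in pairs."""
    first, rest = frames[:1], frames[1:]
    pooled = [
        [(a + b) / 2.0 for a, b in zip(rest[i], rest[i + 1])]
        # Step over complete pairs only; an unpaired last frame is dropped.
        for i in range(0, len(rest) - 1, 2)
    ]
    return first + pooled


# Five one-pixel frames: T = 5 becomes T' = 1 + (5 - 1) / 2 = 3.
frames = [[0.0], [1.0], [2.0], [3.0], [4.0]]
print(temporal_downsample(frames))  # [[0.0], [1.5], [3.5]]
```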

The multi-resolution hierarchy applies num_res_blocks residual blocks per level, where each residual block follows:

h = CausalConv3D(Swish(GroupNorm(x)))
h = CausalConv3D(Swish(GroupNorm(h)))
output = x + h  (with optional 1x1 shortcut if channels differ)
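
The ordering inside the block (normalize, activate, convolve, twice, then add the skip path) can be illustrated at a single spatiotemporal position, where a convolution with kernel size 1 reduces to a matrix-vector product over channels. This is a simplified sketch under that assumption: the real block uses CausalConv3D layers with larger kernels, GroupNorm here has no learnable scale or shift, and all function names and weights are hypothetical.

```python
import math


def swish(v):
    # Swish(v) = v * sigmoid(v)
    return v / (1.0 + math.exp(-v))


def group_norm(x, num_groups, eps=1e-6):
    """Normalize each group of channels to zero mean and unit variance."""
    size = len(x) // num_groups
    out = []
    for g in range(num_groups):
        group = x[g * size:(g + 1) * size]
        mean = sum(group) / size
        var = sum((v - mean) ** 2 for v in group) / size
        out.extend((v - mean) / math.sqrt(var + eps) for v in group)
    return out


def linear(weight, x):
    # Stand-in for a kernel-size-1 convolution at one position.
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def res_block(x, w1, w2, num_groups, shortcut=None):
    h = linear(w1, [swish(v) for v in group_norm(x, num_groups)])
    h = linear(w2, [swish(v) for v in group_norm(h, num_groups)])
    # Optional 1x1 shortcut when input and output channel counts differ.
    skip = linear(shortcut, x) if shortcut is not None else x
    return [s + v for s, v in zip(skip, h)]


x = [1.0, -2.0, 0.5, 3.0]
zeros = [[0.0] * 4 for _ in range(4)]
# With all-zero weights the residual branch vanishes and the block is the identity.
print(res_block(x, zeros, zeros, num_groups=2) == x)  # True
```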

The encoder head applies GroupNorm, a Swish activation, and a final CausalConv3D that projects to 2 * z_channels when the output is split into a mean and a log-variance for reparameterization, or to z_channels when the latent is used directly.
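
For the 2 * z_channels case, the reparameterization step can be sketched as follows for a single latent position. The channel layout (mean in the first half, log-variance in the second) is an assumption made for illustration, and sample_latent is a hypothetical name.

```python
import math
import random


def sample_latent(moments, z_channels, rng=random):
    """Split 2 * z_channels values into mean and log-variance, then sample z."""
    if len(moments) != 2 * z_channels:
        raise ValueError("expected 2 * z_channels values")
    # Assumed layout: first half is the mean, second half is the log-variance.
    mean, logvar = moments[:z_channels], moments[z_channels:]
    # z = mean + std * eps, with std = exp(0.5 * logvar) and eps ~ N(0, 1)
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, logvar)
    ]


rng = random.Random(0)  # seeded only to make the example repeatable
moments = [0.5, -1.0, 0.0, 0.0]  # z_channels = 2: means, then log-variances
z = sample_latent(moments, z_channels=2, rng=rng)
print(len(z))  # 2
```

When the log-variance is very negative the standard deviation tends to zero and the sample collapses onto the mean, which is the deterministic limit of this parameterization.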

Related Pages

Implementation:Zai_org_CogVideo_MoVQ_Encoder3D
