Principle: Video Tokenization (Zai org / CogVideo)
| Knowledge Sources | |
|---|---|
| Domains | Video_Generation, Representation_Learning, Discrete_Tokenization |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Video tokenization is the process of compressing a continuous video signal into a compact sequence of discrete tokens through an encoder-quantizer-decoder architecture, enabling downstream generation models to operate in a tractable discrete latent space.
Description
Raw video data is extremely high-dimensional: a short clip of 16 frames at 256x256 resolution with 3 color channels contains over 3 million values. Operating directly on this space is computationally prohibitive for generation models. Video tokenization addresses this by learning a compressed, discrete representation.
The core architecture follows an encoder-quantizer-decoder pattern:
Encoder: A neural network (typically using 3D convolutions) progressively downsamples the video in both spatial and temporal dimensions. Each downsampling step reduces the resolution while increasing the channel dimension, capturing increasingly abstract features. Causal temporal convolutions are often used to ensure that the encoding of each frame only depends on current and past frames, not future ones. This causal structure is important for autoregressive generation.
Quantizer: The continuous encoder output is mapped to discrete tokens. This discretization step is what transforms the continuous latent representation into a sequence of tokens from a finite vocabulary. Different quantization strategies exist:
- Vector Quantization (VQ): Each latent vector is mapped to the nearest entry in a learned codebook.
- Lookup-Free Quantization (LFQ): Each dimension is independently thresholded to binary values, avoiding explicit codebook lookup.
- Finite Scalar Quantization (FSQ): Each dimension is quantized to a fixed set of levels.
Decoder: A mirror of the encoder that progressively upsamples the quantized representation back to pixel space, using transposed convolutions or depth-to-space operations.
Training Objectives: The tokenizer is trained to minimize reconstruction error between the original and decoded video, with additional losses for:
- Perceptual quality (via feature-matching with pre-trained networks)
- Adversarial sharpness (via discriminator networks)
- Codebook utilization (via entropy regularization)
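In practice these objectives are combined as a weighted sum. The sketch below illustrates this; the weight values are assumptions chosen for illustration, not values from the source:

```python
def tokenizer_loss(l_recon, l_perceptual, l_adversarial, l_entropy,
                   w_recon=1.0, w_perc=0.1, w_adv=0.05, w_ent=0.01):
    """Weighted sum of the tokenizer training losses.

    The weights are illustrative hyperparameters; real systems tune
    them (and often ramp the adversarial term in late in training).
    """
    return (w_recon * l_recon
            + w_perc * l_perceptual
            + w_adv * l_adversarial
            + w_ent * l_entropy)
```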
Key design choices include the compression ratio (how many pixels are represented by each token), the codebook size (vocabulary size of the discrete space), and whether to use separate first-frame encoding (treating the initial frame differently for better temporal modeling).
Once trained, the tokenizer converts video to a compact token sequence that is orders of magnitude smaller than the original pixel representation, making it feasible for autoregressive transformers or diffusion models to generate video by predicting token sequences.
Usage
Apply video tokenization as the first stage in any two-stage video generation pipeline. The tokenizer is trained once and then frozen while a second-stage model (autoregressive transformer, diffusion model, or language model) learns to generate in the discrete token space.
Theoretical Basis
Spatial and temporal compression:
Given a video V of shape (C, T, H, W), the encoder produces a latent representation Z of shape (D, T/r_t, H/r_s, W/r_s), where r_t and r_s are the temporal and spatial downsampling factors respectively. The total compression ratio is:
compression = (C * T * H * W) / (D * (T/r_t) * (H/r_s) * (W/r_s))
= (C * r_t * r_s^2) / D
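Plugging in the 16-frame, 256x256, 3-channel clip from the Description makes the formula concrete. The latent channel dimension D and the downsampling factors r_t, r_s below are illustrative assumptions:

```python
# Compression ratio for the example clip from the text.
C, T, H, W = 3, 16, 256, 256   # channels, frames, height, width
D, r_t, r_s = 16, 4, 8         # assumed latent dim and downsampling factors

pixels = C * T * H * W                               # 3,145,728 raw values
latents = D * (T // r_t) * (H // r_s) * (W // r_s)   # 65,536 latent values
ratio = pixels / latents

# Matches the closed form (C * r_t * r_s^2) / D derived above.
assert ratio == (C * r_t * r_s ** 2) / D
print(pixels, latents, ratio)  # 3145728 65536 48.0
```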
Causal temporal convolution:
For a 1D convolution with kernel size k and stride s, causal padding ensures only past context is used:
padding = (k - 1, 0)                 (left-pad only, no right-pad)
output_t = conv(input[t-k+1 : t+1])
This extends to 3D by applying causal padding only along the time dimension while using symmetric padding for spatial dimensions.
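A minimal NumPy sketch of causal padding in 1D (the kernel values are arbitrary for illustration); the same left-pad-only scheme is applied along the time axis of a 3D convolution:

```python
import numpy as np

def causal_conv1d(x, kernel):
    """1D causal convolution: left-pad with k-1 zeros so that output[t]
    depends only on input[t-k+1 : t+1] (present and past samples)."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([padded[t:t + k] @ kernel for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([0.5, 0.5])   # averages previous and current sample
y = causal_conv1d(x, kernel)
# y[0] sees only x[0] plus zero padding: 0.5*0 + 0.5*1 = 0.5
```

Changing a future input leaves all earlier outputs untouched, which is the property autoregressive generation relies on.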
Discrete bottleneck:
The quantizer maps continuous encoder output z to discrete code c:
c = argmin_i ||z - e_i||^2    (VQ: nearest codebook entry)
c = sign(z)                   (LFQ: binary thresholding)
Because argmin and sign are not differentiable, straight-through estimation is used: the forward pass uses discrete values, but gradients pass through as if the quantization step were an identity function:
z_hat = z + stop_gradient(quantize(z) - z)
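The VQ lookup and the straight-through trick can be sketched in NumPy (the codebook and latents below are toy values; in an autodiff framework the `stop_gradient` wrapper is what makes the backward pass treat quantization as identity):

```python
import numpy as np

def vq_quantize(z, codebook):
    """VQ: map each latent vector to its nearest codebook entry.

    z: (N, D) continuous latents, codebook: (K, D) learned entries.
    """
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx

def straight_through(z, z_q):
    """Forward pass emits the quantized values; wrapping (z_q - z) in
    stop_gradient in an autodiff framework makes d z_hat / d z = 1."""
    z_hat = z + (z_q - z)   # numerically equals z_q
    return z_hat

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])  # toy 2-entry codebook
z = np.array([[0.2, -0.1], [0.9, 1.2]])
z_q, idx = vq_quantize(z, codebook)            # idx -> [0, 1]
```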
Reconstruction objective:
L_recon = ||V - Decode(Quantize(Encode(V)))||^2
The total training loss combines reconstruction with perceptual, adversarial, and quantizer auxiliary losses as described in the related Video Reconstruction Loss and Perceptual Adversarial Loss principles.
Information capacity:
With a codebook of size K and a latent token grid of shape (T', H', W'), the tokenizer can represent:
K^(T' * H' * W') distinct token sequences, and hence at most that many distinct reconstructed videos
The effective bits per token is log2(K). For example, a codebook size of 1024 provides 10 bits per token.
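A quick check of the arithmetic; the 4 x 32 x 32 token grid is an illustrative assumption:

```python
import math

K = 1024
bits_per_token = math.log2(K)        # 10.0 bits, as stated above

T_, H_, W_ = 4, 32, 32               # assumed latent token grid
total_bits = bits_per_token * T_ * H_ * W_
# K**(T'*H'*W') distinct sequences == 2**total_bits
print(bits_per_token, total_bits)    # 10.0 40960.0
```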