Principle:Zai org CogVideo VQVAE Architecture

Knowledge Sources	Neural Discrete Representation Learning; Taming Transformers for High-Resolution Image Synthesis
Domains	Video_Generation, Autoencoding, Representation_Learning
Last Updated	2026-02-10 00:00 GMT

Overview

The VQ-VAE (Vector Quantized Variational Autoencoder) architecture encodes images into a discrete latent space and reconstructs them from it. It combines a multi-resolution convolutional encoder, vector quantization against a learned codebook, and a symmetric convolutional decoder, with residual blocks and self-attention in both the encoder and the decoder.

Description

A VQ-VAE consists of three main components: an encoder that maps input images to continuous latent representations, a codebook that quantizes these representations to discrete codes, and a decoder that reconstructs images from the quantized codes.

The encoder and decoder follow a hierarchical multi-resolution design inspired by the architectures from Denoising Diffusion Probabilistic Models and the taming-transformers codebase. The encoder progressively reduces spatial resolution through a series of resolution levels, each containing multiple residual blocks. Between levels, strided convolutions halve the spatial dimensions. The decoder mirrors this structure with progressive upsampling.
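The level structure can be illustrated with a small schedule calculation. This is only a sketch: the names `base_channels` and `channel_multipliers` and the example values (256-pixel input, 128 base channels, multipliers 1, 1, 2, 2, 4) are assumptions chosen for illustration, not values taken from the CogVideo configuration.

```python
def encoder_schedule(resolution, base_channels, channel_multipliers):
    """Return (level, spatial_size, channels) for each encoder level.

    The spatial size is halved between consecutive levels, so the last
    level is not followed by a downsampling step.
    """
    schedule = []
    size = resolution
    last = len(channel_multipliers) - 1
    for level, mult in enumerate(channel_multipliers):
        schedule.append((level, size, base_channels * mult))
        if level != last:
            size //= 2  # strided convolution between levels
    return schedule


# Hypothetical configuration, for illustration only.
schedule = encoder_schedule(256, 128, (1, 1, 2, 2, 4))
for level, size, channels in schedule:
    print(f"level {level}: {size}x{size}, {channels} channels")

# A mirrored decoder would walk the same levels in reverse order.
decoder_schedule = list(reversed(schedule))
```

With five levels there are four halvings, so a 256x256 input reaches the bottleneck at 16x16.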

Residual blocks are the fundamental building blocks, following the pattern: GroupNorm, Swish activation, 3x3 convolution, GroupNorm, Swish, dropout, 3x3 convolution, plus a residual shortcut connection. When input and output channel counts differ, a 1x1 convolution (or optionally a 3x3 convolution) adapts the shortcut path.

Self-attention is applied at specific resolution levels to capture long-range spatial dependencies. The attention mechanism uses 1x1 convolutions for Q, K, V projections, computes scaled dot-product attention over flattened spatial positions, and adds the result back to the input as a residual.

A middle section at the bottleneck resolution contains two residual blocks with an attention block between them, providing a processing stage at the most compressed representation before quantization.

Usage

Use this architecture when building image or video autoencoders that require discrete latent representations. The VQ-VAE encoder-decoder pair is typically trained jointly with a codebook and serves as the first stage in a two-stage generation pipeline, where a second model (e.g., a transformer or diffusion model) operates on the discrete codes.

Theoretical Basis

The encoding process maps an image to a spatial grid of continuous vectors:

z_e = Encoder(x)   # shape: (B, D, H', W')

Vector quantization replaces each spatial position with the nearest codebook entry:

k*[i,j] = argmin_k || z_e[i,j] - e_k ||^2
z_q[i,j] = e_{k*[i,j]}

where e_k are the codebook vectors.
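The nearest-neighbor lookup can be sketched in plain Python. The function name `quantize` and the toy two-dimensional codebook are assumptions for illustration; a real implementation would operate on batched tensors.

```python
def quantize(z_e, codebook):
    """Map each vector in z_e to its nearest codebook entry.

    z_e: list of D-dimensional vectors, one per spatial position.
    codebook: list of K D-dimensional vectors e_k.
    Returns (indices, z_q), where z_q[n] == codebook[indices[n]].
    """
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    indices = [
        min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
        for v in z_e
    ]
    z_q = [codebook[k] for k in indices]
    return indices, z_q


# Toy example: three codes, three encoder outputs.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
z_e = [[0.9, 1.2], [-0.2, 0.1], [-0.8, 1.7]]
indices, z_q = quantize(z_e, codebook)
print(indices)  # [1, 0, 2]
```

The integer indices are the discrete codes a second-stage model would consume; `z_q` is what the decoder receives.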

The residual block computation follows:

h = Conv_3x3(Swish(GroupNorm_32(x)))
h = Conv_3x3(Dropout(Swish(GroupNorm_32(h))))
if channels_in != channels_out:
    x = Conv_1x1(x)  # or Conv_3x3 shortcut
output = x + h
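The ordering of operations can be made concrete with a minimal pure-Python sketch. It makes several assumptions: the convolutions are passed in as arbitrary callables (`conv1`, `conv2`, `shortcut` are hypothetical parameter names), GroupNorm has no learned scale or shift, two groups are used instead of 32 to keep the example small, and dropout is omitted because it is the identity at inference time.

```python
import math


def swish(v):
    """Swish (SiLU) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def group_norm(channels, num_groups, eps=1e-6):
    """Normalize a list of channels, each a flat list of spatial values.

    Mean and variance are shared by all channels within a group.
    """
    per_group = len(channels) // num_groups
    out = []
    for g in range(num_groups):
        group = channels[g * per_group:(g + 1) * per_group]
        values = [v for ch in group for v in ch]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * scale for v in ch] for ch in group])
    return out


def residual_block(x, conv1, conv2, shortcut=None, num_groups=2):
    """Pre-activation residual block: norm, Swish, conv, applied twice."""
    def act(chans):
        return [[swish(v) for v in ch] for ch in chans]

    h = conv1(act(group_norm(x, num_groups)))
    h = conv2(act(group_norm(h, num_groups)))
    skip = shortcut(x) if shortcut is not None else x
    return [[s + r for s, r in zip(sc, hc)] for sc, hc in zip(skip, h)]


# Stand-ins for real convolutions, for illustration only.
identity = lambda chans: chans
zero = lambda chans: [[0.0] * len(ch) for ch in chans]

x = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [2.0, -2.0]]
normed = group_norm(x, 2)
out = residual_block(x, identity, zero)  # zero residual branch returns x
```

Because the residual branch is added to the shortcut, a block whose final convolution outputs zeros is exactly the identity, which is one reason this layout trains stably when stacked deeply.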

The self-attention block operates on flattened spatial dimensions:

h = GroupNorm(x)
Q = Conv_1x1(h)  # (B, C, H*W)
K = Conv_1x1(h)  # (B, C, H*W)
V = Conv_1x1(h)  # (B, C, H*W)

# Scale queries for fp16 stability
Q = Q * C^(-0.5)
W = softmax(Q^T @ K, dim=-1)   # (B, H*W, H*W), normalized over key positions
A = V @ W^T              # (B, C, H*W)
output = x + Conv_1x1(A)
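The core of the attention step, for a single sample, can be sketched in plain Python. This illustration assumes Q, K, and V have already been projected and flattened to C x N lists (N = H*W); the function name `spatial_attention` is hypothetical, and the normalization, 1x1 projections, and residual addition are left out.

```python
import math


def spatial_attention(q, k, v):
    """Scaled dot-product attention over flattened spatial positions.

    q, k, v: C x N nested lists (C channels, N positions).
    Returns a C x N list with out[c][i] = sum_j v[c][j] * w[i][j].
    """
    num_channels, num_pos = len(q), len(q[0])
    scale = num_channels ** -0.5  # applied to the queries
    out = [[0.0] * num_pos for _ in range(num_channels)]
    for i in range(num_pos):  # query position
        logits = [
            sum(q[c][i] * scale * k[c][j] for c in range(num_channels))
            for j in range(num_pos)
        ]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        weights = [e / total for e in exps]  # softmax over key positions j
        for c in range(num_channels):
            out[c][i] = sum(v[c][j] * weights[j] for j in range(num_pos))
    return out


# With all-zero queries the weights are uniform, so every output
# position is the spatial mean of V for that channel.
q = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
k = [[1.0, -2.0, 0.5], [0.3, 0.7, -1.0]]
v = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
attended = spatial_attention(q, k, v)
```

The weight matrix has N x N entries, so cost grows quadratically with the number of spatial positions, which is consistent with applying attention only at selected (typically low) resolutions.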

Downsampling uses strided convolutions with asymmetric padding:

x_padded = Pad(x, right=1, bottom=1)
x_down = Conv_3x3(x_padded, stride=2)  # halves spatial dims
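The effect of the one-sided padding on the output size can be checked with a one-dimensional sketch. The helper names and the all-ones kernel are assumptions for illustration; the same arithmetic applies independently to height and width.

```python
def conv_output_size(size, kernel=3, stride=2, pad_before=0, pad_after=1):
    """Output length of a strided convolution with explicit padding."""
    return (size + pad_before + pad_after - kernel) // stride + 1


def downsample_1d(row, weights=(1.0, 1.0, 1.0)):
    """Stride-2, kernel-3 convolution with one zero padded on the right only."""
    padded = list(row) + [0.0]
    last_start = len(padded) - len(weights)
    return [
        sum(w * padded[i + t] for t, w in enumerate(weights))
        for i in range(0, last_start + 1, 2)
    ]


row = [1, 2, 3, 4, 5, 6, 7, 8]
down = downsample_1d(row)
print(conv_output_size(len(row)))  # 4
print(down)                        # [6.0, 12.0, 18.0, 15.0]
```

Only the last window touches the padded zero, so for an even input length the output is exactly half as long.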

Upsampling uses nearest-neighbor interpolation followed by an optional 3x3 convolution:

x_up = NearestInterpolate(x, scale=2)
x_up = Conv_3x3(x_up)  # refine upsampled features
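Nearest-neighbor interpolation by an integer factor simply repeats each value along both axes. A minimal sketch on a single-channel grid (the function name `nearest_upsample_2d` is an assumption; the refining convolution is not shown):

```python
def nearest_upsample_2d(grid, scale=2):
    """Repeat every value `scale` times along both rows and columns."""
    return [
        [value for value in row for _ in range(scale)]
        for row in grid
        for _ in range(scale)
    ]


grid = [[1, 2],
        [3, 4]]
up = nearest_upsample_2d(grid)
for row in up:
    print(row)
# [1, 1, 2, 2]
# [1, 1, 2, 2]
# [3, 3, 4, 4]
# [3, 3, 4, 4]
```

The result is blocky by construction, which is why a convolution is commonly applied afterwards to smooth and refine the upsampled features.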

The encoder can optionally double its output channels so that the output holds both a mean and a log-variance for each latent channel, which allows sampling via the reparameterization trick.
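A sketch of how a doubled output could be split and sampled from, for a single spatial position. The mean-first channel ordering, the function name `sample_latent`, and the fixed default seed are all assumptions made for illustration.

```python
import math
import random


def sample_latent(moments, rng=None):
    """Sample z = mean + std * eps from a vector of 2*D moments.

    Assumes the first D values are the mean and the last D values are
    the log-variance (ordering is an assumption of this sketch).
    """
    d = len(moments) // 2
    mean, logvar = moments[:d], moments[d:]
    rng = rng or random.Random(0)
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, logvar)
    ]


# A very negative log-variance gives a near-zero standard deviation,
# so the sample collapses onto the mean.
moments = [0.5, -1.0, 2.0, -200.0, -200.0, -200.0]
z = sample_latent(moments)
```

Writing the sample as a deterministic function of the mean, the log-variance, and independent noise is what lets gradients flow back to the encoder through the sampling step.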

Related Pages

Implementation:Zai_org_CogVideo_VQVAE_Blocks
