Principle:Zai org CogVideo MoVQ Decoder Architecture

Knowledge Sources	MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation; Semantic Image Synthesis with Spatially-Adaptive Normalization
Domains	Video_Generation, Autoencoding, Image_Reconstruction
Last Updated	2026-02-10 00:00 GMT

Overview

The MoVQ decoder architecture reconstructs images from quantized latent codes by modulating feature normalization with the quantized vectors throughout the decoding hierarchy, enabling high-fidelity reconstruction.

Description

Standard VQ-VAE decoders take quantized latent codes as input and progressively upsample them back to pixel space. The MoVQ (Modulating Quantized Vectors) approach extends this by making the quantized code tensor available as a conditioning signal at every normalization layer in the decoder, not just at the input. It does so through spatially-adaptive normalization (inspired by SPADE): after each normalization, the activations are scaled and shifted by affine parameters predicted from the quantized codes.

The key insight is that simply feeding quantized codes as input to the decoder provides only an initial conditioning signal that may be diluted through deep network layers. By re-injecting the quantized information at every normalization point, the decoder maintains a strong connection to the discrete codebook representation throughout the reconstruction process, improving output fidelity.

The decoder follows a multi-resolution upsampling architecture: a middle section at the lowest resolution containing residual and attention blocks, followed by progressive upsampling stages. Every stage except the finest doubles the spatial resolution using nearest-neighbor interpolation, optionally followed by a learned convolution.
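The resolution-doubling step can be illustrated with a minimal NumPy sketch. It assumes a single (C, H, W) feature map and leaves out the optional learned convolution; the function name `upsample_nearest_2x` is illustrative and not taken from the implementation.

```python
import numpy as np

def upsample_nearest_2x(h: np.ndarray) -> np.ndarray:
    """Double the height and width of a (C, H, W) array by repeating each pixel."""
    return h.repeat(2, axis=1).repeat(2, axis=2)

h = np.arange(4, dtype=np.float32).reshape(1, 2, 2)
up = upsample_nearest_2x(h)   # shape (1, 4, 4); each input pixel becomes a 2x2 block
```

Because nearest-neighbor interpolation only copies values, any smoothing of the resulting blocks has to come from the convolution that may follow it.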

Usage

Apply this architecture when building decoders for VQ-VAE systems where high-fidelity reconstruction from discrete codes is critical. The spatially-adaptive normalization is particularly beneficial when the quantized codes carry important spatial structure that should influence reconstruction at all decoder depths.

Theoretical Basis

The core mechanism is spatially-adaptive normalization conditioned on quantized codes. Given a feature tensor f and quantized code tensor zq:

zq_resized = Interpolate(zq, size=f.spatial_size, mode="nearest")
zq_resized = Conv(zq_resized)  # optional additional convolution

gamma = Conv_y(zq_resized)   # learned scale, one value per channel and spatial location
beta  = Conv_b(zq_resized)   # learned bias, one value per channel and spatial location

output = GroupNorm(f) * gamma + beta

This replaces standard GroupNorm throughout the decoder. The gamma and beta parameters are spatially varying, allowing different regions of the feature map to be normalized differently based on the local quantized code.
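The computation above can be sketched in NumPy for a single unbatched feature map. This is a sketch under stated assumptions, not the module's actual code: `Conv_y` and `Conv_b` are modeled as 1x1 convolutions (a per-pixel linear map), the group normalization carries no affine parameters of its own, the optional extra convolution is skipped, and all helper names and weight shapes are hypothetical.

```python
import numpy as np

def group_norm(f, num_groups, eps=1e-6):
    """Normalize a (C, H, W) array within each group of channels (no affine)."""
    c, h, w = f.shape
    g = f.reshape(num_groups, -1)
    g = (g - g.mean(axis=1, keepdims=True)) / np.sqrt(g.var(axis=1, keepdims=True) + eps)
    return g.reshape(c, h, w)

def nearest_resize(zq, size):
    """Nearest-neighbor resize of a (C, H, W) array to the target (H, W)."""
    _, h, w = zq.shape
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    return zq[:, rows[:, None], cols[None, :]]

def conv1x1(x, weight, bias):
    """1x1 convolution: the same linear map over channels at every pixel."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def spatial_norm(f, zq, w_y, b_y, w_b, b_b, num_groups=2):
    """GroupNorm(f) * gamma + beta, with gamma and beta predicted from zq."""
    zq_resized = nearest_resize(zq, f.shape[1:])
    gamma = conv1x1(zq_resized, w_y, b_y)   # (C_f, H, W), varies per location
    beta = conv1x1(zq_resized, w_b, b_b)    # (C_f, H, W), varies per location
    return group_norm(f, num_groups) * gamma + beta

rng = np.random.default_rng(0)
f = rng.normal(size=(4, 8, 8))     # decoder features
zq = rng.normal(size=(3, 2, 2))    # quantized codes at a lower resolution
w_y, b_y = rng.normal(size=(4, 3)), np.ones(4)
w_b, b_b = rng.normal(size=(4, 3)), np.zeros(4)
out = spatial_norm(f, zq, w_y, b_y, w_b, b_b)   # shape (4, 8, 8)
```

With zero convolution weights, a scale bias of one, and a shift bias of zero, the layer reduces to plain group normalization, which makes the modulation easy to isolate when debugging.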

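The decoder mirrors the structure of the encoder. In the pseudocode below, z is the latent passed to the input convolution and zq is the quantized code tensor that conditions every normalization layer: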

# Middle section at lowest resolution
h = ResBlock(Conv_in(z), temb=None, zq=zq)
h = SelfAttention(h, zq=zq)
h = ResBlock(h, temb=None, zq=zq)

# Progressive upsampling
for level in reversed(resolution_levels):
    for block in range(num_res_blocks + 1):
        h = ResBlock(h, temb=None, zq=zq)
        if at_attention_resolution:
            h = SelfAttention(h, zq=zq)
    if not at_finest_level:
        h = Upsample(h)

# Final projection
output = Conv_out(Swish(SpatialNorm(h, zq)))
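To make the control flow of the loop concrete, the following pure-Python sketch traces which operations run at which resolution. It performs no tensor computation. The parameter names `base_res`, `ch_mult`, and `attn_resolutions`, and the example values, are assumptions chosen for illustration rather than settings taken from the implementation.

```python
def decoder_schedule(base_res, ch_mult, num_res_blocks, attn_resolutions):
    """Trace the layer order of the decoder; returns (ops, final_resolution)."""
    ops = ["mid.res", "mid.attn", "mid.res"]   # middle section at lowest resolution
    res = base_res
    for level in reversed(range(len(ch_mult))):
        for _ in range(num_res_blocks + 1):
            ops.append(f"res@{res}")
            if res in attn_resolutions:
                ops.append(f"attn@{res}")
        if level != 0:                         # the finest level is not upsampled
            res *= 2
            ops.append(f"up->{res}")
    ops.append("norm_out+swish+conv_out")      # final projection
    return ops, res

ops, final_res = decoder_schedule(
    base_res=32, ch_mult=(1, 2, 4), num_res_blocks=2, attn_resolutions={32}
)
```

With three levels there are two upsampling steps, so a 32x32 latent grid is decoded to 128x128 in this example.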

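The self-attention block, applied at designated resolutions, treats each spatial location as a token and computes the following, where d is the channel dimension of Q and K: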

Q, K, V = linear_projections(SpatialNorm(h, zq))
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d)) * V
output = h + projection(Attention(Q, K, V))
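A minimal NumPy sketch of this block for a single unbatched feature map follows. It assumes single-head attention, with plain (C, C) matrices standing in for the linear projections, and it takes the already-normalized features as an argument in place of calling SpatialNorm(h, zq). The function and argument names are hypothetical.

```python
import numpy as np

def spatial_self_attention(h_norm, h, w_q, w_k, w_v, w_o):
    """h_norm, h: (C, H, W) arrays; w_*: (C, C) projection matrices."""
    c, height, width = h.shape
    x = h_norm.reshape(c, -1).T                    # (N, C) with N = H * W tokens
    q, k, v = x @ w_q.T, x @ w_k.T, x @ w_v.T
    scores = q @ k.T / np.sqrt(c)                  # (N, N) scaled dot products
    scores -= scores.max(axis=1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over keys
    attended = weights @ v                         # (N, C)
    out = (attended @ w_o.T).T.reshape(c, height, width)
    return h + out, weights                        # residual connection

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 3, 3))
w_q, w_k, w_v, w_o = (rng.normal(size=(4, 4)) for _ in range(4))
# h is passed as its own normalized version here purely as a stand-in.
out, weights = spatial_self_attention(h, h, w_q, w_k, w_v, w_o)
```

The attention matrix has one row per spatial location, so its size grows quadratically with resolution, which is one reason to restrict attention to designated resolutions.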

Related Pages

Implementation:Zai_org_CogVideo_MoVQ_2D_Modules
