Principle:Zai org CogVideo Autoencoder Architecture

Knowledge Sources	High-Resolution Image Synthesis with Latent Diffusion Models; Auto-Encoding Variational Bayes
Domains	Video_Generation, Autoencoding, Latent_Diffusion
Last Updated	2026-02-10 00:00 GMT

Overview

The autoencoder architecture in latent diffusion models compresses high-resolution images into a compact latent space using a hierarchical convolutional encoder-decoder, enabling the diffusion process to operate efficiently on lower-dimensional representations.

Description

Latent diffusion models separate the generative process into two stages: a perceptual compression stage handled by the autoencoder, and a semantic composition stage handled by the diffusion model. The autoencoder's role is to learn a perceptually equivalent but computationally more tractable representation of the data.

The architecture follows a hierarchical U-Net-inspired design with several key properties:

Multi-resolution encoding uses a series of resolution levels, each containing residual blocks and optional self-attention layers. Between levels, strided convolutions reduce spatial dimensions by a factor of 2. The channel count increases at deeper levels according to a configurable multiplier schedule (e.g., 1, 2, 4, 8), allowing the network to increase its representational capacity as spatial resolution decreases.

Configurable attention backends allow trading off between computation speed and memory usage. Modern implementations support PyTorch 2.0's scaled dot-product attention, xformers memory-efficient attention, and linear attention approximations. The attention backend can be selected independently from the rest of the architecture.

A factory method pattern in the decoder lets subclasses override the attention, residual block, and convolution implementations without modifying the overall decoder structure. This makes it straightforward to add custom variants for specific use cases.

A timestep-conditioned U-Net variant (the full Model class) adds sinusoidal timestep embeddings and skip connections between the encoder and decoder paths, making it suitable as a standalone diffusion model backbone rather than only as a VAE component.
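
The factory-method extensibility described above can be illustrated with a minimal pure-Python sketch. All names here (`Decoder`, `_make_attn`, `_make_resblock`, `_make_conv`, and the placeholder layer classes) are hypothetical stand-ins chosen for illustration, not the actual CogVideo API, and the layers are empty placeholders rather than real neural-network modules.

```python
# Hypothetical placeholder layers; a real decoder would use neural-network modules.
class ResBlock:
    """Placeholder for a residual block."""


class VanillaAttention:
    """Placeholder for a softmax self-attention layer."""


class LinearAttention:
    """Placeholder for a linear-attention layer."""


class Conv:
    """Placeholder for a convolution layer."""


class Decoder:
    """Builds a fixed layer layout; layer classes come from factory methods."""

    def __init__(self, num_levels=2):
        # Ask the factories once, then build the structure from their results.
        make_attn = self._make_attn()
        make_resblock = self._make_resblock()
        make_conv = self._make_conv()
        self.layers = []
        for _ in range(num_levels):
            self.layers.append(make_resblock())
            self.layers.append(make_attn())
        self.layers.append(make_conv())  # final output projection

    def _make_attn(self):
        return VanillaAttention

    def _make_resblock(self):
        return ResBlock

    def _make_conv(self):
        return Conv


class LinearAttentionDecoder(Decoder):
    """Variant that swaps only the attention implementation."""

    def _make_attn(self):
        return LinearAttention


def layer_names(decoder):
    return [type(layer).__name__ for layer in decoder.layers]


default_layout = layer_names(Decoder(num_levels=1))
custom_layout = layer_names(LinearAttentionDecoder(num_levels=1))
print(default_layout)
print(custom_layout)
```

Because the structure-building code only calls the factories, the subclass changes the attention layer with a single overridden method and leaves the layout untouched.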

Usage

Apply this architecture when implementing the VAE stage of a latent diffusion pipeline. The encoder maps images to the latent space where diffusion operates, and the decoder maps latents back to pixel space. The spatial compression ratio is determined by the number of resolution levels, which equals the length of the channel multiplier schedule; the multiplier values themselves set channel widths, not the amount of downsampling.

Theoretical Basis

The autoencoder learns an encoding function and decoding function:

z = E(x)      # Encoder: R^{H x W x 3} -> R^{h x w x d}
x' = D(z)     # Decoder: R^{h x w x d} -> R^{H x W x 3}

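where h = H / f and w = W / f, with f being the total downsampling factor and d the number of latent channels. With L resolution levels and a factor-2 downsample between consecutive levels, f = 2^(L-1).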
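
As a worked example of the shape relation above, the following sketch computes the latent shape and the element-count compression ratio. It assumes one factor-2 downsample between consecutive levels, so that f = 2^(L-1); the function names and the example values (512 x 512 input, 4 levels, 4 latent channels) are illustrative assumptions, not values taken from the CogVideo configuration.

```python
def downsampling_factor(num_levels):
    """Total spatial downsampling: one factor-2 step between consecutive levels."""
    return 2 ** (num_levels - 1)


def latent_shape(height, width, num_levels, latent_channels):
    """Return (h, w, d) for an input of size height x width."""
    f = downsampling_factor(num_levels)
    if height % f != 0 or width % f != 0:
        raise ValueError(f"height and width must be divisible by {f}")
    return (height // f, width // f, latent_channels)


def compression_ratio(height, width, num_levels, latent_channels, in_channels=3):
    """Ratio of input element count to latent element count."""
    h, w, d = latent_shape(height, width, num_levels, latent_channels)
    return (height * width * in_channels) / (h * w * d)


shape = latent_shape(512, 512, num_levels=4, latent_channels=4)
ratio = compression_ratio(512, 512, num_levels=4, latent_channels=4)
print(shape, ratio)  # (64, 64, 4) 48.0
```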

The encoder passes the input through a hierarchy of resolution levels:

h = Conv_3x3(x)
for level in range(L):
    for block in range(num_res_blocks):
        h = ResBlock(h)
        if at_attention_resolution:
            h = SelfAttention(h)
    if not last_level:
        h = Downsample(h)  # halve spatial dims

# Middle section
h = ResBlock(h)
h = SelfAttention(h)
h = ResBlock(h)

# Project to latent
z = Conv_3x3(Swish(GroupNorm(h)))
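
The level loop above can be traced without any tensors by tracking only the resolution and channel width at each level. This is a bookkeeping sketch: `encoder_schedule` is a hypothetical helper, and the base width of 128 and input resolution of 256 are assumed values chosen to pair with the example multiplier schedule (1, 2, 4, 8).

```python
def encoder_schedule(base_channels, channel_multipliers, input_resolution):
    """Return a (level, resolution, channels) tuple for each encoder level."""
    schedule = []
    resolution = input_resolution
    last_level = len(channel_multipliers) - 1
    for level, multiplier in enumerate(channel_multipliers):
        schedule.append((level, resolution, base_channels * multiplier))
        if level != last_level:
            resolution //= 2  # Downsample halves the spatial dims between levels
    return schedule


levels = encoder_schedule(
    base_channels=128, channel_multipliers=(1, 2, 4, 8), input_resolution=256
)
for level, resolution, channels in levels:
    print(f"level {level}: {resolution}x{resolution}, {channels} channels")
```

With four levels there are three downsampling steps, so the deepest level runs at one eighth of the input resolution with eight times the base channel width.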

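The attention mechanism supports multiple implementations. In the vanilla and xformers paths, Q, K, and V are separate 1x1-convolution projections of the group-normalized feature map, flattened so that spatial positions form the sequence axis:

# Shared projections:
n = GroupNorm(h)
Q, K, V = Conv_1x1_q(n), Conv_1x1_k(n), Conv_1x1_v(n)

# Vanilla (PyTorch 2.0+):
Q, K, V = reshape(Q, K, V, "b c h w -> b 1 (h w) c")
output = scaled_dot_product_attention(Q, K, V)

# xformers memory-efficient:
Q, K, V = reshape(Q, K, V, "b c h w -> b (h w) c")
output = xformers.ops.memory_efficient_attention(Q, K, V)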

# Linear attention:
output = LinearAttention(h)  # O(N) approximation
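
To make the computation behind these backends concrete, here is a small pure-Python reference of softmax scaled dot-product attention over flattened spatial positions. It is a readability sketch on nested lists for a single sample and a single head, not a replacement for the optimized kernels named above; `flatten_spatial` and `reference_attention` are hypothetical helper names.

```python
import math


def flatten_spatial(fmap):
    """Turn a [c][h][w] feature map into (h*w) tokens of length c."""
    channels, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [
        [fmap[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def reference_attention(q, k, v):
    """softmax(q k^T / sqrt(d)) v for lists of token vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    outputs = []
    for query in q:
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in k]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[c] for w, value in zip(weights, v)) for c in range(len(v[0]))]
        )
    return outputs


# A 2-channel, 1x2 feature map becomes two tokens of length 2.
tokens = flatten_spatial([[[1.0, 3.0]], [[2.0, 4.0]]])
# A zero query scores every key equally, so the output is the mean of the values.
averaged = reference_attention([[0.0, 0.0]], tokens, tokens)
print(tokens, averaged)
```

The score matrix has one entry per pair of positions, which is the quadratic cost in N = h * w that the memory-efficient and linear variants aim to reduce.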

For the U-Net Model variant, timestep conditioning is injected via:

t_emb = sinusoidal_embedding(t, dim=ch)
t_emb = MLP(t_emb)  # Two-layer with Swish

# In each ResBlock:
h = h + Linear(Swish(t_emb))[:, :, None, None]
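
The sinusoidal embedding step can be sketched in plain Python. The source does not spell out the frequency schedule, so this sketch assumes the common DDPM-style convention (geometrically spaced frequencies up to a maximum period of 10000, sine terms followed by cosine terms); `max_period` and the restriction to even dimensions are assumptions made for the example.

```python
import math


def sinusoidal_embedding(timestep, dim, max_period=10000.0):
    """Return a length-dim embedding: sines first, then cosines (assumed layout)."""
    if dim < 4 or dim % 2 != 0:
        raise ValueError("this sketch expects an even dim of at least 4")
    half = dim // 2
    # Frequencies decay geometrically from 1 down to 1 / max_period.
    step = math.log(max_period) / (half - 1)
    freqs = [math.exp(-step * i) for i in range(half)]
    sines = [math.sin(timestep * f) for f in freqs]
    cosines = [math.cos(timestep * f) for f in freqs]
    return sines + cosines


emb_zero = sinusoidal_embedding(0, dim=8)
emb_ten = sinusoidal_embedding(10, dim=8)
print(emb_zero)
print(emb_ten)
```

In the U-Net variant this vector would then pass through the two-layer MLP, and each ResBlock would project it to its own channel count before adding it across all spatial positions.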

The U-Net skip connections concatenate encoder features with decoder features:

h_dec = Concat([h_dec, h_enc_skip], dim=channels)
h_dec = ResBlock(h_dec)
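
Concatenating skip features changes the input width of each decoder ResBlock, which is easy to get wrong. The sketch below does only the channel bookkeeping, assuming a DDPM-style layout in which the encoder stores the input-convolution output, every ResBlock output, and every downsample output, and the decoder runs num_res_blocks + 1 blocks per level. That layout and the helper name `unet_channel_plan` are assumptions for illustration, not confirmed details of the CogVideo Model class.

```python
def unet_channel_plan(base_channels, channel_multipliers, num_res_blocks):
    """Return (in_channels, out_channels) for each decoder ResBlock."""
    skip_stack = [base_channels]  # output of the input convolution
    current = base_channels
    last_level = len(channel_multipliers) - 1

    # Encoder path: record the width of every feature map kept as a skip.
    for level, multiplier in enumerate(channel_multipliers):
        for _ in range(num_res_blocks):
            current = base_channels * multiplier
            skip_stack.append(current)
        if level != last_level:
            skip_stack.append(current)  # downsample keeps the channel count

    # The middle section is assumed to preserve the channel count.

    # Decoder path: each block consumes one skip via channel concatenation.
    plan = []
    for level in reversed(range(len(channel_multipliers))):
        out_channels = base_channels * channel_multipliers[level]
        for _ in range(num_res_blocks + 1):
            skip = skip_stack.pop()
            plan.append((current + skip, out_channels))
            current = out_channels

    if skip_stack:
        raise RuntimeError("unused skip connections remain")
    return plan


plan = unet_channel_plan(base_channels=64, channel_multipliers=(1, 2), num_res_blocks=1)
print(plan)  # [(256, 128), (192, 128), (192, 64), (128, 64)]
```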

Related Pages

Implementation:Zai_org_CogVideo_Autoencoder_Model
