Principle:Zai org CogVideo Temporal Autoencoding

Knowledge Sources	Video Diffusion Models; Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models
Domains	Video_Generation, Autoencoding, Temporal_Modeling
Last Updated	2026-02-10 00:00 GMT

Overview

Temporal autoencoding extends image autoencoders with temporal processing layers that capture inter-frame dynamics, enabling the compression and reconstruction of video sequences while preserving temporal coherence.

Description

Temporal autoencoding is an architectural pattern for adapting pretrained 2D image autoencoders to handle video data by inserting dedicated temporal processing layers alongside the existing spatial layers. Rather than building a 3D autoencoder from scratch, this approach leverages the strong spatial priors already learned by image models and adds lightweight temporal components that model motion and inter-frame dependencies.

The core insight is that video can be decomposed into spatial content (what appears in each frame) and temporal dynamics (how content evolves across frames). By processing these separately and blending the results, the model can learn temporal coherence without destroying the spatial quality of its pretrained image representations.

The blending is controlled by an alpha mixing factor that interpolates between spatial-only and temporally-augmented features. This factor can be fixed or learned during training. Because alpha weights the spatial pathway in the blend defined under Theoretical Basis, starting with alpha near one (fully spatial) and gradually allowing the model to learn temporal contributions enables smooth fine-tuning from image to video generation.

Usage

Apply temporal autoencoding when adapting a pretrained image autoencoder for video tasks. This principle is especially useful when high-quality image model weights are available and must be preserved, when temporal coherence between frames is required, or when training resources are limited and full 3D model training from scratch is impractical.

Theoretical Basis

Spatial-Temporal Decomposition

Given an input video feature tensor with frames indexed by t, the temporal autoencoding architecture processes each frame spatially first, then applies temporal mixing:

h_spatial(t) = SpatialBlock(x(t))
h_temporal = TemporalBlock(h_spatial(1), h_spatial(2), ..., h_spatial(T))
h_output(t) = alpha * h_spatial(t) + (1 - alpha) * h_temporal(t)

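Here alpha is the merge factor controlling the blend between the spatial and temporal pathways: alpha = 1 keeps only the spatial features, and alpha = 0 keeps only the temporally mixed features.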
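The three equations above can be traced with a minimal plain-Python sketch. It is an illustration under stated assumptions, not the CogVideo implementation: frames are flattened to lists of floats, and `identity_spatial` and `mean_over_time` are hypothetical stand-ins for the real spatial and temporal blocks.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # one frame's features, flattened for illustration


def blend_spatial_temporal(
    frames: Sequence[Frame],
    spatial_block: Callable[[Frame], Frame],
    temporal_block: Callable[[List[Frame]], List[Frame]],
    alpha: float,
) -> List[Frame]:
    """Apply the spatial block per frame, mix across frames, then blend."""
    # h_spatial(t) = SpatialBlock(x(t)), computed independently per frame
    h_spatial = [spatial_block(x_t) for x_t in frames]
    # h_temporal sees all T frames at once
    h_temporal = temporal_block(h_spatial)
    # h_output(t) = alpha * h_spatial(t) + (1 - alpha) * h_temporal(t)
    return [
        [alpha * s + (1.0 - alpha) * m for s, m in zip(h_s, h_t)]
        for h_s, h_t in zip(h_spatial, h_temporal)
    ]


def identity_spatial(frame: Frame) -> Frame:
    # Hypothetical stand-in for a pretrained spatial block
    return list(frame)


def mean_over_time(frames: List[Frame]) -> List[Frame]:
    # Hypothetical temporal block: every frame becomes the temporal mean
    n = len(frames)
    mean = [sum(column) / n for column in zip(*frames)]
    return [list(mean) for _ in frames]


video = [[0.0, 2.0], [4.0, 6.0]]  # T = 2 frames, 2 features each
out = blend_spatial_temporal(video, identity_spatial, mean_over_time, alpha=0.5)
print(out)  # [[1.0, 3.0], [3.0, 5.0]]
```

With alpha = 1.0 the same call returns the spatial features unchanged, which is the property that lets pretrained image behaviour survive the start of video fine-tuning.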

Alpha Merge Strategies

Two strategies are commonly used:

Fixed:    alpha = constant (registered as buffer, not trained)
Learned:  alpha = sigmoid(w) where w is a trainable parameter

The sigmoid ensures the learned alpha stays in the open interval (0, 1). Initializing w = 0 gives alpha = 0.5 (equal blend). Because alpha weights the spatial pathway in the blend above, initializing w with a large positive value gives alpha near 1 (spatial-dominant start).
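The two strategies can be compared numerically with a small sketch. The function names `fixed_alpha` and `learned_alpha` are hypothetical; in a real model the fixed value would live in a non-trainable buffer and w would be a trainable parameter, which this scalar version only imitates.

```python
import math


def fixed_alpha(value: float) -> float:
    """Fixed strategy: a constant that training never updates."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return value


def learned_alpha(w: float) -> float:
    """Learned strategy: squash an unconstrained parameter w with a sigmoid."""
    return 1.0 / (1.0 + math.exp(-w))


for w in (-6.0, 0.0, 6.0):
    print(f"w = {w:+.1f} -> alpha = {learned_alpha(w):.4f}")
# w = -6.0 -> alpha = 0.0025
# w = +0.0 -> alpha = 0.5000
# w = +6.0 -> alpha = 0.9975
```

Larger magnitudes of w push alpha toward either end of the interval without ever reaching it, so the optimizer can always move the blend back toward the other pathway.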

Temporal Convolution

3D convolutions extend 2D spatial convolutions along the time axis:

Given input of shape (B, C, T, H, W):
  h = Conv2D(x(t)) for each frame t    [spatial features]
  h = reshape(h, (B, C, T, H, W))
  h = Conv3D(h, kernel=[k_t, k_h, k_w])  [temporal mixing]

The temporal kernel size controls the receptive field along the time axis.
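To make the receptive-field statement concrete, the sketch below applies only the temporal part of such a kernel to a single feature value tracked over T frames. It is a simplified, assumed setup: one channel, one spatial position, zero padding, and an odd kernel length, rather than a full Conv3D over (B, C, T, H, W).

```python
from typing import List, Sequence


def temporal_conv1d(track: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Correlate one feature's values over time with an odd-length kernel.

    Zero padding keeps the output at the same number of frames (T).
    """
    k_t = len(kernel)
    if k_t % 2 == 0:
        raise ValueError("this sketch assumes an odd temporal kernel size")
    pad = k_t // 2
    padded = [0.0] * pad + list(track) + [0.0] * pad
    return [
        sum(kernel[j] * padded[t + j] for j in range(k_t))
        for t in range(len(track))
    ]


# An impulse at frame 2 of a T = 5 clip shows which output frames it reaches.
impulse = [0.0, 0.0, 1.0, 0.0, 0.0]
print(temporal_conv1d(impulse, [1.0, 1.0, 1.0]))  # [0.0, 1.0, 1.0, 1.0, 0.0]
print(temporal_conv1d(impulse, [1.0]))            # [0.0, 0.0, 1.0, 0.0, 0.0]
```

With k_t = 3 the impulse influences its two neighbouring frames, while k_t = 1 leaves every frame independent, which is equivalent to no temporal mixing at all.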

Temporal Attention

Temporal attention operates across the time dimension with position-aware embeddings:

For each spatial position (i, j):
  tokens = {h(t, i, j) for t = 1..T}
  t_emb = sinusoidal_embedding(t)
  tokens = tokens + t_emb
  tokens = TransformerBlock(tokens)

This allows each spatial position to attend to all frames, capturing long-range temporal dependencies. Sinusoidal timestep embeddings provide the model with information about frame ordering.
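The embedding step can be sketched as follows. The layout (cosines first, then sines), the `max_period` of 10000, and the zero-based frame index are assumptions borrowed from common diffusion-model practice, not details confirmed by this page, and the transformer block itself is omitted.

```python
import math
from typing import List


def sinusoidal_embedding(position: int, dim: int, max_period: float = 10000.0) -> List[float]:
    """Return a length-`dim` embedding: cosines first, then sines."""
    if dim % 2 != 0:
        raise ValueError("this sketch assumes an even embedding dimension")
    half = dim // 2
    # Geometrically spaced frequencies from 1 down toward 1 / max_period
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    angles = [position * f for f in freqs]
    return [math.cos(a) for a in angles] + [math.sin(a) for a in angles]


def add_frame_embeddings(tokens: List[List[float]]) -> List[List[float]]:
    """tokens[t] is the feature vector of one spatial position at frame t."""
    dim = len(tokens[0])
    return [
        [v + e for v, e in zip(tok, sinusoidal_embedding(t, dim))]
        for t, tok in enumerate(tokens)
    ]


# Two identical tokens become distinguishable once frame order is added.
tokens = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
embedded = add_frame_embeddings(tokens)
print(embedded[0])                 # [1.0, 1.0, 0.0, 0.0]
print(embedded[0] != embedded[1])  # True
```

Attention is permutation-invariant over its tokens, so without an additive signal of this kind the temporal block could not tell frame 0 from frame 1.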

Factory Method Pattern

The decoder uses a factory method design pattern where temporal variants replace base components:

base_conv     -> temporal_conv     (Conv2D -> AE3DConv with Conv3d mixing)
base_resblock -> temporal_resblock (ResnetBlock -> VideoResBlock with ResBlock3D)
base_attn     -> temporal_attn     (AttnBlock -> VideoBlock with VideoTransformerBlock)

Which components are replaced depends on the time_mode configuration.
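A minimal sketch of this pattern is shown below. The classes are empty placeholders that only borrow the names from the mapping above, and the `time_mode` strings ("all", "conv-only", "attn-only") together with the rules for which component each mode swaps are illustrative assumptions, not confirmed configuration values.

```python
from typing import Dict


class Conv2D:
    """Placeholder for the base 2D convolution."""


class ResnetBlock:
    """Placeholder for the base residual block."""


class AttnBlock:
    """Placeholder for the base spatial attention block."""


class AE3DConv(Conv2D):
    """Placeholder for the temporal convolution variant."""


class VideoResBlock(ResnetBlock):
    """Placeholder for the temporal residual variant."""


class VideoBlock(AttnBlock):
    """Placeholder for the temporal attention variant."""


class Decoder:
    """Base decoder: factory methods return the image-only components."""

    def _make_conv(self) -> type:
        return Conv2D

    def _make_resblock(self) -> type:
        return ResnetBlock

    def _make_attn(self) -> type:
        return AttnBlock

    def component_classes(self) -> Dict[str, type]:
        return {
            "conv": self._make_conv(),
            "resblock": self._make_resblock(),
            "attn": self._make_attn(),
        }


class VideoDecoder(Decoder):
    """Overrides the factory methods; the mode names here are assumed."""

    def __init__(self, time_mode: str = "all") -> None:
        if time_mode not in ("all", "conv-only", "attn-only"):
            raise ValueError(f"unknown time_mode: {time_mode}")
        self.time_mode = time_mode

    def _make_conv(self) -> type:
        return super()._make_conv() if self.time_mode == "attn-only" else AE3DConv

    def _make_resblock(self) -> type:
        return super()._make_resblock() if self.time_mode == "attn-only" else VideoResBlock

    def _make_attn(self) -> type:
        return super()._make_attn() if self.time_mode == "conv-only" else VideoBlock


for mode in ("all", "conv-only", "attn-only"):
    names = {k: v.__name__ for k, v in VideoDecoder(mode).component_classes().items()}
    print(mode, names)
```

Because the decoder body only ever calls the factory methods, the construction code stays identical for image and video variants, and a new mode needs changes in one place.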

Related Pages

Implementation:Zai_org_CogVideo_VideoDecoder_Temporal
