
Principle:Zai org CogVideo SAT Model Initialization

From Leeroopedia


Metadata

Page Type: Principle
Knowledge Sources: CogVideo
Domains: Model_Architecture, Video_Diffusion
Last Updated: 2026-02-10 00:00 GMT

Overview

Technique for constructing a complete video diffusion model from configuration by instantiating and composing backbone, denoiser, sampler, conditioner, and VAE components.

Description

SAT model initialization uses the factory pattern (instantiate_from_config) to construct each model component from its YAML configuration. The SATVideoDiffusionEngine class serves as the top-level orchestrator, assembling all components into a coherent video diffusion training and inference pipeline.
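The assembly pattern can be sketched as follows. This is a simplified stand-in for SATVideoDiffusionEngine, not its real signature: the class name, attribute names, and the inlined factory are condensed from the description above.

```python
import importlib

def instantiate_from_config(config):
    # Resolve "package.module.ClassName" from `target` and call it
    # with the `params` mapping (empty if absent).
    module_path, cls_name = config["target"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_path), cls_name)
    return cls(**config.get("params", {}))

class VideoDiffusionEngine:
    # Each component is built independently from its own sub-config
    # and stored as an attribute, mirroring the six-component assembly.
    def __init__(self, model_config):
        self.model = instantiate_from_config(model_config["network_config"])
        self.denoiser = instantiate_from_config(model_config["denoiser_config"])
        self.sampler = instantiate_from_config(model_config["sampler_config"])
        self.conditioner = instantiate_from_config(model_config["conditioner_config"])
        self.first_stage_model = instantiate_from_config(model_config["first_stage_config"])
        self.loss_fn = instantiate_from_config(model_config["loss_fn_config"])
```

Because every component goes through the same factory, adding or swapping a component is a config edit, not a code change.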

Component Assembly

The initialization process constructs six core components from the model_config dictionary:

1. DiffusionTransformer Backbone (network_config)

The backbone is a DiT (Diffusion Transformer) architecture defined in dit_video_concat.py. It is constructed via instantiate_from_config(network_config) and wrapped with OPENAIUNETWRAPPER for compatibility with the denoising interface. The backbone processes latent video tensors conditioned on timestep embeddings and text embeddings.

Key architecture parameters from YAML:

  • hidden_size: 1920 (2B) or larger for 5B
  • num_layers: 30 (2B) or 42 (5B)
  • num_attention_heads: 30 (2B) or 48 (5B)
  • patch_size: 2
  • in_channels: 16 (T2V) or 32 (I2V)

The backbone also includes mixin modules for positional embedding, patch embedding, adaptive layer normalization, and the final output layer, plus an optional LoRA configuration.
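A hedged sketch of how these parameters appear in a 2B-scale network_config block (the target path and key layout are assumptions for illustration; the real CogVideoX YAML also lists the mixin modules described above):

```yaml
network_config:
  target: dit_video_concat.DiffusionTransformer
  params:
    hidden_size: 1920          # 2B width; larger for 5B
    num_layers: 30             # 42 for 5B
    num_attention_heads: 30    # 48 for 5B
    patch_size: 2
    in_channels: 16            # 16 for T2V; 32 for I2V
```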

2. Denoiser (denoiser_config)

The denoiser wraps the backbone to perform noise prediction. It applies the discretization schedule, weighting function, and scaling to convert between the model's noise prediction and the loss computation. The standard configuration uses DiscreteDenoiser with 1000 discrete noise indices, EpsWeighting, VideoScaling, and ZeroSNR-DDPM discretization.

3. Sampler (sampler_config)

The sampler implements the iterative denoising process for inference. The standard configuration uses VPSDEDPMPP2MSampler with 50 inference steps and DynamicCFG (Dynamic Classifier-Free Guidance) with a scale of 6.
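Dynamic CFG ramps the guidance strength over the denoising trajectory rather than holding it fixed. The sketch below shows one plausible cosine-ramped schedule plus the standard CFG combination; the exact ramp shape and exponent used by SAT's DynamicCFG guider are assumptions here, not the confirmed implementation.

```python
import math

def dynamic_cfg_scale(step, num_steps, max_scale=6.0, exponent=5.0):
    # Cosine ramp from 1 (no guidance) at the first step up to
    # `max_scale` at the last step; `exponent` delays the ramp.
    progress = step / num_steps
    return 1 + (max_scale - 1) * (1 - math.cos(math.pi * progress ** exponent)) / 2

def apply_cfg(uncond_pred, cond_pred, scale):
    # Standard classifier-free guidance combination.
    return uncond_pred + scale * (cond_pred - uncond_pred)
```

With a fixed scale the model can oversaturate early steps; a ramp keeps early (high-noise) steps close to unconditional while applying full guidance late.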

4. Conditioner (conditioner_config)

The conditioner processes text prompts into conditioning embeddings. It uses a GeneralConditioner with a frozen T5-XXL text encoder (FrozenT5Embedder) that produces 4096-dimensional embeddings with a maximum sequence length of 226 tokens. Unconditional guidance (UCG) is applied with a dropout rate of 0.1 during training.

5. First Stage Model / VAE (first_stage_config)

The 3D VAE (VideoAutoencoderInferenceWrapper) encodes video frames into a lower-dimensional latent space and decodes latents back to pixel space. The VAE is always initialized in eval mode with frozen parameters (it is never fine-tuned). It uses a context-parallel encoder and decoder for memory-efficient processing of long videos.

6. Loss Function (loss_fn_config)

The loss function (VideoDiffusionLoss) computes the denoising training objective. It includes a sigma sampler that selects noise levels according to the ZeroSNR-DDPM discretization schedule with uniform sampling across 1000 timesteps.

Parameter Freezing

After construction, disable_untrainable_params freezes parameters based on the training mode:

  • LoRA training (lora_train=True): All parameters remain in the optimizer but non-LoRA parameters have their lr_scale set to 0, effectively freezing them while maintaining gradient flow for LoRA layers.
  • Full fine-tuning (lora_train=False): Parameters matching not_trainable_prefixes (typically first_stage_model and conditioner) have requires_grad set to False. Parameters containing matrix_A or matrix_B (SAT's LoRA naming convention) are always kept trainable.
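The two freezing modes can be sketched over generic (name, param) pairs; the real disable_untrainable_params walks a torch module's named_parameters(), and the lr_scale attribute here stands in for SAT's optimizer-side learning-rate scaling:

```python
def disable_untrainable_params(named_params, lora_train,
                               not_trainable_prefixes=("first_stage_model",
                                                       "conditioner")):
    for name, param in named_params:
        if "matrix_A" in name or "matrix_B" in name:
            param.requires_grad = True   # LoRA adapters always stay trainable
        elif lora_train:
            param.lr_scale = 0.0         # kept in optimizer; lr scaled to zero
        elif any(name.startswith(p) for p in not_trainable_prefixes):
            param.requires_grad = False  # hard-frozen for full fine-tuning
```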

Precision Handling

The engine determines the compute precision from args (fp16, bf16, or fp32) and stores it as self.dtype. This dtype is passed to the network config and used for all forward pass computations. The dtype string is also passed to the backbone constructor for internal precision management.
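The selection logic amounts to a three-way switch on mutually exclusive flags. A minimal sketch, using dtype strings where the real engine stores torch dtypes (torch.float16 / torch.bfloat16 / torch.float32):

```python
def resolve_dtype(fp16=False, bf16=False):
    # Pick the compute precision from the parsed args; fp32 is the default.
    if fp16 and bf16:
        raise ValueError("fp16 and bf16 are mutually exclusive")
    if fp16:
        return "float16"
    if bf16:
        return "bfloat16"
    return "float32"
```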

Usage

Use when setting up a SAT-based training run. The model is fully determined by the YAML config; no code changes are needed for different model variants. To switch between CogVideoX-2B and 5B, between LoRA and full fine-tuning, or between text-to-video and image-to-video, only the YAML config files need to change.

The typical initialization flow is:

  1. Parse YAML configs via get_args.
  2. Pass args to SATVideoDiffusionEngine(args).
  3. Call model.disable_untrainable_params() to freeze appropriate parameters.
  4. Pass the model to training_main for distributed training.
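In pseudocode, the four steps reduce to the following (keyword arguments and the forward-step plumbing are elided; get_args, SATVideoDiffusionEngine, and training_main are the SAT entry points named above):

```
args  = get_args(sys.argv[1:])              # 1. parse YAML configs into args
model = SATVideoDiffusionEngine(args)       # 2. build all six components
model.disable_untrainable_params()          # 3. freeze per training mode
training_main(args, model_cls=model, ...)   # 4. distributed training loop
```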

Theoretical Basis

Factory Pattern (instantiate_from_config)

The factory pattern decouples model construction from usage. Each component is specified by its Python class path (target) and constructor arguments (params) in YAML. The instantiate_from_config utility dynamically imports the target class and instantiates it with the given params. This enables:

  • Swapping components (e.g., different samplers, denoisers, or VAEs) without code changes.
  • Composing complex architectures from independently developed modules.
  • Serializing the complete model specification as human-readable YAML.
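The mechanics fit in a few lines: import the module, look up the class, apply the params. The sketch below uses stdlib classes standing in for sampler variants to show that swapping a component is just a different target string (simplified relative to the sgm utility, which also handles OmegaConf objects):

```python
import importlib

def instantiate_from_config(config):
    # "package.module.ClassName" -> imported class, called with params.
    module_path, cls_name = config["target"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_path), cls_name)
    return cls(**config.get("params", {}))

# Two "components" built from config alone -- no code change to swap them.
a = instantiate_from_config({"target": "collections.Counter", "params": {}})
b = instantiate_from_config({"target": "collections.OrderedDict"})
```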

Engine Composition

The SATVideoDiffusionEngine composes its components according to the video diffusion training paradigm:

  • Training path: conditioner(text) produces conditioning embeddings, first_stage_model.encode(video) produces latents, loss_fn(model, denoiser, conditioner, latents, batch) computes the denoising loss.
  • Inference path: conditioner(text) produces conditioning, sampler(denoiser, noise) iteratively denoises to produce latents, first_stage_model.decode(latents) produces video frames.
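The two paths above, as pseudocode (names follow this section; tensor shapes and keyword arguments omitted):

```
# Training step
cond    = conditioner(batch)                       # text -> embeddings
latents = first_stage_model.encode(video)          # pixels -> latent space
loss    = loss_fn(model, denoiser, cond, latents)  # denoising objective

# Inference
cond    = conditioner(prompt)
latents = sampler(denoiser, noise, cond=cond)      # 50-step iterative denoise
video   = first_stage_model.decode(latents)        # latents -> pixel frames
```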

LoRA Parameter Efficiency

LoRA (Low-Rank Adaptation) decomposes weight updates into low-rank factors: W' = W + A * B, where A (d x r) and B (r x k) together contain far fewer parameters than the full d x k matrix W. By freezing the original weights and training only the low-rank adapters, LoRA reduces the trainable parameter count by orders of magnitude (e.g., from billions to millions) while largely preserving model quality. SAT implements LoRA through its mixin system, using the matrix_A and matrix_B parameter naming convention.
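A back-of-envelope count for one square projection at the 2B model's width (hidden_size 1920) makes the savings concrete; the rank of 16 is an assumed illustrative value, not a documented CogVideoX setting:

```python
# One d x d weight adapted with rank-r LoRA factors A (d x r), B (r x d).
d, r = 1920, 16
full_params = d * d          # frozen base weight W
lora_params = d * r + r * d  # trainable adapter parameters
reduction = full_params / lora_params  # ~60x fewer trainable parameters
```

Summed over every adapted projection in a multi-billion-parameter backbone, this is what takes the trainable count from billions down to millions.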
