
Principle:Zai org CogVideo SAT YAML Configuration

From Leeroopedia


Metadata

Page Type: Principle
Knowledge Sources: CogVideo
Domains: Configuration, Training_Infrastructure
Last Updated: 2026-02-10 00:00 GMT

Overview

Technique for configuring SAT model architecture and training parameters through composable YAML configuration files.

Description

SAT uses OmegaConf-based YAML configuration to define the complete training pipeline. The configuration system separates concerns into distinct top-level sections, each governing a different aspect of the pipeline. Multiple YAML files can be composed and merged, with later files overriding earlier ones, and CLI arguments overriding YAML values.

Configuration Structure

A complete SAT training configuration consists of four top-level YAML sections:

Model Section

The model section defines the entire model architecture through nested config dictionaries, each specifying a target (Python class path) and params (constructor arguments):

  • network_config: DiT (Diffusion Transformer) backbone architecture, including hidden size, number of layers, attention heads, patch size, positional embedding, and LoRA configuration.
  • denoiser_config: Noise prediction strategy, discretization scheme, weighting function, and scaling configuration.
  • sampler_config: Sampling algorithm for inference (e.g., VPSDEDPMPP2MSampler with 50 steps), including guidance configuration.
  • conditioner_config: Text conditioning pipeline, typically a frozen T5-XXL encoder with a UCG (unconditional guidance) dropout rate used for classifier-free guidance training.
  • first_stage_config: 3D VAE for video encoding/decoding, with checkpoint path and encoder/decoder architecture.
  • loss_fn_config: Diffusion training loss, including sigma sampling strategy and offset noise level.
  • scale_factor: Latent space scaling factor (1.15258426 for 2B, 0.7 for 5B).
  • lora_train: Boolean flag to enable LoRA training mode.
  • not_trainable_prefixes: List of parameter name prefixes to freeze (e.g., ['all'] for LoRA, ['first_stage_model', 'conditioner'] for full fine-tuning).
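The overall shape of the model section looks like the sketch below. The field names follow the list above; the class path and numeric values are illustrative, not verified defaults, so check them against the shipped configs for your model variant:

```yaml
model:
  scale_factor: 1.15258426          # 2B latent scaling (0.7 for 5B)
  lora_train: true
  not_trainable_prefixes: ['all']   # freeze everything except LoRA params
  network_config:
    target: dit_video_concat.DiffusionTransformer   # illustrative class path
    params:
      hidden_size: 1920             # illustrative architecture values
      num_layers: 30
      num_attention_heads: 30
      patch_size: 2
  # denoiser_config, sampler_config, conditioner_config,
  # first_stage_config, loss_fn_config follow the same target/params shape
```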

Data Section

The data section specifies the dataset class and its parameters:

  • target: Python class path (e.g., data_video.SFTDataset or data_video.VideoDataset).
  • params: Constructor arguments including video_size, fps, max_num_frames, and skip_frms_num.
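As a sketch, a data section using the dataset class named above might look like this (the parameter values are illustrative, not verified defaults):

```yaml
data:
  target: data_video.SFTDataset
  params:
    video_size: [480, 720]   # illustrative resolution
    fps: 8
    max_num_frames: 49
    skip_frms_num: 3.0
```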

Args Section

The args section maps directly to SAT training CLI arguments:

  • Training control: train_iters, epochs, mode (finetune), load (checkpoint path).
  • Evaluation: eval_iters, eval_interval, eval_batch_size.
  • Checkpointing: save (output directory), save_interval.
  • Logging: log_interval, only_log_video_latents.
  • Data paths: train_data, valid_data, split.
  • Parallelism: model_parallel_size, checkpoint_activations.
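A sketch of an args section covering the groups above (paths and counts are illustrative placeholders, not verified defaults):

```yaml
args:
  mode: finetune
  load: "CogVideoX-2b-sat/transformer"   # illustrative checkpoint path
  train_iters: 1000
  eval_iters: 1
  eval_interval: 100
  save: ckpts                            # output directory
  save_interval: 500
  log_interval: 20
  train_data: ["my_dataset"]             # illustrative data path
  valid_data: ["my_dataset"]
  split: 1,0,0
  model_parallel_size: 1
  checkpoint_activations: true
```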

DeepSpeed Section

The deepspeed section configures the distributed training optimizer:

  • Batch size: train_micro_batch_size_per_gpu, gradient_accumulation_steps.
  • ZeRO optimization: zero_optimization.stage, cpu_offload, overlap_comm.
  • Precision: bf16.enabled or fp16.enabled.
  • Optimizer: Type (e.g., sat.ops.FusedEmaAdam), learning rate, betas, weight decay.
  • Gradient clipping: gradient_clipping threshold.
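Put together, a deepspeed section covering these groups might look like the following sketch (the numeric values are illustrative, not verified defaults):

```yaml
deepspeed:
  train_micro_batch_size_per_gpu: 1
  gradient_accumulation_steps: 1
  gradient_clipping: 0.1
  zero_optimization:
    stage: 2
    cpu_offload: false
    overlap_comm: true
  bf16:
    enabled: true
  optimizer:
    type: sat.ops.FusedEmaAdam
    params:
      lr: 0.0001        # illustrative learning rate
      betas: [0.9, 0.95]
      weight_decay: 1e-4
```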

Config Composition

The --base CLI argument accepts multiple YAML files that are merged in order:

python train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml

In this example, cogvideox_2b_lora.yaml provides the model architecture (including LoRA configuration), and sft.yaml provides training hyperparameters, data configuration, and DeepSpeed settings. Values in sft.yaml override conflicting values from cogvideox_2b_lora.yaml.

Usage

Use YAML configuration to set up any SAT-based CogVideoX training run. The typical approach is to compose a base model config with a training config:

  • LoRA fine-tuning: --base configs/cogvideox_2b_lora.yaml configs/sft.yaml
  • Full fine-tuning: --base configs/cogvideox_2b.yaml configs/sft.yaml
  • 5B model: --base configs/cogvideox_5b_lora.yaml configs/sft.yaml
  • Image-to-Video: --base configs/cogvideox_5b_i2v_lora.yaml configs/sft.yaml

Modify only the relevant YAML sections for your experiment; no code changes are needed for different model variants, training schedules, or data pipelines.

Theoretical Basis

Hierarchical Configuration

Hierarchical configuration (YAML files + CLI overrides) separates concerns across multiple dimensions:

  • Model architecture is defined independently of training schedule and data pipeline.
  • Training hyperparameters can be varied without modifying model or data configuration.
  • Data pipeline configuration is decoupled from both model and training settings.

This separation enables systematic experiment management: different model sizes, LoRA ranks, learning rates, and datasets can be explored by composing and overriding individual config files rather than modifying code.

OmegaConf Merge Strategy

OmegaConf merges configurations using a last-writer-wins strategy at each key level. When multiple YAML files are loaded and merged via OmegaConf.merge(*configs), values from later files override values from earlier files for matching keys, while non-conflicting keys from all files are preserved. This enables incremental customization where a base config provides defaults and a training config overrides specific settings.
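The last-writer-wins merge can be sketched in plain Python. This is a simplified stand-in for OmegaConf.merge, which additionally handles typed containers, interpolation, and missing-value markers:

```python
def merge(*configs: dict) -> dict:
    """Recursively merge dicts; later configs win on conflicting keys."""
    result: dict = {}
    for cfg in configs:
        for key, value in cfg.items():
            if isinstance(value, dict) and isinstance(result.get(key), dict):
                # Both sides have a nested section: merge key by key.
                result[key] = merge(result[key], value)
            else:
                # Scalar or one-sided key: last writer wins.
                result[key] = value
    return result

base = {"model": {"lora_train": True}, "args": {"train_iters": 500}}
sft = {"args": {"train_iters": 1000, "save_interval": 500}}
merged = merge(base, sft)
# The model section from base is preserved; sft's train_iters
# overrides the base value, and save_interval is added.
```

This mirrors how a base model config and a training config compose: non-conflicting sections pass through untouched, while the later file's values replace earlier ones at each key.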

Factory Pattern via instantiate_from_config

Each component's YAML configuration specifies a target (fully qualified Python class path) and params (constructor arguments). The instantiate_from_config utility dynamically imports the class and constructs it with the given parameters. This factory pattern decouples the configuration of model components from their implementation: a new architecture can be integrated by adding a YAML entry, with no changes to the training code.
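A minimal sketch of such a factory is shown below; the actual SAT utility may differ in details such as default handling and error messages:

```python
import importlib

def instantiate_from_config(config: dict):
    """Import the class named by `target` and construct it with `params`."""
    module_path, class_name = config["target"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**config.get("params", {}))

# Works with any importable class; here a stdlib type stands in
# for a model component:
frac = instantiate_from_config(
    {"target": "fractions.Fraction",
     "params": {"numerator": 1, "denominator": 3}}
)
```

Because the class path is data rather than code, swapping a component (say, a different sampler) is a one-line YAML change.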
