Principle: Zai org CogVideo SAT YAML Configuration
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Knowledge Sources | CogVideo |
| Domains | Configuration, Training_Infrastructure |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Technique for configuring SAT model architecture and training parameters through composable YAML configuration files.
Description
SAT uses OmegaConf-based YAML configuration to define the complete training pipeline. The configuration system separates concerns into distinct top-level sections, each governing a different aspect of the pipeline. Multiple YAML files can be composed and merged, with later files overriding earlier ones, and CLI arguments overriding YAML values.
Configuration Structure
A complete SAT training configuration consists of three top-level YAML sections:
Model Section
The model section defines the entire model architecture through nested config dictionaries, each specifying a target (Python class path) and params (constructor arguments):
- network_config: DiT (Diffusion Transformer) backbone architecture, including hidden size, number of layers, attention heads, patch size, positional embedding, and LoRA configuration.
- denoiser_config: Noise prediction strategy, discretization scheme, weighting function, and scaling configuration.
- sampler_config: Sampling algorithm for inference (e.g., VPSDEDPMPP2MSampler with 50 steps), including guidance configuration.
- conditioner_config: Text conditioning pipeline, typically a frozen T5-XXL encoder with UCG (Unconditional Guidance) dropout rate.
- first_stage_config: 3D VAE for video encoding/decoding, with checkpoint path and encoder/decoder architecture.
- loss_fn_config: Diffusion training loss, including sigma sampling strategy and offset noise level.
- scale_factor: Latent space scaling factor (1.15258426 for 2B, 0.7 for 5B).
- lora_train: Boolean flag to enable LoRA training mode.
- not_trainable_prefixes: List of parameter name prefixes to freeze (e.g., `['all']` for LoRA, `['first_stage_model', 'conditioner']` for full fine-tuning).
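Pieced together, a model section might look like the following sketch. The class paths and hyperparameter values here are illustrative, loosely patterned on published CogVideoX SAT configs rather than copied from any specific file:

```yaml
model:
  scale_factor: 1.15258426        # 2B latent scaling; 5B uses 0.7
  lora_train: true
  not_trainable_prefixes: ['all']  # freeze everything except LoRA params
  network_config:
    target: dit_video_concat.DiffusionTransformer   # DiT backbone
    params:
      num_layers: 30
      num_attention_heads: 30
      hidden_size: 1920
      patch_size: 2
  conditioner_config:
    target: sgm.modules.GeneralConditioner
    params:
      emb_models:
        - is_trainable: false      # frozen T5-XXL text encoder
          ucg_rate: 0.1            # UCG dropout for classifier-free guidance
          target: sgm.modules.encoders.modules.FrozenT5Embedder
  first_stage_config:
    target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
    params:
      from_pretrained: "3d-vae.pt"  # 3D VAE checkpoint path (placeholder)
```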
Data Section
The data section specifies the dataset class and its parameters:
- target: Python class path (e.g., `data_video.SFTDataset` or `data_video.VideoDataset`).
- params: Constructor arguments including `video_size`, `fps`, `max_num_frames`, and `skip_frms_num`.
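As a sketch, with illustrative values (the resolution and frame count below match common CogVideoX defaults but are not authoritative):

```yaml
data:
  target: data_video.SFTDataset
  params:
    video_size: [480, 720]   # height, width
    fps: 8
    max_num_frames: 49
    skip_frms_num: 3         # frames skipped at clip boundaries
```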
Args Section
The args section maps directly to SAT training CLI arguments:
- Training control: `train_iters`, `epochs`, `mode` (finetune), `load` (checkpoint path).
- Evaluation: `eval_iters`, `eval_interval`, `eval_batch_size`.
- Checkpointing: `save` (output directory), `save_interval`.
- Logging: `log_interval`, `only_log_video_latents`.
- Data paths: `train_data`, `valid_data`, `split`.
- Parallelism: `model_parallel_size`, `checkpoint_activations`.
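A sketch of an args section for a short fine-tuning run; paths and values are illustrative placeholders, not taken from a shipped config:

```yaml
args:
  mode: finetune
  load: "CogVideoX-2b-sat/transformer"  # placeholder checkpoint path
  train_data: ["path/to/train"]         # placeholder dataset path
  valid_data: ["path/to/valid"]
  split: 1,0,0
  train_iters: 1000
  eval_iters: 1
  eval_interval: 100
  eval_batch_size: 1
  save: ckpts
  save_interval: 100
  log_interval: 20
  model_parallel_size: 1
  checkpoint_activations: true
</imports>
```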
DeepSpeed Section
The deepspeed section configures distributed training via DeepSpeed, covering batch sizing, ZeRO optimization, precision, the optimizer, and gradient clipping:
- Batch size: `train_micro_batch_size_per_gpu`, `gradient_accumulation_steps`.
- ZeRO optimization: `zero_optimization.stage`, `cpu_offload`, `overlap_comm`.
- Precision: `bf16.enabled` or `fp16.enabled`.
- Optimizer: type (e.g., `sat.ops.FusedEmaAdam`), learning rate, betas, weight decay.
- Gradient clipping: `gradient_clipping` threshold.
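A sketch of how these keys nest in the deepspeed section; the hyperparameter values are illustrative, not recommended settings:

```yaml
deepspeed:
  train_micro_batch_size_per_gpu: 1
  gradient_accumulation_steps: 1
  zero_optimization:
    stage: 2
    cpu_offload: false
    overlap_comm: true
  bf16:
    enabled: true
  fp16:
    enabled: false
  gradient_clipping: 0.1
  optimizer:
    type: sat.ops.FusedEmaAdam
    params:
      lr: 0.0002
      betas: [0.9, 0.95]
      weight_decay: 1.0e-4
```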
Config Composition
The `--base` CLI argument accepts multiple YAML files that are merged in order:

```shell
python train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml
```
In this example, cogvideox_2b_lora.yaml provides the model architecture (including LoRA configuration), and sft.yaml provides training hyperparameters, data configuration, and DeepSpeed settings. Values in sft.yaml override conflicting values from cogvideox_2b_lora.yaml.
Usage
Use YAML configuration to set up any SAT-based CogVideoX training run. The typical approach is to compose a base model config with a training config:
- LoRA fine-tuning: `--base configs/cogvideox_2b_lora.yaml configs/sft.yaml`
- Full fine-tuning: `--base configs/cogvideox_2b.yaml configs/sft.yaml`
- 5B model: `--base configs/cogvideox_5b_lora.yaml configs/sft.yaml`
- Image-to-Video: `--base configs/cogvideox_5b_i2v_lora.yaml configs/sft.yaml`
Modify only the relevant YAML sections for your experiment; no code changes are needed for different model variants, training schedules, or data pipelines.
Theoretical Basis
Hierarchical Configuration
Hierarchical configuration (YAML files + CLI overrides) separates concerns across multiple dimensions:
- Model architecture is defined independently of training schedule and data pipeline.
- Training hyperparameters can be varied without modifying model or data configuration.
- Data pipeline configuration is decoupled from both model and training settings.
This separation enables systematic experiment management: different model sizes, LoRA ranks, learning rates, and datasets can be explored by composing and overriding individual config files rather than modifying code.
OmegaConf Merge Strategy
OmegaConf merges configurations using a last-writer-wins strategy at each key level. When multiple YAML files are loaded and merged via OmegaConf.merge(*configs), values from later files override values from earlier files for matching keys, while non-conflicting keys from all files are preserved. This enables incremental customization where a base config provides defaults and a training config overrides specific settings.
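The merge semantics can be sketched with plain dictionaries. This is a simplified stand-in for `OmegaConf.merge(*configs)`, not the real implementation (OmegaConf adds typing, interpolation, and struct-mode checks on top of this core behavior):

```python
def merge_configs(base: dict, override: dict) -> dict:
    """Recursively merge `override` into `base`, last-writer-wins per key.

    Nested dicts are merged key by key; scalars and lists from the later
    config simply replace those from the earlier one.
    """
    merged = dict(base)
    for key, value in override.items():
        if key in merged and isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = merge_configs(merged[key], value)  # descend into nested sections
        else:
            merged[key] = value  # later file wins for scalars and lists
    return merged


# A base model config and a training config that overrides one nested value,
# mimicking `--base base.yaml sft.yaml` composition.
base = {"model": {"lora_train": True}, "args": {"train_iters": 500, "lr": 1e-3}}
sft = {"args": {"train_iters": 1000}, "deepspeed": {"bf16": {"enabled": True}}}

cfg = merge_configs(base, sft)
```

Here `cfg["args"]["train_iters"]` becomes 1000 (the later file wins), while `cfg["args"]["lr"]` and `cfg["model"]` survive unchanged from the base config.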
Factory Pattern via instantiate_from_config
Each component's YAML configuration specifies a target (fully qualified Python class path) and params (constructor arguments). The instantiate_from_config utility dynamically imports the class and constructs it with the given parameters. This factory pattern decouples the configuration of model components from their implementation, enabling new architectures to be integrated by only adding a YAML config entry.
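A minimal sketch of such a factory follows; the upstream helper behaves along these lines but with additional error handling. The demonstration uses a stdlib class (`fractions.Fraction`) as a stand-in for a model component, since the real targets live inside the SAT codebase:

```python
import importlib


def instantiate_from_config(config: dict):
    """Import the class named by config['target'] and construct it
    with config.get('params') as keyword arguments."""
    module_path, class_name = config["target"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**config.get("params", {}))


# Stand-in demo: a YAML-style dict resolving to a stdlib class.
frac = instantiate_from_config({
    "target": "fractions.Fraction",
    "params": {"numerator": 3, "denominator": 4},
})
```

In the real pipeline, `target` would name a class such as a DiT backbone or VAE wrapper, and `params` would carry the nested constructor arguments shown in the model section.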