Workflow:Zai org CogVideo SAT Finetuning

Knowledge Sources	CogVideo SAT Fine-tuning Guide, SwissArmyTransformer
Domains	Video_Generation, Fine_Tuning, SAT_Framework
Last Updated	2026-02-10 12:00 GMT

Overview

End-to-end process for fine-tuning CogVideoX models using the SwissArmyTransformer (SAT) framework with LoRA or full SFT, supporting single-GPU and multi-GPU DeepSpeed training.

Description

This workflow covers fine-tuning CogVideoX video generation models using the original SAT (SwissArmyTransformer) framework. SAT provides the core model architecture implementation including the DiffusionTransformer backbone, 3D VAE, and the complete SGM diffusion library. Training is configured through YAML files that define model architecture, training hyperparameters, and DeepSpeed settings. The SAT framework supports both LoRA (parameter-efficient) and full SFT fine-tuning for all CogVideoX variants, with integrated WebDataset loading for scalable data pipelines. This is the original training framework used by the CogVideoX research team.

Usage

Use this workflow when you need fine-grained control over the CogVideoX model architecture during training, want SAT-specific features such as context parallelism, or are working with the SAT model weight format. It is also the preferred path when building on the original research codebase rather than the Diffusers adaptation. It requires familiarity with YAML configuration and the SAT ecosystem.

Execution Steps

Step 1: Environment Setup

Install the SAT framework dependencies and configure the environment. The SAT module has its own `requirements.txt`, separate from the main project's. The required Python packages include SwissArmyTransformer, DeepSpeed, OmegaConf, and video processing libraries (decord, imageio).

Key considerations:

SAT dependencies are in `sat/requirements.txt`, separate from root requirements
DeepSpeed is required for both single-GPU and multi-GPU training
CUDA toolkit version must be compatible with PyTorch and DeepSpeed
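
A quick pre-flight check can confirm that the packages named above are importable before a run is launched. The sketch below is a minimal example and not part of the repository: `missing_packages` is a hypothetical helper, and the import names (in particular `sat` for SwissArmyTransformer) are assumptions about how these packages install.

```python
import importlib.util

# Assumed import names; SwissArmyTransformer is assumed to install as `sat`.
REQUIRED_MODULES = ["sat", "deepspeed", "omegaconf", "decord", "imageio"]


def missing_packages(modules=REQUIRED_MODULES):
    """Return the top-level modules from `modules` that cannot be located."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
        print("Try: pip install -r sat/requirements.txt")
    else:
        print("All required packages are importable.")
```

This only checks that the modules can be found; it does not verify CUDA, PyTorch, and DeepSpeed version compatibility.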

Step 2: Dataset Preparation

Prepare training data in the format expected by the SAT data pipeline. Videos are organized as individual MP4 files with corresponding text caption files. The SAT data loader supports both local file lists and WebDataset format for large-scale training. Videos are preprocessed to the target resolution and frame count during loading.

Key considerations:

Video resolution and frame count are defined in the model YAML config
The data pipeline supports bucket sampling for variable-length videos
WebDataset format enables streaming from cloud storage
Caption files contain one text description per video
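
Before training, it can be useful to verify that every video has a caption. The following is a minimal sketch under an assumed directory layout (`videos/<name>.mp4` paired with `labels/<name>.txt`); both the layout and the helper name `pair_videos_with_captions` are illustrative assumptions, so adjust them to match your data and the SAT data loader settings.

```python
from pathlib import Path


def pair_videos_with_captions(root):
    """Return ((video_path, caption) pairs, videos that have no caption file).

    Assumed layout (hypothetical, adjust to your data):
        root/videos/<name>.mp4
        root/labels/<name>.txt
    """
    root = Path(root)
    pairs, unmatched = [], []
    for video in sorted((root / "videos").glob("*.mp4")):
        caption_file = root / "labels" / f"{video.stem}.txt"
        if caption_file.is_file():
            # One text description per video, stripped of surrounding whitespace.
            caption = caption_file.read_text(encoding="utf-8").strip()
            pairs.append((video, caption))
        else:
            unmatched.append(video)
    return pairs, unmatched
```

Calling `pair_videos_with_captions("path/to/dataset")` and inspecting the second return value gives a quick list of videos that would otherwise reach training without text.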

Step 3: YAML Configuration

Select and customize the YAML configuration files for the target model variant and training mode. Two configs are needed: a model config (e.g., `cogvideox_2b_lora.yaml`) defining the architecture, and a training config (`sft.yaml`) defining hyperparameters and DeepSpeed settings. Model configs specify the DiffusionTransformer architecture, VAE parameters, sampler, and LoRA configuration when applicable.

Key considerations:

Model configs are available for the 2B, 5B, and 1.5-5B variants, covering T2V and I2V, with LoRA variants
The training config (`sft.yaml`) sets learning rate, batch size, save interval, and DeepSpeed options
LoRA configs specify the rank and the target modules (`attention_dense`, `dense_h_to_4h`, etc.)
Model weights path must be set in the model YAML config
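
Because two config files are passed together, it helps to picture the result as one merged configuration. The sketch below illustrates that idea with nested dictionaries in plain Python, assuming that later files take precedence over earlier ones. It is not the SAT loader itself, and every key and value shown (`num_layers`, `lora`, `train_iters`, and so on) is a hypothetical stand-in.

```python
import copy


def deep_merge(base, override):
    """Recursively merge `override` into a copy of `base`; `override` wins on conflicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical stand-ins for the parsed model config and training config.
model_cfg = {"model": {"num_layers": 30, "lora": {"rank": 128}}}
train_cfg = {
    "args": {"train_iters": 1000, "save_interval": 500},
    "model": {"lora": {"rank": 64}},
}

merged = deep_merge(model_cfg, train_cfg)
print(merged)
```

Under this assumption, a value set in both files resolves to the one from the file listed last, while keys that appear in only one file are carried over unchanged.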

Step 4: Model Initialization

Load the SATVideoDiffusionEngine which orchestrates the entire diffusion training pipeline. This initializes the DiffusionTransformer backbone (30 layers for 2B, 42 for 5B), the 3D VAE with context parallelism support, the T5 text encoder conditioner, and the SGM diffusion components (denoiser, loss, sampler, discretizer). For LoRA training, adapter layers are injected after loading the base weights.

Key considerations:

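The DiffusionTransformer backbone inherits from SAT's BaseModel; the SATVideoDiffusionEngine wraps it together with the VAE, conditioner, and diffusion components
Context parallelism splits the temporal dimension across GPUs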
First-stage model (VAE) and conditioner (T5) are frozen during training
LoRA layers wrap the specified linear layers, adding a trainable low-rank update while the base weights stay frozen
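
To see why LoRA is parameter-efficient, compare the trainable parameters of a full weight matrix with those of its low-rank factors. For a linear layer with a weight of shape d_out x d_in, LoRA trains two factors, B (d_out x r) and A (r x d_in). The sketch below does only this arithmetic; the dimensions and rank are illustrative values, not read from any CogVideoX config.

```python
def full_params(d_in, d_out):
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable weights for the factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_in + d_out)


# Illustrative sizes only; not taken from a real model config.
d_in, d_out, rank = 2048, 2048, 64
full = full_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)
print(f"full: {full}, lora: {lora}, ratio: {lora / full:.4f}")
```

With these example sizes the low-rank factors hold 1/16 of the parameters of the full matrix, and the fraction shrinks further as the rank is lowered.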

Step 5: Training Execution

Launch training via the SAT `training_main` function, which handles the distributed training loop. For single GPU, environment variables set the world size to 1. For multi-GPU, `torchrun` coordinates the distributed processes. Each training step loads a video batch, encodes it to latent space with the VAE, encodes the text with T5, computes the diffusion loss (noise prediction), and updates the trainable parameters.

Key considerations:

Single GPU: `python train_video.py --base configs/model.yaml configs/sft.yaml`
Multi GPU: `torchrun --nproc_per_node=N train_video.py --base configs/model.yaml configs/sft.yaml`
Weights & Biases (wandb) logging can be enabled for training visualization
Video samples are periodically generated and saved during training
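
The two launch modes above can be assembled programmatically, which is convenient when scripting sweeps. The sketch below builds the argument list and environment without running anything. `build_launch` is a hypothetical helper, and the specific environment variable names used for the single-GPU case are an assumption; the text above only states that the world size is set to 1.

```python
import os


def build_launch(model_config, train_config="configs/sft.yaml", num_gpus=1):
    """Return (argv, env) for launching train_video.py on one or several GPUs."""
    script_args = ["train_video.py", "--base", model_config, train_config]
    env = dict(os.environ)
    if num_gpus == 1:
        # Assumed variable names for a single-process run.
        env.update({
            "WORLD_SIZE": "1",
            "RANK": "0",
            "LOCAL_RANK": "0",
            "LOCAL_WORLD_SIZE": "1",
        })
        argv = ["python", *script_args]
    else:
        # torchrun spawns one process per GPU on this node.
        argv = ["torchrun", f"--nproc_per_node={num_gpus}", *script_args]
    return argv, env


argv, env = build_launch("configs/cogvideox_2b_lora.yaml", num_gpus=1)
print(" ".join(argv))
```

The returned pair can be passed to `subprocess.run(argv, env=env, check=True)` from the directory that contains `train_video.py`.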

Step 6: Weight Export

After training, export the fine-tuned weights for use in either SAT or Diffusers inference. SAT checkpoints can be used directly for SAT-based inference. For Diffusers compatibility, use the weight conversion tools to convert SAT format to HuggingFace format. LoRA weights can be exported separately using the LoRA export utility.

Key considerations:

SAT to HF conversion: `tools/convert_weight_sat2hf.py`
LoRA export: `tools/export_sat_lora_weight.py`
DeepSpeed checkpoints require consolidation before conversion: `tools/convert_weight_deepspeed2hf.py`
Exported HF weights can be loaded by the Diffusers inference pipeline
