Workflow:Zai org CogVideo SAT Finetuning
| Knowledge Sources | |
|---|---|
| Domains | Video_Generation, Fine_Tuning, SAT_Framework |
| Last Updated | 2026-02-10 12:00 GMT |
Overview
End-to-end process for fine-tuning CogVideoX models with the SwissArmyTransformer (SAT) framework, using either LoRA or full supervised fine-tuning (SFT), on a single GPU or on multiple GPUs with DeepSpeed.
Description
This workflow covers fine-tuning CogVideoX video generation models using the original SAT (SwissArmyTransformer) framework. SAT provides the core model architecture implementation including the DiffusionTransformer backbone, 3D VAE, and the complete SGM diffusion library. Training is configured through YAML files that define model architecture, training hyperparameters, and DeepSpeed settings. The SAT framework supports both LoRA (parameter-efficient) and full SFT fine-tuning for all CogVideoX variants, with integrated WebDataset loading for scalable data pipelines. This is the original training framework used by the CogVideoX research team.
Usage
Use this workflow when you need fine-grained control over the CogVideoX model architecture during training, want to use SAT-specific features such as context parallelism, or are working with the SAT model weight format. It is also the preferred path when building on the original research codebase rather than the Diffusers adaptation. The workflow assumes familiarity with YAML configuration and the SAT ecosystem.
Execution Steps
Step 1: Environment Setup
Install the SAT framework dependencies and configure the environment. The SAT module has its own `requirements.txt`, separate from the main project's. Install the required Python packages, including SwissArmyTransformer, DeepSpeed, OmegaConf, and the video processing libraries (decord, imageio).
Key considerations:
- SAT dependencies are in `sat/requirements.txt`, separate from root requirements
- DeepSpeed is required for both single-GPU and multi-GPU training
- CUDA toolkit version must be compatible with PyTorch and DeepSpeed
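A quick pre-flight check can confirm that the packages named above are importable before launching a run. The following is a minimal sketch, not part of the SAT codebase: the import names are assumptions (in particular, SwissArmyTransformer is assumed to import as `sat`), and `find_missing` is a hypothetical helper.

```python
import importlib.util

# Import names are assumptions; SwissArmyTransformer is assumed to import as `sat`.
REQUIRED_MODULES = ["torch", "sat", "deepspeed", "omegaconf", "decord", "imageio"]


def find_missing(modules):
    """Return the module names that cannot be located for import."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed SAT training dependencies were found.")
```

This only checks that the modules can be found. It does not verify that the installed CUDA toolkit, PyTorch, and DeepSpeed versions are compatible with each other.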
Step 2: Dataset Preparation
Prepare training data in the format expected by the SAT data pipeline. Videos are organized as individual MP4 files with corresponding text caption files. The SAT data loader supports both local file lists and WebDataset format for large-scale training. Videos are preprocessed to the target resolution and frame count during loading.
Key considerations:
- Video resolution and frame count are defined in the model YAML config
- The data pipeline supports bucket sampling for variable-length videos
- WebDataset format enables streaming from cloud storage
- Caption files contain one text description per video
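Before training, it can help to verify that every video has a matching caption file. The sketch below assumes a layout with a `videos/` directory of MP4 files and a `labels/` directory of same-named `.txt` files; that layout and the helper name `pair_videos_with_captions` are illustrative assumptions, so adjust them to match the actual dataset.

```python
from pathlib import Path


def pair_videos_with_captions(root):
    """Return (video_path, caption) pairs and the names of videos lacking a caption."""
    root = Path(root)
    pairs, missing = [], []
    for video in sorted((root / "videos").glob("*.mp4")):
        # Assumed convention: labels/<stem>.txt holds the caption for videos/<stem>.mp4.
        caption_file = root / "labels" / (video.stem + ".txt")
        if caption_file.is_file():
            caption = caption_file.read_text(encoding="utf-8").strip()
            pairs.append((video, caption))
        else:
            missing.append(video.name)
    return pairs, missing
```

Calling `pair_videos_with_captions("path/to/dataset")` returns the usable pairs together with a list of videos that would otherwise fail or be skipped at load time.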
Step 3: YAML Configuration
Select and customize the YAML configuration files for the target model variant and training mode. Two configs are needed: a model config (e.g., `cogvideox_2b_lora.yaml`) defining the architecture, and a training config (`sft.yaml`) defining hyperparameters and DeepSpeed settings. Model configs specify the DiffusionTransformer architecture, VAE parameters, sampler, and LoRA configuration when applicable.
Key considerations:
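- Model configs cover the 2B, 5B, and 1.5-5B models; T2V and I2V configs exist for the 5B-class models, and LoRA variants are provided as separate files (e.g., `cogvideox_2b_lora.yaml`)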
- The training config (`sft.yaml`) sets learning rate, batch size, save interval, and DeepSpeed options
- LoRA configs specify rank, target modules (attention_dense, dense_h_to_4h, etc.)
- Model weights path must be set in the model YAML config
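Because two config files are passed together, it is useful to picture how they might combine into one configuration. The sketch below uses plain dictionaries and a recursive merge in which later files override earlier ones. Both the merge behaviour and every key name shown (`lora`, `r`, `load`, `train_iters`, `save_interval`) are illustrative assumptions, not the actual SAT schema or loader.

```python
import copy


def merge_configs(base, override):
    """Recursively merge `override` into a copy of `base`; later values win."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = merge_configs(merged[key], value)
        else:
            # Scalars, lists, or new keys: the override replaces the base value.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical stand-ins for the parsed model config and the parsed sft.yaml.
model_cfg = {"model": {"lora": {"r": 128}}, "args": {"load": "path/to/sat/weights"}}
train_cfg = {"args": {"train_iters": 1000, "save_interval": 500}}

config = merge_configs(model_cfg, train_cfg)
print(config["args"])
```

Under this assumed merge order, a value set in both files takes the setting from the file listed last, so the ordering of the config arguments matters.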
Step 4: Model Initialization
Load the SATVideoDiffusionEngine which orchestrates the entire diffusion training pipeline. This initializes the DiffusionTransformer backbone (30 layers for 2B, 42 for 5B), the 3D VAE with context parallelism support, the T5 text encoder conditioner, and the SGM diffusion components (denoiser, loss, sampler, discretizer). For LoRA training, adapter layers are injected after loading the base weights.
Key considerations:
- The SATVideoDiffusionEngine inherits from SAT's BaseModel
- Context parallelism splits the temporal dimension across GPUs
- First-stage model (VAE) and conditioner (T5) are frozen during training
- LoRA layers replace specified linear layers with low-rank decompositions
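The low-rank replacement described above can be illustrated with the generic LoRA formulation, in which a frozen weight `W` is augmented by a trainable update `(alpha / r) * B @ A`. This is a dependency-free sketch of the general technique, not SAT's LoRA implementation; the class name, the `alpha` parameter, and the initialization (small random `A`, zero `B`) are assumptions based on common practice.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # frozen base weight, shape (out_dim, in_dim)
        # A is small and random, B is zero, so the update is zero at the start.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        # Only A (rank x in_dim) and B (out_dim x rank) would be trained.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
layer = LoRALinear(W, rank=1)
print(layer.forward([1.0, 2.0, 3.0]), layer.trainable_params())
```

With `B` initialized to zero, the adapted layer reproduces the base layer exactly at the start of training. The trainable parameter count grows as `r * (in_dim + out_dim)` rather than `in_dim * out_dim`.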
Step 5: Training Execution
Launch training via the SAT `training_main` function, which handles the distributed training loop. For single-GPU runs, environment variables set the world size to 1. For multi-GPU runs, `torchrun` coordinates the distributed processes. Each training step loads a video batch, encodes it into latent space with the VAE, encodes the text with T5, computes the diffusion loss (noise prediction), and updates the trainable parameters.
Key considerations:
- Single GPU: `python train_video.py --base configs/model.yaml configs/sft.yaml`
- Multi GPU: `torchrun --nproc_per_node=N train_video.py --base configs/model.yaml configs/sft.yaml`
- Weights & Biases (wandb) logging can be enabled to visualize training
- Video samples are periodically generated and saved during training
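The two launch modes above can be wrapped in a small helper that builds the command and environment without running anything. This is a sketch under stated assumptions: `build_launch` is a hypothetical helper, and the environment variable names used for the single-GPU case (`WORLD_SIZE`, `RANK`, `LOCAL_RANK`, `LOCAL_WORLD_SIZE`) are assumed rather than taken from this page.

```python
import os
import shlex


def build_launch(num_gpus, model_cfg, train_cfg="configs/sft.yaml"):
    """Return (argv, env) for a single-GPU or multi-GPU training launch."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    script_args = ["train_video.py", "--base", model_cfg, train_cfg]
    env = dict(os.environ)
    if num_gpus == 1:
        # Assumed variable names for a one-process "distributed" run.
        env.update({"WORLD_SIZE": "1", "RANK": "0", "LOCAL_RANK": "0", "LOCAL_WORLD_SIZE": "1"})
        argv = ["python"] + script_args
    else:
        # torchrun spawns one worker process per GPU on this node.
        argv = ["torchrun", f"--nproc_per_node={num_gpus}"] + script_args
    return argv, env


argv, env = build_launch(1, "configs/model.yaml")
print(shlex.join(argv))
```

The returned pair can be passed to `subprocess.run(argv, env=env, check=True)` from the directory that contains `train_video.py`.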
Step 6: Weight Export
After training, export the fine-tuned weights for use in either SAT or Diffusers inference. SAT checkpoints can be used directly for SAT-based inference. For Diffusers compatibility, use the weight conversion tools to convert SAT format to HuggingFace format. LoRA weights can be exported separately using the LoRA export utility.
Key considerations:
- SAT to HF conversion: `tools/convert_weight_sat2hf.py`
- LoRA export: `tools/export_sat_lora_weight.py`
- DeepSpeed checkpoints require consolidation before conversion: `tools/convert_weight_deepspeed2hf.py`
- Exported HF weights can be loaded by the Diffusers inference pipeline
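Before running a conversion tool, the checkpoint file to convert has to be located. The sketch below assumes a checkpoint directory that contains a `latest` text file naming an iteration subdirectory, which in turn holds `mp_rank_00_model_states.pt`. That layout and the helper name `resolve_latest_checkpoint` are assumptions, so verify them against the actual training output.

```python
from pathlib import Path


def resolve_latest_checkpoint(ckpt_dir, filename="mp_rank_00_model_states.pt"):
    """Return the checkpoint file named by the `latest` marker in ckpt_dir."""
    ckpt_dir = Path(ckpt_dir)
    marker = ckpt_dir / "latest"
    if not marker.is_file():
        raise FileNotFoundError(f"no 'latest' marker in {ckpt_dir}")
    # Assumed: the marker holds the name of the most recent iteration directory.
    tag = marker.read_text(encoding="utf-8").strip()
    ckpt = ckpt_dir / tag / filename
    if not ckpt.is_file():
        raise FileNotFoundError(f"checkpoint file not found: {ckpt}")
    return ckpt
```

The resolved path can then be supplied to the conversion or LoRA export script. The command-line flags of those scripts are not covered on this page and are deliberately not shown here.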