Principle:Zai org CogVideo SAT Environment Setup
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Knowledge Sources | CogVideo, SwissArmyTransformer |
| Domains | Environment, Training_Infrastructure |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Principle of establishing a complete software environment with all dependencies for the SAT (SwissArmyTransformer) training framework used in CogVideoX fine-tuning.
Description
The SAT framework requires a specific set of dependencies to support its video diffusion model training pipeline. These dependencies span several functional categories:
Core Framework
- SwissArmyTransformer (>=0.4.12): The foundational training framework that extends PyTorch with model parallelism primitives, DeepSpeed integration, and a standardized training loop. SAT provides the `training_main` function that orchestrates distributed training, checkpointing, and evaluation.
- OmegaConf (>=2.3.0): A hierarchical configuration system that enables YAML-based experiment management. OmegaConf supports structured config composition, merging multiple YAML files, and CLI overrides, allowing the SAT pipeline to separate model architecture, training schedule, and data pipeline configurations.
Distributed Training
- DeepSpeed (>=0.15.3): A distributed training optimization library that provides ZeRO (Zero Redundancy Optimizer) state partitioning, mixed-precision training (fp16/bf16), and gradient accumulation. DeepSpeed is mandatory for SAT-based training as the framework delegates all distributed optimization to it.
- PyTorch Lightning (>=2.4.0): Provides additional training utilities and abstractions used by supporting modules in the SAT pipeline.
Data Loading
- decord (>=0.6.0): A high-performance video decoder that enables random-access frame reading without loading entire videos into memory. The SAT data pipeline uses decord's `VideoReader` to efficiently sample frames at a target FPS from mp4 files.
- webdataset: Enables streaming from sharded tar archives for large-scale distributed training, used by the `VideoDataset` class via the `MetaDistributedWebDataset` base class.
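The FPS resampling that decord's random access makes cheap reduces to index arithmetic. A pure-Python sketch of that math (a simplified illustration, not the exact SAT pipeline logic):

```python
def fps_sample_indices(num_frames: int, native_fps: float,
                       target_fps: float, max_frames: int) -> list[int]:
    """Pick frame indices that resample a clip from its native FPS to a
    target FPS, capped at max_frames (simplified sketch)."""
    step = native_fps / target_fps  # source frames per sampled frame
    indices = []
    pos = 0.0
    while int(pos) < num_frames and len(indices) < max_frames:
        indices.append(int(pos))
        pos += step
    return indices

# A 3-second clip at 30 FPS resampled to 8 FPS:
print(fps_sample_indices(90, 30.0, 8.0, 49))
```

With decord, the resulting index list can then be fetched in a single call via `VideoReader.get_batch(indices)` without decoding the frames in between.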
Tensor Operations and Transforms
- einops: Provides expressive tensor rearrangement operations used throughout the training and inference pipeline for reshaping video tensors between different dimension orderings.
- kornia (>=0.7.3): A differentiable computer vision library providing GPU-accelerated image transforms and augmentations.
- torchvision: Used for video resizing, cropping, and spatial transforms applied during data preprocessing.
Monitoring and Serialization
- wandb (>=0.18.5): Weights & Biases integration for experiment tracking, loss logging, and video sample visualization during training.
- safetensors (>=0.4.5): A safe, zero-copy tensor serialization format used for saving LoRA adapter weights in the PEFT-compatible format.
Additional Utilities
- beartype (>=0.19.0): Runtime type checking for function signatures.
- fsspec (>=2024.2.0): Filesystem specification providing a unified interface for local and remote file access.
- scipy (>=1.14.1): Scientific computing utilities used in various numerical operations.
- braceexpand: Shell-style brace expansion for specifying ranges of tar shard paths in WebDataset configurations.
- imageio: Used for saving generated video samples as mp4 files during evaluation logging.
Usage
Environment setup must be performed before any SAT-based training or inference workflow. It is a one-time operation per environment (virtual environment, container, or machine). The canonical method is to run `pip install -r sat/requirements.txt` from the repository root, which installs all dependencies at or above their minimum required versions.
Typical scenarios requiring environment setup:
- Initial setup: Before first SAT-based fine-tuning run on a new machine or container.
- Environment recreation: When reproducing training results in a new environment.
- Version upgrades: When updating the CogVideo repository, re-running the requirements install ensures compatibility.
Theoretical Basis
Dependency Management
Specifying floor versions (via >= constraints) prevents silent breakage from API changes in upstream packages: the requirements file pins minimum versions that have been tested for compatibility, while still allowing newer compatible releases. Note that >= constraints alone do not guarantee exact reproducibility; bit-for-bit recreation of an environment requires fully pinned versions (e.g., captured via pip freeze).
SAT Architecture
SwissArmyTransformer extends PyTorch with model parallelism primitives through a mixin-based architecture. The framework partitions model parameters across GPUs using tensor model parallelism (splitting attention heads and MLP layers) and integrates with DeepSpeed for data-parallel training with ZeRO optimizer state partitioning. This dual parallelism strategy enables training models that exceed single-GPU memory while maintaining efficient gradient communication.
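The head-splitting arithmetic behind tensor model parallelism can be sketched in a few lines (a simplified illustration of the idea, not SAT's actual partitioning code):

```python
def partition_heads(num_heads: int, head_dim: int,
                    world_size: int, rank: int) -> tuple[int, int, int]:
    """Return the block of attention heads owned by one tensor-parallel
    rank (assumes num_heads divides evenly across ranks, as this style
    of partitioning requires)."""
    assert num_heads % world_size == 0, "heads must divide across ranks"
    heads_per_rank = num_heads // world_size
    start = rank * heads_per_rank
    end = start + heads_per_rank
    # Each rank holds a contiguous block of heads, i.e. a column slice
    # of the QKV projection of width heads_per_rank * head_dim.
    return start, end, heads_per_rank * head_dim

# 48 heads of dim 64 split across 4 tensor-parallel ranks:
for r in range(4):
    print(partition_heads(48, 64, 4, r))
```

Because each rank's slice is independent, the attention forward pass needs no communication until the output projection, where an all-reduce combines the per-rank partial sums.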
DeepSpeed ZeRO Optimizer Partitioning
DeepSpeed's ZeRO optimization partitions optimizer states (Stage 1), gradients (Stage 2), and optionally parameters (Stage 3) across data-parallel ranks. The SAT pipeline typically uses Stage 2, which partitions both optimizer states and gradients, reducing per-GPU memory consumption by a factor proportional to the data-parallel degree while maintaining training throughput through overlapped communication and computation.
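A representative ZeRO Stage 2 configuration fragment, expressed as the dict DeepSpeed consumes. The values here are illustrative defaults, not the repository's actual settings, though the keys are standard DeepSpeed config options:

```python
# Illustrative DeepSpeed ZeRO Stage 2 config fragment (standard DeepSpeed
# keys; values are representative, not the repo's actual configuration).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 4,
    "bf16": {"enabled": True},        # or "fp16" on pre-Ampere GPUs
    "zero_optimization": {
        "stage": 2,                   # partition optimizer states + grads
        "overlap_comm": True,         # overlap gradient reduction with backward
        "contiguous_gradients": True, # reduce memory fragmentation
    },
}
print(ds_config["zero_optimization"]["stage"])
```

Stage 2 is the usual sweet spot for this workload: it sheds most of the optimizer-state and gradient memory without the parameter-gathering communication that Stage 3 adds to every forward pass.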