Environment:Zai org CogVideo Diffusers Finetuning Environment
| Knowledge Sources | |
|---|---|
| Domains | Video_Generation, Deep_Learning, Finetuning |
| Last Updated | 2026-02-10 02:00 GMT |
Overview
Ubuntu/Linux GPU environment with Python 3.10+, PyTorch >= 2.5.1, CUDA (bf16/fp16), HuggingFace Diffusers >= 0.32.2, DeepSpeed >= 0.16.4, and HuggingFace Accelerate >= 1.11.0 for LoRA and SFT fine-tuning of CogVideoX video diffusion models.
Description
This environment provides the full stack for fine-tuning CogVideoX models using the HuggingFace Diffusers-based training pipeline. It supports both LoRA (Low-Rank Adaptation) and full SFT (Supervised Fine-Tuning) with distributed training via HuggingFace Accelerate and DeepSpeed ZeRO Stage 2/3. The pipeline uses bf16 mixed precision for CogVideoX-5B models and fp16 for CogVideoX-2B. GPU memory management relies on gradient checkpointing, VAE slicing/tiling, and precomputed latent caching.
Usage
Use this environment for any LoRA fine-tuning or full SFT workflow on CogVideoX models using the Diffusers-based pipeline in the `finetune/` package. This is the mandatory prerequisite for running the training scripts (`train_ddp_t2v.sh`, `train_ddp_i2v.sh`, `train_zero_t2v.sh`, `train_zero_i2v.sh`).
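A minimal launch sketch for the scripts named above (the model/data paths and hyperparameters are set inside each script; edit those variables before running):

```shell
# Single-node DDP LoRA fine-tuning, text-to-video.
# Edit the model/data path variables inside the script first.
cd finetune
bash train_ddp_t2v.sh

# Multi-GPU DeepSpeed ZeRO run, text-to-video
bash train_zero_t2v.sh
```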
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | CUDA-compatible OS required |
| Hardware (LoRA 2B) | NVIDIA GPU >= 16GB VRAM | e.g., RTX 4080 |
| Hardware (LoRA 5B) | NVIDIA GPU >= 24GB VRAM | e.g., RTX 4090 |
| Hardware (LoRA 1.5-5B) | NVIDIA GPU >= 35GB VRAM | e.g., A100 40GB |
| Hardware (SFT 2B) | NVIDIA GPU >= 36GB VRAM | e.g., A100 40GB (DDP) |
| Hardware (SFT 5B ZeRO-3) | 8x NVIDIA GPU >= 28GB VRAM | e.g., 8x RTX 5090 |
| Python | >= 3.10 | pyproject.toml targets py310 |
| CUDA | >= 11.0 (bf16 support required for 5B) | Ampere+ recommended |
| Disk | Sufficient for latent cache | Cached latents stored on disk per resolution |
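The VRAM requirements above can be encoded as a quick pre-flight check. The numbers simply mirror the table; the workflow keys are hypothetical names, not identifiers from the repo:

```python
# Minimum per-GPU VRAM (GB) per workflow, mirroring the requirements table.
MIN_VRAM_GB = {
    "lora-2b": 16,
    "lora-5b": 24,
    "lora-1.5-5b": 35,
    "sft-2b": 36,
    "sft-5b-zero3": 28,  # per GPU, across 8 GPUs
}

def vram_sufficient(workflow: str, available_gb: float) -> bool:
    """Return True if the available per-GPU VRAM meets the table's minimum."""
    return available_gb >= MIN_VRAM_GB[workflow]
```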
Dependencies
System Packages
- NVIDIA CUDA Toolkit (compatible with PyTorch build)
- `ffmpeg` (for video I/O via imageio-ffmpeg)
Python Packages
- `torch` >= 2.5.1
- `torchvision` >= 0.20.1
- `diffusers` >= 0.32.2
- `transformers` >= 4.46.3
- `accelerate` >= 1.11.0
- `deepspeed` >= 0.16.4
- `peft` >= 0.13.2
- `pydantic` >= 2.10.6
- `datasets` >= 2.14.4
- `decord` >= 0.6.0
- `opencv-python` >= 4.11.0.86
- `sentencepiece` >= 0.2.0
- `numpy` == 1.26.0 (pinned)
- `imageio` >= 2.37.0
- `imageio-ffmpeg` >= 0.6.0
- `moviepy` >= 2.2.1
- `wandb` >= 0.19.7
Optional Packages
- `torchao` — For 4-bit/8-bit optimizers or CPU offload optimizer (install with `USE_CPP=0 pip install torchao`)
- `bitsandbytes` — For 8-bit Adam/AdamW optimizers
- `prodigyopt` — For Prodigy self-adaptive optimizer
- `came-pytorch` — For CAME optimizer
Credentials
No API tokens are required for the core finetuning pipeline. Models are loaded from public HuggingFace repositories.
Optional:
- `OPENAI_API_KEY`: For prompt enhancement via OpenAI API (used in inference/demo scripts only)
- `WANDB_API_KEY`: For Weights & Biases experiment logging (if wandb tracking is enabled)
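A small sketch for reporting which of the optional credentials are present (the key names come from the list above; the helper name is ours):

```python
import os

# Optional keys only -- nothing here is required for core finetuning.
OPTIONAL_KEYS = ("OPENAI_API_KEY", "WANDB_API_KEY")

def optional_credentials() -> dict:
    """Report which optional API keys are set in the environment."""
    return {key: key in os.environ for key in OPTIONAL_KEYS}
```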
Quick Install
```shell
# Install core dependencies (quote the specifiers so the shell
# does not treat ">=" as a redirection)
pip install "torch>=2.5.1" "torchvision>=0.20.1" "diffusers>=0.32.2" "transformers>=4.46.3" \
  "accelerate>=1.11.0" "deepspeed>=0.16.4" "peft>=0.13.2" "pydantic>=2.10.6" \
  "datasets>=2.14.4" "decord>=0.6.0" "opencv-python>=4.11.0.86" "sentencepiece>=0.2.0" \
  "numpy==1.26.0" "imageio>=2.37.0" "imageio-ffmpeg>=0.6.0" "moviepy>=2.2.1" "wandb>=0.19.7"

# Optional optimizers
pip install bitsandbytes prodigyopt came-pytorch
USE_CPP=0 pip install torchao
```
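To verify installed versions against the minimums above without extra dependencies, a naive dotted-version comparison is enough for these release-style versions (this helper is a hypothetical sketch, not part of the repo; it does not handle pre-release tags):

```python
def _parse(version: str) -> tuple:
    """Split a release-style dotted version (e.g. '2.5.1') into an int tuple."""
    return tuple(int(part) for part in version.split("."))

def meets_minimum(installed: str, minimum: str) -> bool:
    """Numeric comparison of two release-style versions."""
    return _parse(installed) >= _parse(minimum)

# In practice you would feed this from importlib.metadata, e.g.:
# from importlib.metadata import version
# assert meets_minimum(version("torch"), "2.5.1")
```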
Code Evidence
Environment variable setup from `finetune/train_ddp_t2v.sh:1-3`:
```shell
export TOKENIZERS_PARALLELISM=false
```
Mixed precision validation from `finetune/schemas/args.py:170-175`:
```python
@field_validator("mixed_precision")
def validate_mixed_precision(cls, v: str, info: ValidationInfo) -> str:
    if v == "fp16" and "cogvideox-2b" not in str(info.data.get("model_path", "")).lower():
        logging.warning(
            "All CogVideoX models except cogvideox-2b were trained with bfloat16. "
            "Using fp16 precision may lead to training instability."
        )
    return v
```
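The rule this validator enforces reduces to a simple lookup. A pure-Python restatement (the function name is ours, not the repo's):

```python
def recommended_precision(model_path: str) -> str:
    """fp16 only for cogvideox-2b; every other CogVideoX variant uses bf16."""
    return "fp16" if "cogvideox-2b" in model_path.lower() else "bf16"
```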
MPS bfloat16 check from `finetune/trainer.py:230-234`:
```python
if torch.backends.mps.is_available() and weight_dtype == torch.bfloat16:
    raise ValueError(
        "Mixed precision training with bfloat16 is not supported on MPS. "
        "Please use fp16 (recommended) or fp32 instead."
    )
```
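The same guard can be expressed without a torch dependency, which makes it easy to unit-test (function and parameter names are ours):

```python
def check_mps_dtype(mps_available: bool, dtype: str) -> None:
    """Mirror the trainer's guard: bf16 autocast is unsupported on MPS (pytorch#99272)."""
    if mps_available and dtype == "bf16":
        raise ValueError(
            "Mixed precision training with bfloat16 is not supported on MPS. "
            "Please use fp16 (recommended) or fp32 instead."
        )
```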
DeepSpeed ZeRO-2 config from `finetune/configs/zero2.yaml:1-10`:
```json
{
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "reduce_scatter": true
  },
  "bf16": { "enabled": true },
  "train_micro_batch_size_per_gpu": 1
}
```
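Because DeepSpeed configs are plain JSON, a quick sanity check before launch is cheap. The inline string below mirrors the snippet above:

```python
import json

# Inline copy of the ZeRO-2 config shown above, for a pre-launch sanity check.
ZERO2_CONFIG = """
{
  "zero_optimization": {"stage": 2, "overlap_comm": true, "reduce_scatter": true},
  "bf16": {"enabled": true},
  "train_micro_batch_size_per_gpu": 1
}
"""

config = json.loads(ZERO2_CONFIG)
assert config["zero_optimization"]["stage"] == 2
assert config["bf16"]["enabled"] is True
```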
Accelerate config targeting 8 GPUs from `finetune/accelerate_config.yaml`:
```yaml
num_processes: 8
gpu_ids: "0,1,2,3,4,5,6,7"
deepspeed_config:
  deepspeed_config_file: configs/zero2.yaml
```
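A hedged launch sketch using that config file. The `train.py` entry point shown here is an assumption; in practice the `train_*.sh` scripts wrap the launch for you:

```shell
# Launch with the 8-GPU accelerate config (entry point is hypothetical)
accelerate launch --config_file accelerate_config.yaml \
  train.py --mixed_precision bf16
```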
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `Mixed precision training with bfloat16 is not supported on MPS` | Running bf16 on Apple Silicon (pytorch#99272) | Use `--mixed_precision fp16` or `fp32` |
| `All CogVideoX models except cogvideox-2b were trained with bfloat16` | Using fp16 with a 5B model | Switch to `--mixed_precision bf16` |
| `validation_steps must be a multiple of checkpointing_steps` | Misaligned validation/checkpoint intervals | Set validation_steps as a multiple of checkpointing_steps |
| `For cogvideox-5b models, height must be 480 and width must be 720` | Wrong resolution for 5B model | Use 480x720 for CogVideoX-5B |
| CUDA OOM during training | Insufficient VRAM | Enable gradient checkpointing, VAE slicing/tiling, or use DeepSpeed ZeRO-3 |
| NCCL timeout during distributed training | Latent precomputation takes too long | Increase `nccl_timeout` (default 1800s) |
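The third error in the table comes from an interval-alignment rule. A minimal restatement for checking your arguments up front (function name is ours):

```python
def intervals_aligned(validation_steps: int, checkpointing_steps: int) -> bool:
    """validation_steps must be a positive multiple of checkpointing_steps."""
    return validation_steps > 0 and validation_steps % checkpointing_steps == 0
```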
Compatibility Notes
- CogVideoX-2B: Trained with fp16; use `--mixed_precision fp16`.
- CogVideoX-5B / CogVideoX1.5-5B: Trained with bf16; use `--mixed_precision bf16`. fp16 may cause training instability.
- Apple Silicon (MPS): bf16 not supported due to pytorch#99272. AMP is automatically disabled on MPS. fp16 or fp32 only.
- Multi-GPU DDP: Requires `find_unused_parameters=True` because LoRA freezes most parameters.
- DeepSpeed ZeRO-3: Validation is slow; recommended to disable validation when using ZeRO-3.
- numpy: Pinned to 1.26.0 across the project. Do not upgrade.
- TOKENIZERS_PARALLELISM: Must be set to `false` to prevent HuggingFace tokenizer issues in distributed training.
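The notes above boil down to a few process-level settings; a stdlib-only sketch that mirrors them (constant names are ours, values are the defaults stated above):

```python
import os
from datetime import timedelta

# Prevent HuggingFace tokenizer fork issues in distributed training.
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")

# Default NCCL timeout; raise it if latent precomputation is slow.
NCCL_TIMEOUT = timedelta(seconds=1800)

# Needed for multi-GPU DDP because LoRA freezes most parameters.
FIND_UNUSED_PARAMETERS = True
```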
Related Pages
- Implementation:Zai_org_CogVideo_T2V_I2V_Dataset_Loader
- Implementation:Zai_org_CogVideo_Args_Parse_Args
- Implementation:Zai_org_CogVideo_CogVideoX_LoRA_Trainer_Load_Components
- Implementation:Zai_org_CogVideo_Accelerator_Setup
- Implementation:Zai_org_CogVideo_Trainer_Train
- Implementation:Zai_org_CogVideo_Trainer_Checkpoint_Validate
- Implementation:Zai_org_CogVideo_Load_Lora_Weights_Fuse