Environment:Zai org CogVideo Diffusers Finetuning Environment
| Knowledge Sources | |
|---|---|
| Domains | Video_Generation, Deep_Learning, Finetuning |
| Last Updated | 2026-02-10 02:00 GMT |
Overview
Ubuntu/Linux GPU environment with Python 3.10+, PyTorch >= 2.5.1, CUDA (bf16/fp16), HuggingFace Diffusers >= 0.32.2, DeepSpeed >= 0.16.4, and HuggingFace Accelerate >= 1.11.0 for LoRA and SFT fine-tuning of CogVideoX video diffusion models.
Description
This environment provides the full stack for fine-tuning CogVideoX models using the HuggingFace Diffusers-based training pipeline. It supports both LoRA (Low-Rank Adaptation) and full SFT (Supervised Fine-Tuning) with distributed training via HuggingFace Accelerate and DeepSpeed ZeRO Stage 2/3. The pipeline uses bf16 mixed precision for CogVideoX-5B models and fp16 for CogVideoX-2B. GPU memory management relies on gradient checkpointing, VAE slicing/tiling, and precomputed latent caching.
Usage
Use this environment for any LoRA fine-tuning or full SFT workflow on CogVideoX models using the Diffusers-based pipeline in the `finetune/` package. This is the mandatory prerequisite for running the training scripts (`train_ddp_t2v.sh`, `train_ddp_i2v.sh`, `train_zero_t2v.sh`, `train_zero_i2v.sh`).
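A minimal launch sketch for the scripts named above (the model/data paths and hyperparameters are set inside each script; edit those variables before running):

```shell
# Single-node DDP LoRA fine-tuning, text-to-video.
# Edit the model/data path variables inside the script first.
cd finetune
bash train_ddp_t2v.sh

# Multi-GPU DeepSpeed ZeRO run, text-to-video
bash train_zero_t2v.sh
```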
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | CUDA-compatible OS required |
| Hardware (LoRA 2B) | NVIDIA GPU >= 16GB VRAM | e.g., RTX 4080 |
| Hardware (LoRA 5B) | NVIDIA GPU >= 24GB VRAM | e.g., RTX 4090 |
| Hardware (LoRA 1.5-5B) | NVIDIA GPU >= 35GB VRAM | e.g., A100 40GB |
| Hardware (SFT 2B) | NVIDIA GPU >= 36GB VRAM | e.g., A100 40GB (DDP) |
| Hardware (SFT 5B ZeRO-3) | 8x NVIDIA GPU >= 28GB VRAM | e.g., 8x RTX 5090 |
| Python | >= 3.10 | pyproject.toml targets py310 |
| CUDA | >= 11.0 (bf16 support required for 5B) | Ampere+ recommended |
| Disk | Sufficient for latent cache | Cached latents stored on disk per resolution |
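The VRAM requirements above can be encoded as a quick pre-flight check. The numbers simply mirror the table; the workflow keys are hypothetical names, not identifiers from the repo:

```python
# Minimum per-GPU VRAM (GB) per workflow, mirroring the requirements table.
MIN_VRAM_GB = {
    "lora-2b": 16,
    "lora-5b": 24,
    "lora-1.5-5b": 35,
    "sft-2b": 36,
    "sft-5b-zero3": 28,  # per GPU, across 8 GPUs
}

def vram_sufficient(workflow: str, available_gb: float) -> bool:
    """Return True if the available per-GPU VRAM meets the table's minimum."""
    return available_gb >= MIN_VRAM_GB[workflow]
```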
Dependencies
System Packages
- NVIDIA CUDA Toolkit (compatible with PyTorch build)
- `ffmpeg` (for video I/O via imageio-ffmpeg)
Python Packages
- `torch` >= 2.5.1
- `torchvision` >= 0.20.1
- `diffusers` >= 0.32.2
- `transformers` >= 4.46.3
- `accelerate` >= 1.11.0
- `deepspeed` >= 0.16.4
- `peft` >= 0.13.2
- `pydantic` >= 2.10.6
- `datasets` >= 2.14.4
- `decord` >= 0.6.0
- `opencv-python` >= 4.11.0.86
- `sentencepiece` >= 0.2.0
- `numpy` == 1.26.0 (pinned)
- `imageio` >= 2.37.0
- `imageio-ffmpeg` >= 0.6.0
- `moviepy` >= 2.2.1
- `wandb` >= 0.19.7
Optional Packages
- `torchao` — For 4-bit/8-bit optimizers or CPU offload optimizer (install with `USE_CPP=0 pip install torchao`)
- `bitsandbytes` — For 8-bit Adam/AdamW optimizers
- `prodigyopt` — For Prodigy self-adaptive optimizer
- `came-pytorch` — For CAME optimizer
Credentials
No API tokens are required for the core finetuning pipeline. Models are loaded from public HuggingFace repositories.
Optional:
- `OPENAI_API_KEY`: For prompt enhancement via OpenAI API (used in inference/demo scripts only)
- `WANDB_API_KEY`: For Weights & Biases experiment logging (if wandb tracking is enabled)
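A small sketch for reporting which of the optional credentials are present (the key names come from the list above; the helper name is ours):

```python
import os

# Optional keys only -- nothing here is required for core finetuning.
OPTIONAL_KEYS = ("OPENAI_API_KEY", "WANDB_API_KEY")

def optional_credentials() -> dict:
    """Report which optional API keys are set in the environment."""
    return {key: key in os.environ for key in OPTIONAL_KEYS}
```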
Quick Install
```shell
# Install core dependencies (quote the specifiers so the shell
# does not treat ">=" as a redirection)
pip install "torch>=2.5.1" "torchvision>=0.20.1" "diffusers>=0.32.2" "transformers>=4.46.3" \
  "accelerate>=1.11.0" "deepspeed>=0.16.4" "peft>=0.13.2" "pydantic>=2.10.6" \
  "datasets>=2.14.4" "decord>=0.6.0" "opencv-python>=4.11.0.86" "sentencepiece>=0.2.0" \
  "numpy==1.26.0" "imageio>=2.37.0" "imageio-ffmpeg>=0.6.0" "moviepy>=2.2.1" "wandb>=0.19.7"

# Optional optimizers
pip install bitsandbytes prodigyopt came-pytorch
USE_CPP=0 pip install torchao
```
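To verify installed versions against the minimums above without extra dependencies, a naive dotted-version comparison is enough for these release-style versions (this helper is a hypothetical sketch, not part of the repo; it does not handle pre-release tags):

```python
def _parse(version: str) -> tuple:
    """Split a release-style dotted version (e.g. '2.5.1') into an int tuple."""
    return tuple(int(part) for part in version.split("."))

def meets_minimum(installed: str, minimum: str) -> bool:
    """Numeric comparison of two release-style versions."""
    return _parse(installed) >= _parse(minimum)

# In practice you would feed this from importlib.metadata, e.g.:
# from importlib.metadata import version
# assert meets_minimum(version("torch"), "2.5.1")
```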
Code Evidence
Environment variable setup from `finetune/train_ddp_t2v.sh:1-3`:
```shell
export TOKENIZERS_PARALLELISM=false
```
Mixed precision validation from `finetune/schemas/args.py:170-175`:
```python
@field_validator("mixed_precision")
def validate_mixed_precision(cls, v: str, info: ValidationInfo) -> str:
    if v == "fp16" and "cogvideox-2b" not in str(info.data.get("model_path", "")).lower():
        logging.warning(
            "All CogVideoX models except cogvideox-2b were trained with bfloat16. "
            "Using fp16 precision may lead to training instability."
        )
    return v
```
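The rule this validator enforces reduces to a simple lookup. A pure-Python restatement (the function name is ours, not the repo's):

```python
def recommended_precision(model_path: str) -> str:
    """fp16 only for cogvideox-2b; every other CogVideoX variant uses bf16."""
    return "fp16" if "cogvideox-2b" in model_path.lower() else "bf16"
```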
MPS bfloat16 check from `finetune/trainer.py:230-234`:
```python
if torch.backends.mps.is_available() and weight_dtype == torch.bfloat16:
    raise ValueError(
        "Mixed precision training with bfloat16 is not supported on MPS. "
        "Please use fp16 (recommended) or fp32 instead."
    )
```
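The same guard can be expressed without a torch dependency, which makes it easy to unit-test (function and parameter names are ours):

```python
def check_mps_dtype(mps_available: bool, dtype: str) -> None:
    """Mirror the trainer's guard: bf16 autocast is unsupported on MPS (pytorch#99272)."""
    if mps_available and dtype == "bf16":
        raise ValueError(
            "Mixed precision training with bfloat16 is not supported on MPS. "
            "Please use fp16 (recommended) or fp32 instead."
        )
```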
DeepSpeed ZeRO-2 config from `finetune/configs/zero2.yaml:1-10`:
```json
{
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "reduce_scatter": true
  },
  "bf16": { "enabled": true },
  "train_micro_batch_size_per_gpu": 1
}
```
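Because DeepSpeed configs are plain JSON, a quick sanity check before launch is cheap. The inline string below mirrors the snippet above:

```python
import json

# Inline copy of the ZeRO-2 config shown above, for a pre-launch sanity check.
ZERO2_CONFIG = """
{
  "zero_optimization": {"stage": 2, "overlap_comm": true, "reduce_scatter": true},
  "bf16": {"enabled": true},
  "train_micro_batch_size_per_gpu": 1
}
"""

config = json.loads(ZERO2_CONFIG)
assert config["zero_optimization"]["stage"] == 2
assert config["bf16"]["enabled"] is True
```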
Accelerate config targeting 8 GPUs from `finetune/accelerate_config.yaml`:
```yaml
num_processes: 8
gpu_ids: "0,1,2,3,4,5,6,7"
deepspeed_config:
  deepspeed_config_file: configs/zero2.yaml
```
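A hedged launch sketch using that config file. The `train.py` entry point shown here is an assumption; in practice the `train_*.sh` scripts wrap the launch for you:

```shell
# Launch with the 8-GPU accelerate config (entry point is hypothetical)
accelerate launch --config_file accelerate_config.yaml \
  train.py --mixed_precision bf16
```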
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `Mixed precision training with bfloat16 is not supported on MPS` | Running bf16 on Apple Silicon (pytorch#99272) | Use `--mixed_precision fp16` or `fp32` |
| `All CogVideoX models except cogvideox-2b were trained with bfloat16` | Using fp16 with a 5B model | Switch to `--mixed_precision bf16` |
| `validation_steps must be a multiple of checkpointing_steps` | Misaligned validation/checkpoint intervals | Set validation_steps as a multiple of checkpointing_steps |
| `For cogvideox-5b models, height must be 480 and width must be 720` | Wrong resolution for 5B model | Use 480x720 for CogVideoX-5B |
| CUDA OOM during training | Insufficient VRAM | Enable gradient checkpointing, VAE slicing/tiling, or use DeepSpeed ZeRO-3 |
| NCCL timeout during distributed training | Latent precomputation takes too long | Increase `nccl_timeout` (default 1800s) |
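The third error in the table comes from an interval-alignment rule. A minimal restatement for checking your arguments up front (function name is ours):

```python
def intervals_aligned(validation_steps: int, checkpointing_steps: int) -> bool:
    """validation_steps must be a positive multiple of checkpointing_steps."""
    return validation_steps > 0 and validation_steps % checkpointing_steps == 0
```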
Compatibility Notes
- CogVideoX-2B: Trained with fp16; use `--mixed_precision fp16`.
- CogVideoX-5B / CogVideoX1.5-5B: Trained with bf16; use `--mixed_precision bf16`. fp16 may cause training instability.
- Apple Silicon (MPS): bf16 not supported due to pytorch#99272. AMP is automatically disabled on MPS. fp16 or fp32 only.
- Multi-GPU DDP: Requires `find_unused_parameters=True` because LoRA freezes most parameters.
- DeepSpeed ZeRO-3: Validation is slow; recommended to disable validation when using ZeRO-3.
- numpy: Pinned to 1.26.0 across the project. Do not upgrade.
- TOKENIZERS_PARALLELISM: Must be set to `false` to prevent HuggingFace tokenizer issues in distributed training.
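The notes above boil down to a few process-level settings; a stdlib-only sketch that mirrors them (constant names are ours, values are the defaults stated above):

```python
import os
from datetime import timedelta

# Prevent HuggingFace tokenizer fork issues in distributed training.
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")

# Default NCCL timeout; raise it if latent precomputation is slow.
NCCL_TIMEOUT = timedelta(seconds=1800)

# Needed for multi-GPU DDP because LoRA freezes most parameters.
FIND_UNUSED_PARAMETERS = True
```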
Related Pages
- Implementation:Zai_org_CogVideo_T2V_I2V_Dataset_Loader
- Implementation:Zai_org_CogVideo_Args_Parse_Args
- Implementation:Zai_org_CogVideo_CogVideoX_LoRA_Trainer_Load_Components
- Implementation:Zai_org_CogVideo_Accelerator_Setup
- Implementation:Zai_org_CogVideo_Trainer_Train
- Implementation:Zai_org_CogVideo_Trainer_Checkpoint_Validate
- Implementation:Zai_org_CogVideo_Load_Lora_Weights_Fuse