Environment:Zai org CogVideo Diffusers Finetuning Environment

From Leeroopedia


Knowledge Sources
Domains: Video_Generation, Deep_Learning, Finetuning
Last Updated: 2026-02-10 02:00 GMT

Overview

Ubuntu/Linux GPU environment with Python 3.10+, PyTorch >= 2.5.1, a CUDA GPU with bf16/fp16 mixed-precision support, HuggingFace Diffusers >= 0.32.2, DeepSpeed >= 0.16.4, and HuggingFace Accelerate >= 1.11.0, used for LoRA and full SFT fine-tuning of CogVideoX video diffusion models.

Description

This environment provides the full stack for fine-tuning CogVideoX models using the HuggingFace Diffusers-based training pipeline. It supports both LoRA (Low-Rank Adaptation) and full SFT (Supervised Fine-Tuning) with distributed training via HuggingFace Accelerate and DeepSpeed ZeRO Stage 2/3. The pipeline uses bf16 mixed precision for CogVideoX-5B models and fp16 for CogVideoX-2B. GPU memory management relies on gradient checkpointing, VAE slicing/tiling, and precomputed latent caching.
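The per-resolution latent cache mentioned above can be organized along these lines. The helper name and directory scheme below are illustrative assumptions, not the pipeline's actual layout:

```python
from pathlib import Path

def latent_cache_dir(cache_root: str, height: int, width: int, num_frames: int) -> Path:
    """Return a cache directory keyed by resolution and frame count.

    Caching latents per resolution means a run at 480x720x49 never
    collides with one at, say, 768x1360x81. (Illustrative scheme only.)
    """
    return Path(cache_root) / f"latents_{height}x{width}x{num_frames}"
```

Because latents are precomputed once and reused, changing the training resolution invalidates the cache, which is why the disk requirement scales with the number of distinct resolutions used.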

Usage

Use this environment for any LoRA fine-tuning or full SFT workflow on CogVideoX models using the Diffusers-based pipeline in the `finetune/` package. This is the mandatory prerequisite for running the training scripts (`train_ddp_t2v.sh`, `train_ddp_i2v.sh`, `train_zero_t2v.sh`, `train_zero_i2v.sh`).

System Requirements

Category                  Requirement                                Notes
OS                        Linux (Ubuntu recommended)                 CUDA-compatible OS required
Hardware (LoRA 2B)        NVIDIA GPU >= 16GB VRAM                    e.g., RTX 4080
Hardware (LoRA 5B)        NVIDIA GPU >= 24GB VRAM                    e.g., RTX 4090
Hardware (LoRA 1.5-5B)    NVIDIA GPU >= 35GB VRAM                    e.g., A100 40GB
Hardware (SFT 2B)         NVIDIA GPU >= 36GB VRAM                    e.g., A100 40GB (DDP)
Hardware (SFT 5B ZeRO-3)  8x NVIDIA GPU >= 28GB VRAM                 e.g., 8x RTX 5090
Python                    >= 3.10                                    pyproject.toml targets py310
CUDA                      >= 11.0 (bf16 support required for 5B)     Ampere+ recommended
Disk                      Sufficient for latent cache                Cached latents stored on disk per resolution

Dependencies

System Packages

  • NVIDIA CUDA Toolkit (compatible with PyTorch build)
  • `ffmpeg` (for video I/O via imageio-ffmpeg)

Python Packages

  • `torch` >= 2.5.1
  • `torchvision` >= 0.20.1
  • `diffusers` >= 0.32.2
  • `transformers` >= 4.46.3
  • `accelerate` >= 1.11.0
  • `deepspeed` >= 0.16.4
  • `peft` >= 0.13.2
  • `pydantic` >= 2.10.6
  • `datasets` >= 2.14.4
  • `decord` >= 0.6.0
  • `opencv-python` >= 4.11.0.86
  • `sentencepiece` >= 0.2.0
  • `numpy` == 1.26.0 (pinned)
  • `imageio` >= 2.37.0
  • `imageio-ffmpeg` >= 0.6.0
  • `moviepy` >= 2.2.1
  • `wandb` >= 0.19.7

Optional Packages

  • `torchao` — For 4-bit/8-bit optimizers or CPU offload optimizer (install with `USE_CPP=0 pip install torchao`)
  • `bitsandbytes` — For 8-bit Adam/AdamW optimizers
  • `prodigyopt` — For Prodigy self-adaptive optimizer
  • `came-pytorch` — For CAME optimizer

Credentials

No API tokens are required for the core finetuning pipeline. Models are loaded from public HuggingFace repositories.

Optional:

  • `OPENAI_API_KEY`: For prompt enhancement via OpenAI API (used in inference/demo scripts only)
  • `WANDB_API_KEY`: For Weights & Biases experiment logging (if wandb tracking is enabled)

Quick Install

# Install core dependencies
pip install "torch>=2.5.1" "torchvision>=0.20.1" "diffusers>=0.32.2" "transformers>=4.46.3" \
    "accelerate>=1.11.0" "deepspeed>=0.16.4" "peft>=0.13.2" "pydantic>=2.10.6" \
    "datasets>=2.14.4" "decord>=0.6.0" "opencv-python>=4.11.0.86" "sentencepiece>=0.2.0" \
    "numpy==1.26.0" "imageio>=2.37.0" "imageio-ffmpeg>=0.6.0" "moviepy>=2.2.1" "wandb>=0.19.7"

# Optional optimizers
pip install bitsandbytes prodigyopt came-pytorch
USE_CPP=0 pip install torchao
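After installing, the version floors above can be sanity-checked from Python. The snippet below is a minimal stdlib-only sketch; its naive comparison handles dotted numeric versions but not pre-release or local suffixes (e.g. `+cu121`):

```python
from importlib.metadata import version, PackageNotFoundError

def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare dotted numeric versions, e.g. '2.10.0' >= '2.9.9'."""
    parse = lambda v: tuple(int(p) for p in v.split("."))
    return parse(installed) >= parse(minimum)

def check(package: str, minimum: str) -> None:
    """Print whether an installed package satisfies its version floor."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        print(f"{package}: NOT INSTALLED (need >= {minimum})")
        return
    ok = meets_minimum(installed, minimum)
    print(f"{package} {installed}: {'ok' if ok else f'TOO OLD (need >= {minimum})'}")

for pkg, floor in [("torch", "2.5.1"), ("diffusers", "0.32.2"),
                   ("accelerate", "1.11.0"), ("deepspeed", "0.16.4")]:
    check(pkg, floor)
```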

Code Evidence

Environment variable setup from `finetune/train_ddp_t2v.sh:1-3`:

export TOKENIZERS_PARALLELISM=false

Mixed precision validation from `finetune/schemas/args.py:170-175`:

@field_validator("mixed_precision")
def validate_mixed_precision(cls, v: str, info: ValidationInfo) -> str:
    if v == "fp16" and "cogvideox-2b" not in str(info.data.get("model_path", "")).lower():
        logging.warning(
            "All CogVideoX models except cogvideox-2b were trained with bfloat16. "
            "Using fp16 precision may lead to training instability."
        )
    return v
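Outside of pydantic, the same rule reduces to a plain function. The sketch below mirrors the validator's logic for illustration and is not part of the project's API:

```python
import logging

def warn_if_fp16_mismatch(mixed_precision: str, model_path: str) -> bool:
    """Return True (and log a warning) when fp16 is requested for a non-2B model."""
    if mixed_precision == "fp16" and "cogvideox-2b" not in model_path.lower():
        logging.warning(
            "All CogVideoX models except cogvideox-2b were trained with bfloat16. "
            "Using fp16 precision may lead to training instability."
        )
        return True
    return False
```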

MPS bfloat16 check from `finetune/trainer.py:230-234`:

if torch.backends.mps.is_available() and weight_dtype == torch.bfloat16:
    raise ValueError(
        "Mixed precision training with bfloat16 is not supported on MPS. "
        "Please use fp16 (recommended) or fp32 instead."
    )
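Taken together with the fp16/bf16 guidance above, dtype selection is a small decision table. The helper below is an illustrative sketch, not the trainer's code; it returns dtype names as strings to stay framework-independent:

```python
def resolve_weight_dtype(mixed_precision: str, device_type: str) -> str:
    """Map a --mixed_precision flag to a dtype name, rejecting bf16 on MPS."""
    names = {"fp32": "float32", "fp16": "float16", "bf16": "bfloat16"}
    if mixed_precision not in names:
        raise ValueError(f"unknown mixed_precision: {mixed_precision!r}")
    if device_type == "mps" and mixed_precision == "bf16":
        raise ValueError(
            "Mixed precision training with bfloat16 is not supported on MPS. "
            "Please use fp16 (recommended) or fp32 instead."
        )
    return names[mixed_precision]
```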

DeepSpeed ZeRO-2 config from `finetune/configs/zero2.yaml:1-10`:

{
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": true,
        "reduce_scatter": true
    },
    "bf16": { "enabled": true },
    "train_micro_batch_size_per_gpu": 1
}

Accelerate config targeting 8 GPUs from `finetune/accelerate_config.yaml`:

num_processes: 8
gpu_ids: "0,1,2,3,4,5,6,7"
deepspeed_config:
  deepspeed_config_file: configs/zero2.yaml

Common Errors

  • `Mixed precision training with bfloat16 is not supported on MPS`
    Cause: Running bf16 on Apple Silicon (pytorch#99272). Fix: use `--mixed_precision fp16` or `fp32`.
  • `All CogVideoX models except cogvideox-2b were trained with bfloat16`
    Cause: Using fp16 with a 5B model. Fix: switch to `--mixed_precision bf16`.
  • `validation_steps must be a multiple of checkpointing_steps`
    Cause: Misaligned validation/checkpoint intervals. Fix: set validation_steps to a multiple of checkpointing_steps.
  • `For cogvideox-5b models, height must be 480 and width must be 720`
    Cause: Wrong resolution for a 5B model. Fix: use 480x720 for CogVideoX-5B.
  • CUDA OOM during training
    Cause: Insufficient VRAM. Fix: enable gradient checkpointing, VAE slicing/tiling, or use DeepSpeed ZeRO-3.
  • NCCL timeout during distributed training
    Cause: Latent precomputation takes too long. Fix: increase `nccl_timeout` (default 1800s).
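Two of the argument errors above are simple arithmetic constraints. The standalone sketch below mirrors those rules for illustration (it is not the project's validator):

```python
def check_intervals(validation_steps: int, checkpointing_steps: int) -> None:
    """Validation runs must land on checkpoint boundaries."""
    if validation_steps % checkpointing_steps != 0:
        raise ValueError("validation_steps must be a multiple of checkpointing_steps")

def check_resolution(model_path: str, height: int, width: int) -> None:
    """CogVideoX-5B is trained at a fixed 480x720 resolution."""
    if "cogvideox-5b" in model_path.lower() and (height, width) != (480, 720):
        raise ValueError("For cogvideox-5b models, height must be 480 and width must be 720")
```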

Compatibility Notes

  • CogVideoX-2B: Supports fp16 mixed precision only. Use `--mixed_precision fp16`.
  • CogVideoX-5B / CogVideoX1.5-5B: Requires bf16 mixed precision. Using fp16 leads to training instability.
  • Apple Silicon (MPS): bf16 not supported due to pytorch#99272. AMP is automatically disabled on MPS. fp16 or fp32 only.
  • Multi-GPU DDP: Requires `find_unused_parameters=True` because LoRA freezes most parameters.
  • DeepSpeed ZeRO-3: Validation is slow; recommended to disable validation when using ZeRO-3.
  • numpy: Pinned to 1.26.0 across the project. Do not upgrade.
  • TOKENIZERS_PARALLELISM: Must be set to `false` to prevent HuggingFace tokenizer issues in distributed training.
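The TOKENIZERS_PARALLELISM setting can also be applied from Python, mirroring the `export` in the shell scripts; a minimal sketch:

```python
import os

# Must be set before transformers/tokenizers is imported, so the tokenizers
# library does not warn about forking after parallelism has been used.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```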
