Heuristic:Zai org CogVideo Training Hyperparameter Defaults

From Leeroopedia




Knowledge Sources
Domains Training, Hyperparameters, Video_Generation
Last Updated 2026-02-10 02:00 GMT

Overview

CogVideoX training defaults: lr=2e-5 (Diffusers LoRA), lr=1e-5 (SAT SFT), Adam betas=(0.9, 0.95), gradient clip 1.0 (Diffusers) or 0.1 (SAT), NCCL timeout 1800s, and 1000+ iterations for LoRA / 500+ for SFT.

Description

The CogVideoX training pipeline ships with several hyperparameter defaults that differ from common deep learning practice. The Adam beta2 value (0.95 versus the standard 0.999) makes the optimizer more responsive to recent gradients, which is common in video/vision transformer training. The NCCL timeout is extended to 1800 seconds to accommodate the slow latent precomputation phase. The SAT framework uses a much more aggressive gradient clip (0.1) than the Diffusers pipeline (1.0).
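The practical effect of lowering beta2 can be seen from the exponential moving average horizon: an EMA with decay beta averages over roughly 1/(1 - beta) recent steps. A minimal pure-Python sketch (illustrative only, not from the CogVideoX codebase):

```python
def ema_horizon(beta: float) -> float:
    """Approximate number of recent steps an EMA with decay `beta` averages over."""
    return 1.0 / (1.0 - beta)

# Standard Adam beta2 tracks roughly 1000 steps of squared gradients;
# CogVideoX's 0.95 tracks only about 20, so the second-moment estimate
# adapts much faster to recent gradient magnitudes.
print(ema_horizon(0.999))  # ≈ 1000 steps
print(ema_horizon(0.95))   # ≈ 20 steps
```

This is why a lower beta2 reads as "more responsive": the squared-gradient statistic forgets old history an order of magnitude sooner.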

Usage

Apply these defaults when starting a new training run unless you have specific reasons to deviate. These values have been empirically validated by the CogVideo team across multiple training runs.

The Insight (Rule of Thumb)

Learning rate:

  • Diffusers LoRA: `lr = 2e-5` (default).
  • SAT SFT: `lr = 1e-5`.
  • SAT LoRA: `lr = 1e-3` to `5e-4` (much higher due to fewer parameters).
  • Prodigy optimizer: `lr ≈ 1.0` (self-adaptive; typical lr values like 1e-5 result in extremely slow convergence).
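The learning-rate defaults above can be summarized in a small lookup table. The dictionary keys are a hypothetical naming scheme for illustration; the values are transcribed from the list, and the tuple for SAT LoRA is the recommended range, not a single default:

```python
# Learning-rate defaults transcribed from the CogVideoX docs above.
DEFAULT_LR = {
    ("diffusers", "lora"): 2e-5,
    ("sat", "sft"): 1e-5,
    ("sat", "lora"): (1e-3, 5e-4),  # recommended range; higher because fewer params train
    ("prodigy", "any"): 1.0,        # self-adaptive optimizer; lr near 1.0
}
```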

Adam parameters:

  • beta1: 0.9 (standard).
  • beta2: 0.95 (NOT the standard 0.999). More responsive to recent gradients.
  • CAME optimizer: Uses triple betas and dual epsilon `(1e-30, 1e-16)`.

Gradient clipping:

  • Diffusers: `max_grad_norm = 1.0`.
  • SAT: `gradient_clipping = 0.1` (much more aggressive).
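Global-norm clipping rescales the whole gradient vector when its L2 norm exceeds the threshold, so SAT's 0.1 shrinks large updates ten times harder than Diffusers' 1.0. A minimal pure-Python sketch of the operation (what `torch.nn.utils.clip_grad_norm_` does, shown here without torch):

```python
import math

def clip_grad_norm(grads: list[float], max_norm: float) -> list[float]:
    """Scale `grads` down so their global L2 norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        return [g * scale for g in grads]
    return grads

grads = [3.0, 4.0]                  # global norm = 5.0
print(clip_grad_norm(grads, 1.0))   # Diffusers default: rescaled to norm 1.0
print(clip_grad_norm(grads, 0.1))   # SAT default: rescaled to norm 0.1
```

Note that clipping preserves the gradient's direction; only its magnitude is capped.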

Training iterations:

  • LoRA: 1000+ iterations recommended.
  • SFT: 500+ iterations sufficient.

Distributed training:

  • NCCL timeout: 1800 seconds (30 minutes) to prevent spurious timeouts during latent precomputation.
  • Validation steps: Must be a multiple of checkpointing steps.
  • SFT validation: Peak VRAM may exceed 24GB; disable validation on low-VRAM GPUs.
  • ZeRO-3: Validation is slow; recommended to disable.
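The 1800-second timeout is simply 30 minutes. In a PyTorch launcher it would be passed as a `timedelta` to `torch.distributed.init_process_group`; the torch call below is shown as a comment and is a sketch, not a quote from the repo:

```python
from datetime import timedelta

# CogVideoX extends the NCCL timeout to 30 minutes so that ranks idling
# during the slow latent precomputation phase are not killed by the
# collective-timeout watchdog.
NCCL_TIMEOUT = timedelta(seconds=1800)

# Hypothetical usage in a distributed launcher:
#   torch.distributed.init_process_group(backend="nccl", timeout=NCCL_TIMEOUT)
```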

Reasoning

Adam beta2=0.95 from `finetune/schemas/args.py:54-55`:

beta1: float = 0.9
beta2: float = 0.95

Prodigy learning rate warning from `finetune/utils/optimizer_utils.py:122-125`:

if learning_rate <= 0.1:
    logger.warning(
        "Learning rate is too low. When using prodigy, it's generally better "
        "to set learning rate around 1.0"
    )

SAT training iterations from `sat/configs/sft.yaml:8`:

train_iters: 1000  # Suggest more than 1000 For Lora and SFT For 500 is enough

SAT learning rate from `sat/configs/sft.yaml:58`:

lr: 0.00001  # Between 1E-3 and 5E-4 For Lora and 1E-5 For SFT

Validation/checkpoint alignment from `finetune/schemas/args.py:138-139`:

if values.get("checkpointing_steps") and v % values["checkpointing_steps"] != 0:
    raise ValueError("validation_steps must be a multiple of checkpointing_steps")
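The check means validation must land exactly on checkpoint boundaries. A hypothetical standalone helper mirroring the same rule:

```python
def check_validation_steps(validation_steps: int, checkpointing_steps: int) -> None:
    """Raise if validation would fire between checkpoints (mirrors the args.py rule)."""
    if checkpointing_steps and validation_steps % checkpointing_steps != 0:
        raise ValueError("validation_steps must be a multiple of checkpointing_steps")

check_validation_steps(1000, 500)  # OK: 1000 is a multiple of 500
try:
    check_validation_steps(750, 500)  # raises: 750 is not a multiple of 500
except ValueError as e:
    print(e)
```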
