
Heuristic:Zai org CogVideo BF16 FP16 Precision Selection

From Leeroopedia



Knowledge Sources
Domains Precision, Training_Configuration, Video_Generation
Last Updated 2026-02-10 02:00 GMT

Overview

Precision selection rule: CogVideoX-2B requires fp16 mixed precision; all 5B models require bf16. Using the wrong precision causes training instability and NaN losses.

Description

CogVideoX models were trained in a specific floating-point precision and are sensitive to precision mismatches during fine-tuning. The 2B model was trained in fp16, while all 5B variants (CogVideoX-5B, CogVideoX1.5-5B) were trained in bf16. Running a 5B model in fp16 causes numerical overflow in gradient computation, because fp16's dynamic range is far narrower than the bf16 range the model was trained in; the result is training instability, NaN losses, and degraded output quality. Additionally, Apple Silicon (MPS) devices do not support bf16 (see pytorch#99272).
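The mapping above can be captured as a small lookup table. A minimal sketch — the table and helper name are illustrative, assembled from the rule on this page rather than taken from the repository:

```python
# Sketch: map CogVideoX variants to the precision they were trained in.
PRECISION_BY_MODEL = {
    "cogvideox-2b": "fp16",
    "cogvideox1.5-5b": "bf16",
    "cogvideox-5b": "bf16",
}

def required_precision(model_path: str) -> str:
    """Return the required --mixed_precision value for a model path."""
    path = model_path.lower()
    for key, precision in PRECISION_BY_MODEL.items():
        if key in path:
            return precision
    # Per the rule, everything except the 2B model uses bf16.
    return "bf16"
```

For example, `required_precision("THUDM/CogVideoX-2b")` yields `"fp16"`, while any 5B path yields `"bf16"`.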

Usage

Apply this heuristic before starting any training run to select the correct `--mixed_precision` setting. Also check when deploying to non-NVIDIA hardware (MPS, Intel XPU) where bf16 may not be supported.

The Insight (Rule of Thumb)

  • Action: Set `--mixed_precision` based on model size.
  • Value:
    • CogVideoX-2B: `--mixed_precision fp16`
    • CogVideoX-5B / CogVideoX1.5-5B: `--mixed_precision bf16`
    • SAT CogVideoX-2B: `fp16.enabled: True`, `bf16.enabled: False` in sft.yaml
    • SAT CogVideoX-5B: `bf16.enabled: True`, `fp16.enabled: False` in sft.yaml
  • Trade-off: No trade-off — this is a hard constraint from the model's training distribution. Using the wrong precision degrades quality.
  • Platform exception: On Apple Silicon (MPS), bf16 is unsupported; use fp16 or fp32 only. AMP is automatically disabled on MPS.
  • GPU exception: On GPUs with compute capability < 8 (pre-Ampere), bf16 is not natively supported; the captioning pipeline auto-detects and falls back to fp16.

Reasoning

BF16 has a larger exponent range (8 bits) than fp16 (5 bits), allowing it to represent values up to ~3.4e38 vs fp16's ~65504. The 5B CogVideoX models produce activations and gradients that frequently exceed fp16's range during training, causing overflow. Since the models were pre-trained in their respective precisions, fine-tuning must match to maintain the learned weight distributions.
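The range gap follows from the exponent widths alone and can be checked in pure Python (no torch required); the helper below is a sketch computing the largest finite value of an IEEE-754-style format:

```python
def max_finite(exponent_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-754-style format with the given field widths."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_exp = bias  # the top exponent code is reserved for inf/NaN
    return (2 - 2 ** -mantissa_bits) * 2.0 ** max_exp

fp16_max = max_finite(exponent_bits=5, mantissa_bits=10)  # 65504.0
bf16_max = max_finite(exponent_bits=8, mantissa_bits=7)   # ~3.39e38
```

With only a 5-bit exponent, fp16 saturates at 65504, so any activation or gradient beyond that overflows to inf; bf16's 8-bit exponent gives it roughly the same range as fp32.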

The code enforces this via a validator in `finetune/schemas/args.py:170-175`:

# Excerpt from finetune/schemas/args.py (imports shown for context)
import logging
from pydantic import ValidationInfo, field_validator

@field_validator("mixed_precision")
def validate_mixed_precision(cls, v: str, info: ValidationInfo) -> str:
    # Warn when fp16 is requested for any model other than CogVideoX-2B.
    if v == "fp16" and "cogvideox-2b" not in str(info.data.get("model_path", "")).lower():
        logging.warning(
            "All CogVideoX models except cogvideox-2b were trained with bfloat16. "
            "Using fp16 precision may lead to training instability."
        )
    return v

And a hard error for MPS in `finetune/trainer.py:230-234`:

# Excerpt from finetune/trainer.py: refuse bf16 outright when running on MPS.
if torch.backends.mps.is_available() and weight_dtype == torch.bfloat16:
    raise ValueError(
        "Mixed precision training with bfloat16 is not supported on MPS. "
        "Please use fp16 (recommended) or fp32 instead."
    )
