
Heuristic:Deepspeedai DeepSpeed FP16 Convergence Tips

From Leeroopedia



Knowledge Sources
Domains Optimization, Deep_Learning, Numerical_Stability
Last Updated 2026-02-09 00:00 GMT

Overview

Critical precision and convergence constraints: 1-bit Adam, 0/1 Adam, and 1-bit Lamb are only verified to converge under FP16; Apex AMP and ZeRO are incompatible; MAX_GRAD_NORM is silently set to zero in FP32 mode.

Description

DeepSpeed has several precision-related constraints that affect training convergence. The communication-efficient optimizers (1-bit Adam, 0/1 Adam, 1-bit Lamb) have only been verified for convergence under FP16 training. Using them with BF16 or FP32 may produce incorrect results. Additionally, NVIDIA Apex AMP and ZeRO cannot be used simultaneously, and MAX_GRAD_NORM behaves differently depending on the precision mode: in FP16, it is passed to the FP16 wrapper, but in FP32, it is silently set to 0.0 (disabled). The FP16 dynamic loss scaling uses tuned defaults: `initial_scale_power=16`, `loss_scale_window=1000`, `hysteresis=2`.
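Putting these constraints together, a minimal config for a 1-bit Adam run under FP16 might look like the following sketch. The `fp16` values are the documented defaults quoted above; the optimizer hyperparameters (`lr`, `freeze_step`) and `train_batch_size` are illustrative placeholders, not recommendations.

```python
# Sketch of a DeepSpeed config dict enabling FP16 with the 1-bit Adam
# optimizer. fp16 values are the defaults described above; optimizer
# params and batch size are illustrative placeholders.
ds_config = {
    "train_batch_size": 32,
    "optimizer": {
        "type": "OneBitAdam",
        "params": {
            "lr": 1e-4,
            "freeze_step": 400,  # warmup steps before 1-bit compression starts
        },
    },
    "fp16": {
        "enabled": True,            # 1-bit Adam is only verified under FP16
        "initial_scale_power": 16,  # starting loss scale = 2**16 = 65536
        "loss_scale_window": 1000,
        "hysteresis": 2,
    },
}
```

The key point is the pairing: `"type": "OneBitAdam"` together with `"fp16": {"enabled": true}`; swapping in a `bf16` block instead would leave the optimizer outside its verified regime.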

Usage

Use this heuristic when configuring mixed-precision training or selecting an optimizer. If you encounter NaN losses or divergence, check that your precision settings are compatible with your chosen optimizer. If using 1-bit optimizers, ensure FP16 is enabled. If you need gradient clipping in FP32 mode, be aware it will be silently disabled.
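These compatibility rules can be checked before launching a run. The following is a hypothetical pre-flight validator (not part of DeepSpeed; the function name and structure are this page's own sketch) that mirrors the three constraints described above:

```python
# Hypothetical pre-flight check mirroring the constraints on this page.
# Not part of DeepSpeed; names and structure are this sketch's own.
ONE_BIT_OPTIMIZERS = {"OneBitAdam", "ZeroOneAdam", "OneBitLamb"}

def check_precision_config(config):
    """Return a list of warnings for known precision/optimizer conflicts."""
    warnings = []
    fp16 = config.get("fp16", {}).get("enabled", False)
    opt = config.get("optimizer", {})
    zero = config.get("zero_optimization", {}).get("stage", 0) > 0

    if opt.get("type") in ONE_BIT_OPTIMIZERS and not fp16:
        warnings.append("%s convergence is only verified under FP16" % opt["type"])
    if config.get("amp", {}).get("enabled", False) and zero:
        warnings.append("Apex AMP and ZeRO are incompatible; use native fp16 mode")
    if not fp16 and opt.get("params", {}).get("max_grad_norm", 0) > 0:
        warnings.append("max_grad_norm will be silently set to 0.0 in FP32 mode")
    return warnings
```

For example, `check_precision_config({"optimizer": {"type": "OneBitAdam"}})` flags the missing FP16 block, while a config with FP16 enabled and a ZeRO-supported optimizer passes cleanly.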

The Insight (Rule of Thumb)

  • Action 1: Use FP16 (not BF16 or FP32) with 1-bit Adam, 0/1 Adam, or 1-bit Lamb optimizers.
  • Action 2: Do not combine Apex AMP with ZeRO. Use DeepSpeed's native FP16 mode instead, which "performs similar to amp opt_mode=O2".
  • Action 3: Be aware that `max_grad_norm` in the optimizer config is ignored (set to 0.0) in FP32 mode.
  • Action 4: When using ZeRO with an untested optimizer, set `"zero_allow_untested_optimizer": true` in the config.
  • Value: FP16 dynamic loss scaling defaults: `initial_scale_power=16` (starting scale=65536), `loss_scale_window=1000` steps, `hysteresis=2`.
  • Trade-off: FP16 requires dynamic loss scaling overhead; BF16 avoids loss scaling but has lower precision mantissa.
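The loss-scaling defaults above can be made concrete with a small simulation of the standard dynamic loss-scaling scheme. This is a generic sketch of the technique, not DeepSpeed's actual scaler class: on repeated overflow (after `hysteresis` overflows) the scale is halved, and after `loss_scale_window` consecutive clean steps it is doubled.

```python
class DynamicLossScalerSketch:
    """Generic dynamic loss scaling using the defaults quoted above.
    Illustrative only; not DeepSpeed's implementation."""

    def __init__(self, initial_scale_power=16, loss_scale_window=1000, hysteresis=2):
        self.scale = 2.0 ** initial_scale_power  # 65536.0 by default
        self.window = loss_scale_window          # clean steps before growing
        self.hysteresis = hysteresis             # overflows tolerated before shrinking
        self.good_steps = 0
        self.overflows = 0

    def update(self, overflow):
        if overflow:
            self.good_steps = 0
            self.overflows += 1
            if self.overflows >= self.hysteresis:
                self.scale /= 2.0               # back off after repeated overflow
                self.overflows = 0
        else:
            self.good_steps += 1
            if self.good_steps >= self.window:  # window clean steps: try a larger scale
                self.scale *= 2.0
                self.good_steps = 0
```

With the defaults, the scale starts at 65536, survives one isolated overflow unchanged (hysteresis=2), halves on the second, and doubles only after 1000 consecutive overflow-free steps.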

Reasoning

1-bit communication-efficient optimizers use error compensation mechanisms calibrated specifically for FP16 gradient distributions. Using BF16 (which has a different mantissa precision and range) can cause the error feedback to diverge. The AMP/ZeRO incompatibility arises because both systems want to manage gradient scaling and type conversion. DeepSpeed's FP16 wrapper already provides the functionality of Apex AMP O2 level, making AMP redundant.
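The error-compensation idea can be illustrated with a toy 1-bit compression step. This is a generic sketch of error feedback, not DeepSpeed's implementation: the residual between the true update and its sign-compressed version is carried into the next step, so quantization error does not accumulate. The calibration point is that these residuals behave differently under BF16's coarser mantissa than under FP16, which is why convergence is only verified for FP16.

```python
# Toy illustration of 1-bit compression with error feedback.
# Generic sketch only; DeepSpeed's actual kernels are far more involved.
def compress_with_error_feedback(update, error):
    """Compress (update + carried error) to sign * shared mean magnitude.

    Returns the 1-bit-style compressed vector and the residual to carry
    into the next step.
    """
    corrected = [u + e for u, e in zip(update, error)]
    scale = sum(abs(c) for c in corrected) / len(corrected)      # one shared magnitude
    compressed = [scale if c >= 0 else -scale for c in corrected]  # 1 bit of sign each
    new_error = [c - q for c, q in zip(corrected, compressed)]     # residual to carry
    return compressed, new_error
```

For instance, compressing `[1.0, -3.0]` with zero carried error yields `[2.0, -2.0]` (shared magnitude 2.0) and residual `[-1.0, -1.0]`, which is added back before the next compression step.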

In FP32 mode, `max_grad_norm` is silenced because DeepSpeed handles gradient clipping outside the FP16 optimizer wrapper in non-FP16 modes, so the optimizer-level parameter does not apply and is zeroed rather than honored. This is a common source of confusion.

Code Evidence

1-bit Adam FP16 warning from `deepspeed/runtime/engine.py:1663`:

logger.warning("Currently the convergence of 1-bit Adam is only verified under FP16")

0/1 Adam FP16 warning from `deepspeed/runtime/engine.py:1670`:

logger.warning('Currently the convergence of 0/1 Adam is only verified under FP16')

1-bit Lamb FP16 warning from `deepspeed/runtime/engine.py:1677`:

logger.warning("Currently the convergence of 1-bit Lamb is only verified under FP16")

AMP + ZeRO incompatibility from `deepspeed/runtime/engine.py:1511-1513`:

assert (
    not (amp_enabled and zero_enabled)
), "Amp and ZeRO are not currently compatible, please use (legacy) fp16 mode which performs similar to amp opt_mode=O2"

MAX_GRAD_NORM silencing from `deepspeed/runtime/config.py:1008-1012`:

logger.warning(
    "DeepSpeedConfig: In FP32 mode, DeepSpeed does not permit MAX_GRAD_NORM ({}) > 0, "
    "setting to zero".format(self.optimizer_params[MAX_GRAD_NORM]))
self.optimizer_params[MAX_GRAD_NORM] = 0.0

Untested optimizer warning from `deepspeed/runtime/engine.py:1516-1521`:

if not is_zero_supported_optimizer(basic_optimizer):
    assert (
        self.zero_allow_untested_optimizer()
    ), 'You are using an untested ZeRO Optimizer. Please add <"zero_allow_untested_optimizer": true>'
    logger.warning("**** You are using ZeRO with an untested optimizer, proceed with caution *****")
