
Principle:FMInference FlexLLMGen BFloat16 Mixed Precision Optimization

From Leeroopedia


Sources: Upstream: DeepSpeed; Paper: FlexGen
Domains: Mixed_Precision_Training, Memory_Optimization
Last Updated: 2026-02-09 00:00 GMT

Overview

A training strategy that uses BFloat16 (bf16) for forward and backward computation while maintaining FP32 master copies of weights for numerically stable optimizer updates, reducing memory usage by nearly half without sacrificing convergence.

Description

BFloat16 mixed-precision optimization uses the bf16 floating-point format (1 sign bit, 8 exponent bits, 7 mantissa bits) for the bulk of training computation. Unlike FP16 (which has 5 exponent bits and 10 mantissa bits), bf16 has the same dynamic range as FP32 (8 exponent bits), making it less prone to overflow/underflow and eliminating the need for loss scaling.
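The format trade-off can be illustrated by emulating bf16 in pure Python: truncating an fp32 encoding to its top 16 bits keeps the sign and the full 8-bit exponent but only 7 mantissa bits (a simplification — real hardware rounds to nearest rather than truncating; `to_bf16` is a hypothetical helper):

```python
import struct

def to_bf16(x: float) -> float:
    """Emulate bf16 storage: keep the top 16 bits of the fp32 encoding
    (1 sign + 8 exponent + 7 mantissa bits), zeroing the low 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# Same dynamic range as fp32: a value near 3e38 stays finite,
# whereas fp16 (max ~6.5e4) would overflow to infinity.
big = to_bf16(3.0e38)

# But only ~3 decimal digits of mantissa precision: the 0.001 is
# below one bf16 ulp near 1.0 (2**-7 ≈ 0.0078) and is dropped.
small_delta = to_bf16(1.001)  # → 1.0
```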

The dual-representation approach works as follows:

  • Forward pass -- Uses bf16 parameters, producing bf16 activations and loss.
  • Backward pass -- Computes bf16 gradients via backpropagation.
  • Gradient conversion -- Bf16 gradients are copied into FP32 gradient buffers for numerical precision during accumulation.
  • Gradient clipping -- Global gradient norm is computed in FP32 and clipped to prevent training instability.
  • Optimizer step -- The inner optimizer (e.g., Adam) updates FP32 master weights using FP32 gradients. Optimizer states (momentum, variance) remain in FP32.
  • Weight copy-back -- Updated FP32 master weights are cast back to bf16 for the next forward pass.
  • All-gather -- In data-parallel training, each rank's bf16 weight partition is gathered across all ranks.
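The steps above (minus the distributed all-gather) can be sketched as one training step over a flat parameter list, with bf16 emulated by bit truncation. This is a minimal single-process sketch: `sgd_step` and `to_bf16` are hypothetical helpers, and plain SGD stands in for the Adam inner optimizer:

```python
import struct

def to_bf16(x: float) -> float:
    """Emulate bf16 storage by zeroing the low 16 bits of the fp32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def sgd_step(master_fp32, grads_bf16, lr=0.1, clip_norm=1.0):
    # Gradient conversion: copy bf16 gradients into fp32 buffers.
    grads = [float(g) for g in grads_bf16]
    # Gradient clipping: global norm computed and applied in fp32.
    norm = sum(g * g for g in grads) ** 0.5
    if norm > clip_norm:
        grads = [g * clip_norm / norm for g in grads]
    # Optimizer step: update fp32 master weights (SGD stands in for Adam).
    master_fp32 = [w - lr * g for w, g in zip(master_fp32, grads)]
    # Weight copy-back: cast master weights to bf16 for the next forward pass.
    model_bf16 = [to_bf16(w) for w in master_fp32]
    return master_fp32, model_bf16

master, model = sgd_step([1.0], [2.0])  # grad norm 2.0 is clipped to 1.0
```

Because the master copy stays in fp32, updates smaller than one bf16 ulp still accumulate across steps instead of being rounded away.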

This approach is combined with ZeRO-style partitioning: FP32 master weights are partitioned across data-parallel ranks, so each GPU only stores 1/N of the master weights (where N is the data-parallel world size). Flattened tensors are aligned to NCCL boundaries (4 bytes) for efficient communication.
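The partitioning arithmetic can be sketched with a hypothetical `partition` helper that rounds each rank's share up to an alignment boundary (the source states a 4-byte NCCL alignment; here the unit is elements for simplicity):

```python
def partition(flat_len, world_size, align=4):
    """Split a flattened buffer of flat_len elements across world_size ranks,
    rounding each rank's share up to a multiple of `align` elements so that
    partition boundaries stay aligned for collective communication."""
    per_rank = -(-flat_len // world_size)      # ceil(flat_len / world_size)
    per_rank = -(-per_rank // align) * align   # round up to alignment boundary
    parts = []
    for r in range(world_size):
        start = min(r * per_rank, flat_len)
        end = min(start + per_rank, flat_len)
        parts.append((start, end))             # this rank's slice of real data
    return parts, per_rank * world_size        # slices, padded total length

# 10 parameters over 4 ranks: each rank owns 4 aligned slots,
# and the flattened buffer is padded from 10 to 16 elements.
parts, padded = partition(10, 4)
```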

Usage

Use bf16 mixed-precision when training on hardware with native bf16 support (e.g., NVIDIA A100, H100). It is preferred over fp16 mixed-precision because it does not require loss scaling and is more robust for large-scale training. The technique is enabled in DeepSpeed via the bf16.enabled configuration flag.
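A minimal DeepSpeed configuration fragment that turns the feature on (other settings such as batch size and optimizer are omitted):

```json
{
  "bf16": {
    "enabled": true
  }
}
```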

Theoretical Basis

Bf16 has 2^8 = 256 exponent values, matching FP32's dynamic range of approximately 1.2 x 10^-38 to 3.4 x 10^38. This eliminates the gradient underflow problem that plagues FP16 training. The reduced mantissa (7 bits vs. 23 bits in FP32) means individual operations have lower precision, but the key insight is that gradient accumulation and optimizer updates -- which are most sensitive to precision -- are performed in FP32. Memory savings come from storing parameters and activations in bf16 (2 bytes vs. 4 bytes), yielding approximately 50% reduction in model parameter memory.
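The parameter-memory arithmetic can be checked directly. This is illustrative accounting only: activations and optimizer state are excluded, and the 7B parameter count is an arbitrary example:

```python
def param_memory_gib(num_params, bytes_per_param):
    """Memory for one copy of the parameters, in GiB."""
    return num_params * bytes_per_param / 1024**3

n = 7_000_000_000                    # e.g. a 7B-parameter model (arbitrary example)
fp32_gib = param_memory_gib(n, 4)    # ~26.1 GiB at 4 bytes per parameter
bf16_gib = param_memory_gib(n, 2)    # ~13.0 GiB at 2 bytes per parameter
savings = 1 - bf16_gib / fp32_gib    # 50% for the parameter copy
```

Note that the saving applies to the bf16 model copy (and activations); the fp32 master weights and optimizer states keep their full size, which is why ZeRO-style partitioning of those buffers is applied on top.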
