Implementation:FMInference FlexLLMGen DeepSpeed BF16 Optimizer
| Field | Value |
|---|---|
| Sources | Repo: FlexLLMGen, Upstream: DeepSpeed |
| Domains | Mixed_Precision_Training, Memory_Optimization |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Vendored DeepSpeed optimizer wrapper that maintains BFloat16 model parameters with FP32 master copies, enabling memory-efficient mixed-precision training with ZeRO-style partitioning.
Description
The bf16_optimizer.py file (458 lines) is a vendored copy of DeepSpeed's BF16 optimizer, which extends the ZeROOptimizer base class. It manages the dual representation of model parameters: BFloat16 copies for forward/backward computation and FP32 master copies for numerically stable optimizer updates.
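The dual-representation idea can be sketched independently of DeepSpeed's internals. The following is a minimal, hypothetical illustration (function names are not from the vendored file; the real class also flattens and partitions these tensors):

```python
import torch

# Minimal sketch of the bf16/fp32 dual representation (illustrative only;
# names and structure are simplified relative to DeepSpeed's BF16_Optimizer).
def make_master_copies(bf16_params):
    """Create FP32 master copies for a list of bf16 parameters."""
    masters = []
    for p in bf16_params:
        m = p.detach().clone().float()  # FP32 master weight
        m.requires_grad = True
        masters.append(m)
    return masters

def copy_masters_to_model(masters, bf16_params):
    """After the optimizer step, cast updated FP32 masters back to bf16."""
    with torch.no_grad():
        for m, p in zip(masters, bf16_params):
            p.copy_(m.to(torch.bfloat16))
```

The bf16 copies stay in sync with the masters only because every optimizer update is followed by an explicit cast-back; the forward/backward pass never touches the FP32 tensors directly.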
Key components include:
- BF16_Optimizer -- The main class, which wraps an inner optimizer (e.g., Adam) and maintains parallel bf16/fp32 parameter groups. It handles:
  - bf16_groups and bf16_groups_flat -- Flattened bf16 parameters used in forward/backward passes.
  - fp32_groups_flat_partition -- FP32 master weights partitioned across data-parallel ranks.
  - fp32_groups_gradients -- FP32 gradient views for numerically stable accumulation.
  - Gradient clipping via get_global_norm_of_tensors and clip_tensors_by_global_norm.
  - All-gather operations that synchronize bf16 weights across data-parallel ranks after each optimizer step.
- _setup_for_real_optimizer -- Initializes the bf16/fp32 parameter groups, creates the flattened tensors, partitions them across data-parallel ranks, and sets up gradient buffers with proper NCCL alignment (4-byte boundary alignment for fp16/bf16).
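The global-norm clipping step can be sketched as follows. This is a simplified, single-process stand-in for get_global_norm_of_tensors / clip_tensors_by_global_norm; the real DeepSpeed versions additionally all-reduce the norm across data-parallel and model-parallel ranks:

```python
import torch

def global_norm_of_tensors(tensors, norm_type=2):
    # Norm over all tensors taken together (single-process sketch;
    # DeepSpeed additionally reduces this value across ranks).
    return torch.norm(torch.stack([torch.norm(t, norm_type) for t in tensors]),
                      norm_type)

def clip_tensors_by_global_norm(tensors, max_norm, norm_type=2, eps=1e-6):
    """Scale all tensors in place so their combined norm is at most max_norm."""
    total_norm = global_norm_of_tensors(tensors, norm_type)
    clip_coef = max_norm / (total_norm + eps)
    if clip_coef < 1.0:
        for t in tensors:
            t.mul_(clip_coef)
    return total_norm
```

Clipping by the global norm (rather than per-tensor) preserves the direction of the overall gradient vector while bounding its magnitude.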
The optimizer step flow is: (1) accumulate gradients in bf16, (2) copy to fp32 gradient buffers, (3) clip gradients globally, (4) run the inner optimizer on fp32 master weights, (5) copy updated fp32 weights back to bf16, (6) all-gather bf16 updates across ranks.
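The step flow above can be sketched in simplified form. This is a hypothetical single-process illustration using torch.nn.utils.clip_grad_norm_ in place of DeepSpeed's partitioned clipping utilities, not the multi-rank implementation:

```python
import torch

def bf16_optimizer_step(bf16_params, fp32_masters, inner_optimizer, clip_grad=0.0):
    # (2) copy accumulated bf16 gradients into fp32 gradient buffers
    for p, m in zip(bf16_params, fp32_masters):
        m.grad = p.grad.float()
    # (3) clip gradients by their global norm
    if clip_grad > 0.0:
        torch.nn.utils.clip_grad_norm_(fp32_masters, clip_grad)
    # (4) run the inner optimizer on the fp32 master weights
    inner_optimizer.step()
    inner_optimizer.zero_grad()
    # (5) copy updated fp32 weights back into the bf16 model parameters
    with torch.no_grad():
        for p, m in zip(bf16_params, fp32_masters):
            p.copy_(m.to(torch.bfloat16))
    # (6) in the real optimizer, an all-gather would now broadcast each
    #     rank's updated bf16 partition to all data-parallel ranks
```

Steps (2) and (5) are where the bf16 and fp32 representations meet; everything between them runs entirely in FP32 for numerical stability.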
Usage
This optimizer is selected automatically by the DeepSpeed engine when bf16.enabled is set to true in the DeepSpeed configuration. It is part of the vendored benchmark dependencies in FlexLLMGen.
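For reference, a minimal DeepSpeed config that triggers this optimizer might look like the following; only the bf16 block is required to select it, and the other fields are illustrative placeholders:

```json
{
  "train_batch_size": 8,
  "bf16": {
    "enabled": true
  },
  "optimizer": {
    "type": "Adam",
    "params": { "lr": 1e-4 }
  }
}
```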
Code Reference
| Field | Value |
|---|---|
| Repository | FlexLLMGen |
| File | benchmark/third_party/DeepSpeed/deepspeed/runtime/bf16_optimizer.py |
| Lines | 1-458 |
| Type | AUTO_KEEP (vendored dependency) |
Key class signature:
```python
class BF16_Optimizer(ZeROOptimizer):
    def __init__(self, init_optimizer, param_names, mpu=None,
                 clip_grad=0.0, norm_type=2,
                 allgather_bucket_size=5000000000,
                 dp_process_group=None, timers=None):
        ...
```
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| init_optimizer | Optimizer | Yes | The inner optimizer (e.g., FusedAdam) that operates on FP32 master weights |
| param_names | dict | Yes | Mapping of parameter names for checkpoint saving |
| mpu | object | No | Model parallel unit for tensor model parallelism |
| clip_grad | float | No | Maximum gradient norm for clipping (default: 0.0 disables) |
| norm_type | int | No | Norm type for gradient clipping (default: 2 for L2) |
| allgather_bucket_size | int | No | Bucket size for NCCL all-gather operations (default: 5,000,000,000, per the signature above) |
| dp_process_group | ProcessGroup | No | Data parallel process group |
Outputs
| Output | Type | Description |
|---|---|---|
| updated bf16 params | Tensor | BFloat16 model parameters updated after optimizer step |
| state_dict | dict | Checkpoint-compatible state dictionary with FP32 master weights |