Principle:Microsoft Onnxruntime Fused Optimizer Configuration
Overview
Configuration of GPU-optimized fused optimizers for efficient parameter updates during ORTModule training.
Metadata
| Field | Value |
|---|---|
| Principle Name | Fused_Optimizer_Configuration |
| Category | API Doc |
| Domain | Accelerated_Training, PyTorch_Integration |
| Repository | microsoft/onnxruntime |
| Source Reference | docs/ORTModule_Training_Guidelines.md:L332-339 (FusedAdam), L349-372 (FP16_Optimizer) |
| Last Updated | 2026-02-10 |
Description
ORT provides FusedAdam, a fused implementation of the Adam optimizer that combines multiple GPU kernel launches into a single operation. FP16_Optimizer wraps any optimizer to enable mixed-precision gradient scaling for faster training with reduced memory.
FusedAdam
Standard PyTorch Adam performs multiple separate GPU kernel launches for each parameter update: one for computing the first moment, one for the second moment, one for the bias correction, and one for the parameter update. FusedAdam combines all these operations using multi-tensor apply, which batches gradient updates across multiple parameters into a single kernel launch.
The result is significantly reduced GPU kernel launch overhead, which becomes especially important for models with many parameters (e.g., large language models with thousands of parameter tensors).
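To make the fused work concrete, the operations FusedAdam collapses are the standard Adam update steps. Below is a minimal NumPy sketch of those steps for a single parameter tensor; it illustrates the math only, not the fused kernel itself (function and variable names are illustrative, not from the ORT API):

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. Each commented line below corresponds to a
    separate GPU kernel launch in an unfused implementation; FusedAdam
    batches them, across all parameter tensors, into a single
    multi-tensor kernel launch."""
    m[:] = beta1 * m + (1 - beta1) * grad          # first moment
    v[:] = beta2 * v + (1 - beta2) * grad * grad   # second moment
    m_hat = m / (1 - beta1 ** t)                   # bias correction
    v_hat = v / (1 - beta2 ** t)
    param[:] = param - lr * m_hat / (np.sqrt(v_hat) + eps)  # parameter update
    return param

# Illustrative call for one tensor; a real model repeats this for
# every parameter tensor, which is where the launch overhead adds up.
p = np.ones(4)
g = np.full(4, 0.5)
m = np.zeros(4)
v = np.zeros(4)
adam_step(p, g, m, v, t=1)
```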
FP16_Optimizer
FP16_Optimizer wraps an existing optimizer (including DeepSpeed's ZeRO optimizer or Apex optimizer) to provide dynamic loss scaling for mixed-precision training. It maintains master weights in FP32 while performing forward and backward passes in FP16, and applies dynamic loss scaling to prevent gradient underflow.
FP16_Optimizer is designed to complement -- not replace -- DeepSpeed and Apex libraries, addressing specific inefficiencies in their mixed-precision implementations.
Theoretical Basis
Fused optimizers reduce GPU kernel launch overhead by combining parameter update operations. FP16_Optimizer implements dynamic loss scaling to prevent gradient underflow in mixed-precision training.
- Kernel Launch Overhead -- Each GPU kernel launch incurs a fixed overhead (typically 5-10 microseconds). For models with thousands of parameters, the cumulative overhead of per-parameter kernel launches can become a significant fraction of total training time. Multi-tensor apply reduces this by processing all parameters in a single launch.
- Mixed-Precision Training -- Using FP16 for forward/backward passes reduces memory footprint and enables tensor core acceleration on modern GPUs. However, FP16's limited dynamic range (approximately 5.96e-8 to 65504) can cause gradient underflow, where small gradient values round to zero.
- Dynamic Loss Scaling -- To prevent gradient underflow, the loss is multiplied by a large scaling factor before the backward pass. The resulting gradients are then unscaled before the optimizer step. The scale factor is dynamically adjusted: increased when no overflow is detected, decreased when overflow occurs.
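The scale-adjustment loop described above can be sketched in plain Python. This is a generic illustration of dynamic loss scaling, not FP16_Optimizer's internals; the default constants (initial scale 2^16, growth/backoff factors, growth interval) are common conventions from mixed-precision trainers, assumed here for illustration:

```python
import math

class DynamicLossScaler:
    """Minimal dynamic loss scaling: grow the scale after a run of
    overflow-free steps, shrink it immediately when overflow occurs."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def scale_loss(self, loss):
        # Multiply the loss before backward so small FP16 gradients
        # do not underflow to zero.
        return loss * self.scale

    def unscale(self, grads):
        # Divide gradients back to their true magnitude before the
        # optimizer step.
        return [g / self.scale for g in grads]

    def update(self, found_overflow):
        if found_overflow:
            # Inf/NaN detected in gradients: back off and skip the step.
            self.scale *= self.backoff_factor
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps >= self.growth_interval:
                self.scale *= self.growth_factor
                self._good_steps = 0

def has_overflow(grads):
    # Overflow check: any Inf/NaN means the scaled gradients exceeded
    # FP16 range and the step must be skipped.
    return any(math.isinf(g) or math.isnan(g) for g in grads)
```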
Usage
```python
from onnxruntime.training.optim import FusedAdam
from onnxruntime.training.optim.fp16_optimizer import FP16_Optimizer

model = build_model()

# Replace standard AdamW with FusedAdam
optimizer = FusedAdam(model.parameters(), lr=1e-4)

# Optionally wrap with FP16_Optimizer for mixed-precision
optimizer = FP16_Optimizer(optimizer)
```
Implemented By
Implementation:Microsoft_Onnxruntime_FusedAdam_FP16Optimizer
Related Pages
- ORT Accelerated Training -- ORTModule acceleration that complements fused optimizers
- ORTModule Training Loop -- Uses fused optimizers in the training loop
- Memory Optimization -- Additional memory reduction techniques