Implementation:Microsoft Onnxruntime FusedAdam FP16Optimizer
Overview
Provides a GPU-optimized fused Adam optimizer (FusedAdam) and a mixed-precision gradient-scaling wrapper (FP16_Optimizer) for efficient parameter updates during ORTModule-accelerated training.
Metadata
| Field | Value |
|---|---|
| Implementation Name | FusedAdam_FP16Optimizer |
| Type | API Doc |
| Language | Python |
| API | onnxruntime.training.optim.FusedAdam(params, lr) and onnxruntime.training.optim.fp16_optimizer.FP16_Optimizer(optimizer) |
| Import | from onnxruntime.training.optim import FusedAdam and from onnxruntime.training.optim.fp16_optimizer import FP16_Optimizer |
| Domain | Accelerated_Training, PyTorch_Integration |
| Repository | microsoft/onnxruntime |
| Source Reference | docs/ORTModule_Training_Guidelines.md:L332-339 (FusedAdam), L349-372 (FP16_Optimizer) |
| Last Updated | 2026-02-10 |
Description
This implementation covers two complementary optimization components:
FusedAdam
FusedAdam is a drop-in replacement for PyTorch's torch.optim.AdamW that uses multi-tensor apply to batch gradient updates across multiple parameters into a single GPU kernel launch. This eliminates the per-parameter kernel launch overhead that becomes significant for models with many parameter tensors.
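A minimal sketch of the drop-in behaviour (the toy model and data below are illustrative, not taken from the source reference; FusedAdam is used only through the params/lr interface documented on this page):
import torch
from onnxruntime.training.optim import FusedAdam
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = FusedAdam(model.parameters(), lr=1e-4)
x = torch.randn(8, 1024, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()       # fused multi-tensor update instead of per-parameter kernel launches
optimizer.zero_grad()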
FP16_Optimizer
FP16_Optimizer wraps any existing optimizer to provide dynamic loss scaling for mixed-precision training. It is designed to complement DeepSpeed and Apex libraries by addressing specific inefficiencies in their mixed-precision implementations.
FP16_Optimizer can wrap:
- DeepSpeed's ZeRO Optimizer
- Apex Optimizer (via amp.initialize)
- Standard PyTorch optimizers (see the sketch below)
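For the plain PyTorch case, a minimal wrapping sketch (the toy model and base optimizer are illustrative; only the FP16_Optimizer call itself is the API documented here):
import torch
from onnxruntime.training.optim.fp16_optimizer import FP16_Optimizer
model = torch.nn.Linear(512, 512).cuda()
base_optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# Wrapping keeps the familiar optimizer surface (step(), zero_grad()) used in the training loop.
optimizer = FP16_Optimizer(base_optimizer)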
API Signature
FusedAdam
from onnxruntime.training.optim import FusedAdam
optimizer = FusedAdam(model.parameters(), lr=1e-4)
FP16_Optimizer
from onnxruntime.training.optim.fp16_optimizer import FP16_Optimizer
optimizer = FP16_Optimizer(base_optimizer)
Key Parameters
FusedAdam
| Parameter | Type | Description |
|---|---|---|
| params | iterable | Iterable of parameters to optimize, or dicts defining parameter groups |
| lr | float | Learning rate (default: 1e-3) |
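Because params also accepts dicts defining parameter groups, per-group hyperparameters can be passed in the standard torch.optim style (a sketch assuming FusedAdam follows that convention; model.encoder and model.head are hypothetical sub-modules):
from onnxruntime.training.optim import FusedAdam
param_groups = [
    {"params": model.encoder.parameters(), "lr": 1e-4},
    {"params": model.head.parameters(), "lr": 5e-4},
]
# The lr passed here acts as the default for any group that does not set its own.
optimizer = FusedAdam(param_groups, lr=1e-3)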
FP16_Optimizer
| Parameter | Type | Description |
|---|---|---|
| optimizer | torch.optim.Optimizer | The base optimizer to wrap (can be a DeepSpeed, Apex, or standard PyTorch optimizer) |
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input (FusedAdam) | Model parameters | Parameters from model.parameters() |
| Output (FusedAdam) | FusedAdam optimizer | Drop-in replacement for torch.optim.AdamW |
| Input (FP16_Optimizer) | Base optimizer | Any torch.optim.Optimizer instance |
| Output (FP16_Optimizer) | FP16_Optimizer | Wrapped optimizer with dynamic loss scaling |
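To make the contract concrete, a small illustrative check (hypothetical, not from the source) that the wrapped object still exposes the optimizer interface a training loop relies on:
import torch
from onnxruntime.training.optim.fp16_optimizer import FP16_Optimizer
base = torch.optim.AdamW(model.parameters(), lr=1e-4)  # `model` as built earlier
optimizer = FP16_Optimizer(base)
# Downstream training code only needs the standard step()/zero_grad() surface.
assert hasattr(optimizer, "step") and hasattr(optimizer, "zero_grad")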
Code Reference
From docs/ORTModule_Training_Guidelines.md:
FusedAdam
model = build_model()
- optimizer = AdamW(model.parameters(), lr=1)
+ from onnxruntime.training.optim import FusedAdam
+ optimizer = FusedAdam(model.parameters(), lr=1)
FP16_Optimizer with DeepSpeed
optimizer = AdamW(model.parameters(), lr=1)
model, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model, optimizer=optimizer, args=args,
    lr_scheduler=lr_scheduler, mpu=mpu, dist_init_required=False)
+ from onnxruntime.training.optim.fp16_optimizer import FP16_Optimizer
+ optimizer = FP16_Optimizer(optimizer)
FP16_Optimizer with Apex
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")
+ from onnxruntime.training.optim.fp16_optimizer import FP16_Optimizer as ORT_FP16_Optimizer
+ optimizer = ORT_FP16_Optimizer(optimizer)
Usage Example
from onnxruntime.training.ortmodule import ORTModule
from onnxruntime.training.optim import FusedAdam
from onnxruntime.training.optim.fp16_optimizer import FP16_Optimizer
# Build and wrap model
model = build_model()
model = ORTModule(model)
# Use FusedAdam instead of standard AdamW
optimizer = FusedAdam(model.parameters(), lr=1e-4)
# Integrate with DeepSpeed
model, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model, optimizer=optimizer, args=args,
    lr_scheduler=lr_scheduler, mpu=mpu, dist_init_required=False,
)
# Wrap with FP16_Optimizer for mixed-precision
optimizer = FP16_Optimizer(optimizer)
# Training loop
for batch in dataloader:
    loss = model(batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
Implements
Principle:Microsoft_Onnxruntime_Fused_Optimizer_Configuration
Related Pages
- ORTModule Wrap -- Model wrapping that precedes optimizer setup
- ORTModule Training Execution -- The training loop using these optimizers
- Memory Opt Env Config -- Complementary memory optimization
- Environment:Microsoft_Onnxruntime_CUDA_GPU_Environment