
Implementation:Microsoft Onnxruntime FusedAdam FP16Optimizer


Overview

Provides a GPU-optimized fused Adam optimizer (FusedAdam) and a mixed-precision gradient-scaling wrapper (FP16_Optimizer) for efficient parameter updates during ORTModule-accelerated training.

Metadata

Implementation Name: FusedAdam_FP16Optimizer
Type: API Doc
Language: Python
API: onnxruntime.training.optim.FusedAdam(params, lr) and onnxruntime.training.optim.fp16_optimizer.FP16_Optimizer(optimizer)
Import: from onnxruntime.training.optim import FusedAdam; from onnxruntime.training.optim.fp16_optimizer import FP16_Optimizer
Domain: Accelerated_Training, PyTorch_Integration
Repository: microsoft/onnxruntime
Source Reference: docs/ORTModule_Training_Guidelines.md, L332-339 (FusedAdam) and L349-372 (FP16_Optimizer)
Last Updated: 2026-02-10

Description

This implementation covers two complementary optimization components:

FusedAdam

FusedAdam is a drop-in replacement for PyTorch's torch.optim.AdamW that uses multi-tensor apply to batch gradient updates across multiple parameters into a single GPU kernel launch. This eliminates the per-parameter kernel launch overhead that becomes significant for models with many parameter tensors.
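As a sketch of the drop-in property, the constructor below is called exactly like torch.optim.AdamW. Only params and lr are confirmed by this page; the betas, eps, and weight_decay keyword names are assumed to follow the AdamW convention, and the snippet assumes a CUDA device with onnxruntime-training installed.

import torch
from onnxruntime.training.optim import FusedAdam

model = torch.nn.Linear(128, 64).cuda()  # the fused kernel targets GPU tensors

# Same call shape as torch.optim.AdamW; keywords beyond lr are assumed, not documented here
optimizer = FusedAdam(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.01,
)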

FP16_Optimizer

FP16_Optimizer wraps an existing optimizer to provide dynamic loss scaling for mixed-precision training. It is designed to complement the DeepSpeed and Apex libraries by addressing specific inefficiencies in their mixed-precision implementations; a conceptual sketch of the loss-scaling technique follows the list below.

FP16_Optimizer can wrap:

  • DeepSpeed's ZeRO Optimizer
  • Apex Optimizer (via amp.initialize)
  • Standard PyTorch optimizers
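For intuition, the sketch below illustrates the generic dynamic loss scaling loop that such a wrapper automates: scale the loss so small FP16 gradients survive, skip the step and shrink the scale on overflow, and periodically probe a larger scale. This is an illustration of the technique only, not ORT's internal implementation; the NaiveLossScaler class and all scale constants are hypothetical.

import torch

class NaiveLossScaler:
    # Conceptual illustration of dynamic loss scaling; not ORT's implementation.
    def __init__(self, init_scale=2.0 ** 16, backoff=0.5, growth=2.0, growth_interval=2000):
        self.scale = init_scale
        self.backoff, self.growth = backoff, growth
        self.growth_interval = growth_interval
        self._steps_since_overflow = 0

    def step(self, loss, optimizer, parameters):
        parameters = list(parameters)
        (loss * self.scale).backward()  # scale up so small FP16 gradients do not underflow
        grads = [p.grad for p in parameters if p.grad is not None]
        if any(not torch.isfinite(g).all() for g in grads):
            self.scale *= self.backoff  # overflow: skip the update and back off the scale
            self._steps_since_overflow = 0
        else:
            for g in grads:
                g.div_(self.scale)  # unscale before the real optimizer update
            optimizer.step()
            self._steps_since_overflow += 1
            if self._steps_since_overflow % self.growth_interval == 0:
                self.scale *= self.growth  # periodically probe a larger scale
        optimizer.zero_grad()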

API Signature

FusedAdam

from onnxruntime.training.optim import FusedAdam

optimizer = FusedAdam(model.parameters(), lr=1e-4)

FP16_Optimizer

from onnxruntime.training.optim.fp16_optimizer import FP16_Optimizer

optimizer = FP16_Optimizer(base_optimizer)

Key Parameters

FusedAdam

Parameter  Type      Description
params     iterable  Iterable of parameters to optimize, or dicts defining parameter groups
lr         float     Learning rate (default: 1e-3)
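Because params accepts dicts defining parameter groups, per-group settings can be passed the standard torch.optim way. A minimal sketch, assuming per-group weight_decay is honored as in AdamW; the decay/no-decay split is illustrative and model is defined elsewhere.

from onnxruntime.training.optim import FusedAdam

# Illustrative split: exempt biases and normalization weights from decay
decay, no_decay = [], []
for name, param in model.named_parameters():  # model defined elsewhere
    (no_decay if name.endswith("bias") or "norm" in name.lower() else decay).append(param)

optimizer = FusedAdam(
    [
        {"params": decay, "weight_decay": 0.01},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-4,
)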

FP16_Optimizer

Parameter  Type                   Description
optimizer  torch.optim.Optimizer  The base optimizer to wrap (DeepSpeed, Apex, or a standard PyTorch optimizer)

I/O Contract

Direction                Type                 Description
Input (FusedAdam)        Model parameters     Parameters from model.parameters()
Output (FusedAdam)       FusedAdam optimizer  Drop-in replacement for torch.optim.AdamW
Input (FP16_Optimizer)   Base optimizer       Any torch.optim.Optimizer instance
Output (FP16_Optimizer)  FP16_Optimizer       Wrapped optimizer with dynamic loss scaling
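A minimal check reflecting this contract, assuming the wrapper preserves the standard torch.optim surface that training loops rely on; the attribute list is illustrative.

import torch
from onnxruntime.training.optim.fp16_optimizer import FP16_Optimizer

base = torch.optim.AdamW(torch.nn.Linear(8, 8).parameters(), lr=1e-3)
wrapped = FP16_Optimizer(base)

# Training loops call these; the wrapper is expected to keep them available
for attr in ("step", "zero_grad", "state_dict", "load_state_dict"):
    assert hasattr(wrapped, attr), f"missing {attr}"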

Code Reference

From docs/ORTModule_Training_Guidelines.md:

FusedAdam

	model = build_model()

-	optimizer = AdamW(model.parameters(), lr=1)
+	from onnxruntime.training.optim import FusedAdam
+	optimizer = FusedAdam(model.parameters(), lr=1)

FP16_Optimizer with DeepSpeed

	optimizer = AdamW(model.parameters(), lr=1)
	model, optimizer, _, lr_scheduler = deepspeed.initialize(
			model=model, optimizer=optimizer, args=args,
			lr_scheduler=lr_scheduler, mpu=mpu, dist_init_required=False)

+	from onnxruntime.training.optim.fp16_optimizer import FP16_Optimizer
+	optimizer = FP16_Optimizer(optimizer)

FP16_Optimizer with Apex

	optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
	model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

+	from onnxruntime.training.optim.fp16_optimizer import FP16_Optimizer as ORT_FP16_Optimizer
+	optimizer = ORT_FP16_Optimizer(optimizer)
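The import alias here presumably avoids shadowing Apex's own FP16_Optimizer class, so both can coexist in one script.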

Usage Example

import deepspeed

from onnxruntime.training.ortmodule import ORTModule
from onnxruntime.training.optim import FusedAdam
from onnxruntime.training.optim.fp16_optimizer import FP16_Optimizer

# Build and wrap the model (build_model is defined elsewhere)
model = build_model()
model = ORTModule(model)

# Use FusedAdam instead of standard AdamW
optimizer = FusedAdam(model.parameters(), lr=1e-4)

# Integrate with DeepSpeed (args, mpu, and lr_scheduler are defined elsewhere)
model, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model, optimizer=optimizer, args=args,
    lr_scheduler=lr_scheduler, mpu=mpu, dist_init_required=False,
)

# Wrap with FP16_Optimizer for mixed-precision
optimizer = FP16_Optimizer(optimizer)

# Training loop: dynamic loss scaling is handled inside the wrapped optimizer
for batch in dataloader:
    loss = model(batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
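As a quick sanity check of the drop-in claim, the sketch below steps FusedAdam and torch.optim.AdamW on identical parameters and compares the results. It assumes a CUDA device and an onnxruntime-training install; the loose tolerance allows for the fused kernel reordering floating-point math and for any differing default flags.

import torch
from torch.optim import AdamW
from onnxruntime.training.optim import FusedAdam

torch.manual_seed(0)
p_ref = torch.randn(1024, device="cuda", requires_grad=True)
p_fused = p_ref.detach().clone().requires_grad_(True)

opt_ref = AdamW([p_ref], lr=1e-3, weight_decay=0.01)
opt_fused = FusedAdam([p_fused], lr=1e-3, weight_decay=0.01)

for _ in range(5):
    grad = torch.randn_like(p_ref)
    p_ref.grad = grad.clone()
    p_fused.grad = grad.clone()
    opt_ref.step()
    opt_fused.step()

print(torch.allclose(p_ref, p_fused, atol=1e-4))  # expect close, not bit-identical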

Implements

Principle:Microsoft_Onnxruntime_Fused_Optimizer_Configuration
