Principle:Microsoft Onnxruntime Fused Optimizer Configuration
Overview
Configuration of GPU-optimized fused optimizers for efficient parameter updates during ORTModule training.
Metadata
| Field | Value |
|---|---|
| Principle Name | Fused_Optimizer_Configuration |
| Category | API Doc |
| Domain | Accelerated_Training, PyTorch_Integration |
| Repository | microsoft/onnxruntime |
| Source Reference | docs/ORTModule_Training_Guidelines.md:L332-339 (FusedAdam), L349-372 (FP16_Optimizer) |
| Last Updated | 2026-02-10 |
Description
ORT provides FusedAdam, a fused implementation of the Adam optimizer that combines multiple GPU kernel launches into a single operation. FP16_Optimizer wraps any optimizer to enable mixed-precision gradient scaling for faster training with reduced memory.
FusedAdam
Standard PyTorch Adam performs multiple separate GPU kernel launches for each parameter update: one for computing the first moment, one for the second moment, one for the bias correction, and one for the parameter update. FusedAdam combines all these operations using multi-tensor apply, which batches gradient updates across multiple parameters into a single kernel launch.
The result is significantly reduced GPU kernel launch overhead, which becomes especially important for models with many parameters (e.g., large language models with thousands of parameter tensors).
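To make the fused work concrete, the operations FusedAdam collapses are the standard Adam update steps. Below is a minimal NumPy sketch of those steps for a single parameter tensor; it illustrates the math only, not the fused kernel itself (function and variable names are illustrative, not from the ORT API):

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. Each commented line below corresponds to a
    separate GPU kernel launch in an unfused implementation; FusedAdam
    batches them, across all parameter tensors, into a single
    multi-tensor kernel launch."""
    m[:] = beta1 * m + (1 - beta1) * grad          # first moment
    v[:] = beta2 * v + (1 - beta2) * grad * grad   # second moment
    m_hat = m / (1 - beta1 ** t)                   # bias correction
    v_hat = v / (1 - beta2 ** t)
    param[:] = param - lr * m_hat / (np.sqrt(v_hat) + eps)  # parameter update
    return param

# Illustrative call for one tensor; a real model repeats this for
# every parameter tensor, which is where the launch overhead adds up.
p = np.ones(4)
g = np.full(4, 0.5)
m = np.zeros(4)
v = np.zeros(4)
adam_step(p, g, m, v, t=1)
```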
FP16_Optimizer
FP16_Optimizer wraps an existing optimizer (including DeepSpeed's ZeRO optimizer or Apex optimizer) to provide dynamic loss scaling for mixed-precision training. It maintains master weights in FP32 while performing forward and backward passes in FP16, and applies dynamic loss scaling to prevent gradient underflow.
FP16_Optimizer is designed to complement -- not replace -- DeepSpeed and Apex libraries, addressing specific inefficiencies in their mixed-precision implementations.
Theoretical Basis
Fused optimizers reduce GPU kernel launch overhead by combining parameter update operations. FP16_Optimizer implements dynamic loss scaling to prevent gradient underflow in mixed-precision training.
- Kernel Launch Overhead -- Each GPU kernel launch incurs a fixed overhead (typically 5-10 microseconds). For models with thousands of parameters, the cumulative overhead of per-parameter kernel launches can become a significant fraction of total training time. Multi-tensor apply reduces this by processing all parameters in a single launch.
- Mixed-Precision Training -- Using FP16 for forward/backward passes reduces memory footprint and enables tensor core acceleration on modern GPUs. However, FP16's limited dynamic range (approximately 5.96e-8 to 65504) can cause gradient underflow, where small gradient values round to zero.
- Dynamic Loss Scaling -- To prevent gradient underflow, the loss is multiplied by a large scaling factor before the backward pass. The resulting gradients are then unscaled before the optimizer step. The scale factor is dynamically adjusted: increased when no overflow is detected, decreased when overflow occurs.
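The scale-adjustment loop described above can be sketched in plain Python. This is a generic illustration of dynamic loss scaling, not FP16_Optimizer's internals; the default constants (initial scale 2^16, growth/backoff factors, growth interval) are common conventions from mixed-precision trainers, assumed here for illustration:

```python
import math

class DynamicLossScaler:
    """Minimal dynamic loss scaling: grow the scale after a run of
    overflow-free steps, shrink it immediately when overflow occurs."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def scale_loss(self, loss):
        # Multiply the loss before backward so small FP16 gradients
        # do not underflow to zero.
        return loss * self.scale

    def unscale(self, grads):
        # Divide gradients back to their true magnitude before the
        # optimizer step.
        return [g / self.scale for g in grads]

    def update(self, found_overflow):
        if found_overflow:
            # Inf/NaN detected in gradients: back off and skip the step.
            self.scale *= self.backoff_factor
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps >= self.growth_interval:
                self.scale *= self.growth_factor
                self._good_steps = 0

def has_overflow(grads):
    # Overflow check: any Inf/NaN means the scaled gradients exceeded
    # FP16 range and the step must be skipped.
    return any(math.isinf(g) or math.isnan(g) for g in grads)
```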
Usage
```python
from onnxruntime.training.optim import FusedAdam
from onnxruntime.training.optim.fp16_optimizer import FP16_Optimizer

model = build_model()

# Replace standard AdamW with FusedAdam
optimizer = FusedAdam(model.parameters(), lr=1e-4)

# Optionally wrap with FP16_Optimizer for mixed-precision
optimizer = FP16_Optimizer(optimizer)
```
Implemented By
Implementation:Microsoft_Onnxruntime_FusedAdam_FP16Optimizer
Related Pages
- ORT Accelerated Training -- ORTModule acceleration that complements fused optimizers
- ORTModule Training Loop -- Uses fused optimizers in the training loop
- Memory Optimization -- Additional memory reduction techniques