
Principle:Microsoft Onnxruntime Training Monitoring and Debugging

From Leeroopedia


Overview

Monitoring and debugging tools for diagnosing convergence issues in ORTModule-accelerated training.

Metadata

Field Value
Principle Name Training_Monitoring_and_Debugging
Category API Doc
Domain Accelerated_Training, PyTorch_Integration
Repository microsoft/onnxruntime
Source Reference docs/ORTModule_Convergence_Notes.md:L33-38 (subscriber), L91-115 (inspect_activation)
Last Updated 2026-02-10

Description

ONNX Runtime provides tools for comparing activation statistics between native PyTorch and ORTModule execution. The GlobalSubscriberManager captures per-step activation statistics, while inspect_activation provides inline tensor inspection for debugging convergence differences.

Convergence Investigation Workflow

When ORTModule-accelerated training converges differently from native PyTorch, the debugging workflow follows these steps:

  1. Discovery -- Identify convergence issues through discrepancies in training loss, evaluation loss, or model-specific metrics (e.g., AUC). Runtime failures, such as the loss scaler reaching its minimum, can also indicate a convergence problem.
  2. Eliminate Randomness -- Before investigating, rule out randomness by setting deterministic seeds, setting dropout to 0, and configuring deterministic compute.
  3. Collect Statistics -- Use GlobalSubscriberManager and StatisticsSubscriber to capture per-step activation statistics for both baseline (PyTorch) and ORTModule runs.
  4. Compare Activations -- Use the merge_activation_summary tool to generate per-step summaries, then manually compare to find the first significant divergence point.

GlobalSubscriberManager

The GlobalSubscriberManager subscribes to nn.Module forward outputs across the entire model hierarchy. For each training step, it captures tensor statistics (min, max, mean, standard deviation, etc.) and writes them to files organized by step number.

The subscriber can be attached before or after wrapping with ORTModule. It works on both native PyTorch and ORTModule execution paths, enabling side-by-side comparison.

inspect_activation

For cases where GlobalSubscriberManager is insufficient (it only captures nn.Module forward outputs), inspect_activation can be inserted inline within a module's forward() method to capture intermediate tensor values. Each call must use a unique activation name to prevent file overwrites.

Theoretical Basis

Convergence debugging in accelerated training relies on comparing numerical behavior at each layer:

  • Numerical Equivalence -- ORT's optimized kernels may produce slightly different floating-point results due to different operation ordering, fused operators, or different precision handling. Understanding where and how these differences accumulate is key to diagnosing convergence issues.
  • Activation Statistics -- Summary statistics (min, max, mean, std) provide a compact representation of tensor distributions at each layer. Large discrepancies between PyTorch and ORTModule statistics at a particular layer indicate the source of divergence.
  • Per-Step Analysis -- Convergence issues may only manifest after many training steps due to error accumulation. Collecting statistics over a range of steps helps identify when divergence begins.
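The first bullet can be made concrete without any framework: floating-point addition is not associative, so evaluating the same terms in a different order (as fused or reordered kernels effectively do) can change the result. These tiny differences are what the per-layer statistics are meant to localize before they accumulate.

```python
# The same three terms summed in two orders give different float results:
# 1e16 + 1.0 rounds back to 1e16 (the spacing between doubles there is 2),
# so the order of operations determines whether the 1.0 survives.
a = (1e16 + 1.0) - 1e16
b = (1e16 - 1e16) + 1.0
print(a, b)  # -> 0.0 1.0
```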

Usage

# Baseline (native PyTorch)
from onnxruntime.training.utils.hooks import GlobalSubscriberManager, StatisticsSubscriber

GlobalSubscriberManager.subscribe(
    model, [StatisticsSubscriber(output_dir="pt_out", override_output_dir=True)]
)

# ORTModule (same API; the subscriber can also be attached after wrapping)
from onnxruntime.training.ortmodule import ORTModule

model = ORTModule(model)
GlobalSubscriberManager.subscribe(
    model, [StatisticsSubscriber(output_dir="ort_out", override_output_dir=True)]
)

Implemented By

Implementation:Microsoft_Onnxruntime_GlobalSubscriberManager_Usage
