Principle:Microsoft Onnxruntime Training Monitoring and Debugging
Overview
Monitoring and debugging tools for diagnosing convergence issues in ORTModule-accelerated training.
Metadata
| Field | Value |
|---|---|
| Principle Name | Training_Monitoring_and_Debugging |
| Category | API Doc |
| Domain | Accelerated_Training, PyTorch_Integration |
| Repository | microsoft/onnxruntime |
| Source Reference | docs/ORTModule_Convergence_Notes.md:L33-38 (subscriber), L91-115 (inspect_activation) |
| Last Updated | 2026-02-10 |
Description
ONNX Runtime provides tools for comparing activation statistics between native PyTorch and ORTModule execution. The GlobalSubscriberManager captures per-step activation statistics, while inspect_activation provides inline tensor inspection for debugging convergence differences.
Convergence Investigation Workflow
When ORTModule-accelerated training shows different convergence behavior compared to native PyTorch, the debugging workflow follows these steps:
- Discovery -- Identify convergence issues through discrepancies in training loss, evaluation loss, or model-specific metrics (e.g., AUC). Runtime failures, such as the loss scaler hitting its minimum scale, can also signal convergence problems.
- Eliminate Randomness -- Before investigating, rule out randomness by setting deterministic seeds, setting dropout to 0, and configuring deterministic compute.
- Collect Statistics -- Use GlobalSubscriberManager and StatisticsSubscriber to capture per-step activation statistics for both the baseline (PyTorch) run and the ORTModule run.
- Compare Activations -- Use the merge_activation_summary tool to generate per-step summaries, then manually compare them to find the first significant divergence point.
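The comparison step above can be sketched as a scan over per-step, per-activation summaries. The nested-dict layout and the relative-tolerance threshold here are illustrative assumptions, not the format merge_activation_summary emits:

```python
# Sketch: locate the first step/activation where PyTorch and ORTModule
# summary statistics diverge beyond a tolerance. The layout
# {step: {activation_name: mean}} is a simplifying assumption.
def first_divergence(pt_stats, ort_stats, rtol=1e-3):
    for step in sorted(pt_stats):
        for name, pt_mean in pt_stats[step].items():
            ort_mean = ort_stats[step][name]
            denom = max(abs(pt_mean), abs(ort_mean), 1e-12)
            if abs(pt_mean - ort_mean) / denom > rtol:
                return step, name  # first significant divergence point
    return None  # no divergence found

pt = {0: {"layer1": 0.50, "layer2": 0.80}, 1: {"layer1": 0.51, "layer2": 0.90}}
ort = {0: {"layer1": 0.50, "layer2": 0.80}, 1: {"layer1": 0.51, "layer2": 1.20}}
print(first_divergence(pt, ort))  # → (1, 'layer2')
```

In practice the comparison is done over the merged summaries on disk, but the logic is the same: walk steps in order and stop at the first statistic that disagrees.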
GlobalSubscriberManager
The GlobalSubscriberManager subscribes to nn.Module forward outputs across the entire model hierarchy. For each training step, it captures tensor statistics (min, max, mean, standard deviation, etc.) and writes them to files organized by step number.
The subscriber can be attached before or after wrapping with ORTModule. It works on both native PyTorch and ORTModule execution paths, enabling side-by-side comparison.
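A minimal sketch of what gets recorded per step is shown below. The directory layout (one directory per step, one file per activation) and the exact set of statistics are assumptions modeled on the description above, not the StatisticsSubscriber implementation:

```python
import os
import statistics

# Sketch: write summary statistics for one activation at one training step.
# Layout assumption: <output_dir>/step_<n>/<activation_name>.
def record_stats(output_dir, step, name, values):
    step_dir = os.path.join(output_dir, f"step_{step}")
    os.makedirs(step_dir, exist_ok=True)
    summary = {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
    }
    with open(os.path.join(step_dir, name), "w") as f:
        for key, val in summary.items():
            f.write(f"{key} {val}\n")
    return summary

s = record_stats("pt_out_demo", 0, "layer1_out", [1.0, 2.0, 3.0])
print(s["mean"])  # → 2.0
```

Because the files are keyed by step number and activation name, a baseline run and an ORTModule run with the same model produce directly comparable trees.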
inspect_activation
For cases where GlobalSubscriberManager is insufficient (it only captures nn.Module forward outputs), inspect_activation can be inserted inline within a module's forward() method to capture intermediate tensor values. Each call must use a unique activation name to prevent file overwrites.
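The unique-name requirement can be illustrated with a toy inline inspector. The real inspect_activation lives in onnxruntime and dumps tensor statistics to files; this stand-in only demonstrates the naming rule and the inline call style:

```python
# Toy stand-in for inline activation inspection: each label may be used
# only once, otherwise a later dump would overwrite an earlier one.
_seen_names = set()

def inspect_activation(name, tensor):
    if name in _seen_names:
        raise ValueError(f"duplicate activation name: {name!r}")
    _seen_names.add(name)
    # A real implementation would dump statistics for `tensor` here;
    # returning the tensor lets the call sit inline inside forward().
    return tensor

x = inspect_activation("attn_scores", [0.1, 0.9])   # fine
try:
    inspect_activation("attn_scores", [0.2, 0.8])   # reused name
except ValueError as e:
    print(e)  # duplicate activation name: 'attn_scores'
```

The identity-style return is what makes inline use convenient: a line such as `t = inspect_activation("proj_out", t)` can be dropped into an existing forward() without changing the surrounding logic.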
Theoretical Basis
Convergence debugging in accelerated training relies on comparing numerical behavior at each layer:
- Numerical Equivalence -- ORT's optimized kernels may produce slightly different floating-point results due to different operation ordering, fused operators, or different precision handling. Understanding where and how these differences accumulate is key to diagnosing convergence issues.
- Activation Statistics -- Summary statistics (min, max, mean, std) provide a compact representation of tensor distributions at each layer. Large discrepancies between PyTorch and ORTModule statistics at a particular layer indicate the source of divergence.
- Per-Step Analysis -- Convergence issues may only manifest after many training steps due to error accumulation. Collecting statistics over a range of steps helps identify when divergence begins.
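The operation-ordering effect described above is easy to reproduce in plain floating point: the same three addends summed in two different orders give different results, which is exactly the kind of per-kernel difference that can accumulate over many steps:

```python
# Floating-point addition is not associative: a fused operator or a
# different reduction order can legally change the result.
a = (1e16 + 1.0) - 1e16   # the 1.0 is absorbed by rounding before subtraction
b = (1e16 - 1e16) + 1.0   # cancellation happens first, so the 1.0 survives
print(a, b)  # → 0.0 1.0
```

Neither ordering is "wrong"; both are valid IEEE 754 evaluations. This is why small statistical differences between PyTorch and ORTModule are expected, and why the goal of the comparison is to find the first *significant* divergence rather than exact equality.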
Usage
```python
# Baseline (native PyTorch)
from onnxruntime.training.utils.hooks import GlobalSubscriberManager, StatisticsSubscriber

GlobalSubscriberManager.subscribe(
    model, [StatisticsSubscriber(output_dir="pt_out", override_output_dir=True)]
)
```

```python
# ORTModule run
from onnxruntime.training.ortmodule import ORTModule
from onnxruntime.training.utils.hooks import GlobalSubscriberManager, StatisticsSubscriber

model = ORTModule(model)
GlobalSubscriberManager.subscribe(
    model, [StatisticsSubscriber(output_dir="ort_out", override_output_dir=True)]
)
```
Implemented By
Implementation:Microsoft_Onnxruntime_GlobalSubscriberManager_Usage
Related Pages
- ORTModule Training Loop -- The training loop being monitored
- ORT Accelerated Training -- The acceleration mechanism being debugged
- Heuristic:Microsoft_Onnxruntime_Convergence_Debugging_Tips