
Principle:Microsoft Onnxruntime TensorBoard Monitoring

From Leeroopedia


Principle Name: TensorBoard_Monitoring
Overview: Real-time training progress visualization through TensorBoard-compatible summary event logging.
Category: API Doc
Domains: Distributed_Training, Training_Infrastructure
Source Repository: microsoft/onnxruntime
Last Updated: 2026-02-10

Overview

Real-time training progress visualization through TensorBoard-compatible summary event logging. ONNX Runtime integrates TensorBoard summary operators directly into the training graph, enabling monitoring of loss convergence, weight distributions, gradient norms, and custom metrics.

Description

ONNX Runtime's training ops include TensorBoard summary operators (SummaryScalarOp, SummaryHistogramOp, SummaryTextOp, SummaryMergeOp) that serialize training metrics into TensorBoard's protobuf event format. These operators integrate into the training graph to log loss, learning rate, weight distributions, and custom metrics.

The summary operators are registered in the kMSDomain (Microsoft domain) and operate as follows:

SummaryScalarOp

Logs scalar values (loss, learning rate, accuracy) as TensorBoard scalar summaries. Supports float, double, and bool input types. Maintains a list of tag strings for naming multiple scalar values.
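The tag-per-value pairing can be sketched in Python. This is illustrative only: ONNX Runtime implements SummaryScalarOp in C++, and the function and field names below are hypothetical, not the actual API.

```python
# Minimal sketch of scalar-summary logging: each tag names one scalar value,
# mirroring how SummaryScalarOp pairs its tag list with the input values.
# All names here are illustrative, not the actual ONNX Runtime API.

def summarize_scalars(tags, values):
    """Pair each tag with its scalar value as a list of summary records."""
    if len(tags) != len(values):
        raise ValueError("one tag is required per scalar value")
    return [{"tag": t, "simple_value": float(v)} for t, v in zip(tags, values)]

summary = summarize_scalars(["loss", "learning_rate"], [0.42, 1e-4])
```

Requiring one tag per value keeps every logged scalar addressable by name in the TensorBoard UI.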

SummaryHistogramOp

Logs tensor distributions (weight values, gradient magnitudes) as TensorBoard histograms. Supports float and double input types. Each histogram is identified by a single tag string.
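A TensorBoard histogram summary carries aggregate statistics of the tensor (minimum, maximum, element count, sum, and sum of squares) alongside the bucket counts. The sketch below computes those statistics in plain Python; the bucketing step is omitted, and the names are illustrative rather than ONNX Runtime's.

```python
# Sketch of the aggregate statistics a TensorBoard histogram summary carries,
# computed from a flat list of tensor values. SummaryHistogramOp does this
# in C++ over the input tensor; this Python version is illustrative only.

def histogram_stats(values):
    return {
        "min": min(values),
        "max": max(values),
        "num": len(values),
        "sum": sum(values),
        "sum_squares": sum(v * v for v in values),
    }

stats = histogram_stats([1.0, 2.0, 3.0])
```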

SummaryTextOp

Logs text strings as TensorBoard text summaries. Useful for logging configuration, hyperparameters, or qualitative samples during training.

SummaryMergeOp

Merges multiple summary protobufs into a single combined summary. This operator concatenates the outputs of multiple scalar, histogram, and text summary operators for efficient writing.

Configuration

TensorBoard integration is configured via TrainingRunner::Parameters:

  • log_dir: Path to the directory where TensorBoard event files are written.
  • summary_name: Name identifier for the summary node in the graph.
  • scalar_names: List of graph node names to log as scalars.
  • histogram_names: List of graph node names to log as histograms.
  • norm_names: List of graph node names whose norms should be logged.

TensorBoard logging is automatically enabled when log_dir is set and the process is not a performance test. Only rank 0 writes TensorBoard events (MPIContext::GetInstance().GetWorldRank() == 0) to avoid duplicate logging in distributed training.
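The enablement rule above can be expressed as a single predicate. This is a sketch of the described logic, not ONNX Runtime code; the function and parameter names are illustrative.

```python
# Sketch of the enablement rule: TensorBoard logging is active only when a
# log directory is configured, the run is not a performance test, and the
# process is rank 0 (so only one worker writes events in distributed runs).
# Names are illustrative, not the actual ONNX Runtime API.

def tensorboard_enabled(log_dir, is_perf_test, world_rank):
    return bool(log_dir) and not is_perf_test and world_rank == 0

enabled = tensorboard_enabled("/tmp/tb_logs", False, 0)
```

Gating on rank 0 avoids duplicate (and potentially corrupting) concurrent writes to the same event file.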

Theoretical Basis

Training monitoring via scalar, histogram, and text summaries enables real-time diagnosis of training dynamics: loss convergence, gradient distribution, learning rate scheduling, and early detection of training instabilities.

Key monitoring signals:

  • Loss curves: A steadily decreasing loss indicates convergence. A plateau suggests the learning rate is too low or the model lacks capacity, while a rising loss points to a learning rate that is too high or to divergence.
  • Gradient histograms: Healthy gradients have moderate magnitude and stable distribution. Vanishing gradients (near-zero) or exploding gradients (very large) indicate architecture or learning rate problems.
  • Weight distributions: Gradual, smooth weight evolution indicates stable training. Sudden changes or very large/small values indicate instabilities.
  • Learning rate schedule: Visualization confirms the scheduled learning rate changes are occurring as expected (warmup, decay, etc.).
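A reader consuming the logged gradient norms (e.g. via norm_names) might apply a simple health check like the one below. The thresholds are illustrative defaults, not values from ONNX Runtime.

```python
import math

# Sketch of an exploding/vanishing gradient check based on the L2 norm of
# the gradient values. Thresholds are illustrative, not from ONNX Runtime.

def gradient_health(grads, vanish_threshold=1e-7, explode_threshold=1e3):
    norm = math.sqrt(sum(g * g for g in grads))
    if norm < vanish_threshold:
        return "vanishing"
    if norm > explode_threshold:
        return "exploding"
    return "healthy"

status = gradient_health([0.1, -0.2, 0.05])
```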

Usage

TensorBoard monitoring is enabled through configuration and operates automatically:

  1. Set log_dir in TrainingRunner::Parameters to the desired TensorBoard log directory.
  2. Configure scalar_names, histogram_names, and norm_names for the desired metrics.
  3. The training loop automatically logs summaries during execution.
  4. Launch TensorBoard pointing to the log directory to visualize training progress.
