Principle: Microsoft ONNX Runtime Distributed Training Loop
| Field | Value |
|---|---|
| Principle Name | Distributed_Training_Loop |
| Overview | Orchestration of the distributed training loop across multiple GPUs and nodes with gradient synchronization. |
| Category | API Doc |
| Domains | Distributed_Training, Training_Infrastructure |
| Source Repository | microsoft/onnxruntime |
| Last Updated | 2026-02-10 |
Overview
Orchestration of the distributed training loop across multiple GPUs and nodes with gradient synchronization. TrainingRunner::Run() orchestrates the full distributed training pipeline: iterating over data batches, executing forward/backward passes, synchronizing gradients across processes via NCCL AllReduce, updating parameters, and periodically evaluating and checkpointing.
Description
TrainingRunner::Run() orchestrates the full distributed training loop through the following stages:
Per-Step Execution
For each training step, the loop performs:
- Data fetching: Retrieves the current batch from the DataLoader via PrepareFeedNamesAndFeeds().
- Forward pass: Computes model outputs and loss.
- Backward pass: Computes gradients for all trainable parameters.
- Gradient accumulation: If gradient_accumulation_steps > 1, gradients are accumulated across multiple micro-batches before the weight update.
- Gradient synchronization: NCCL AllReduce averages gradients across all data-parallel replicas.
- Parameter update: The optimizer (Adam/Lamb/SGD) applies the gradient update to model parameters.
- Learning rate scheduling: The learning rate is adjusted according to the configured schedule.
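The accumulate-then-synchronize-then-update pattern above can be sketched in Python. This is an illustrative model, not the actual C++ API: `train_step`, `all_reduce_mean`, and `grad_fn` are hypothetical names, and the AllReduce is simulated in-process rather than via NCCL.

```python
# Illustrative sketch of one training step with gradient accumulation and
# a simulated AllReduce. Not the real onnxruntime C++ implementation.

def all_reduce_mean(grads_per_replica):
    """Average gradients elementwise across data-parallel replicas
    (an in-process stand-in for NCCL AllReduce)."""
    n = len(grads_per_replica)
    return [sum(g) / n for g in zip(*grads_per_replica)]

def train_step(params, micro_batches, grad_fn, lr, accumulation_steps):
    """Accumulate gradients over micro-batches, then apply one SGD update.
    grad_fn(params, batch) plays the role of the forward + backward pass."""
    accum = [0.0] * len(params)
    for batch in micro_batches[:accumulation_steps]:
        grads = grad_fn(params, batch)               # forward + backward
        accum = [a + g for a, g in zip(accum, grads)]
    # In multi-replica training, all_reduce_mean would run here on `accum`
    # before the optimizer applies the update.
    return [p - lr * (a / accumulation_steps) for p, a in zip(params, accum)]
```

The real loop substitutes an optimizer such as Adam or Lamb for the plain SGD update shown in the last line.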
Pipeline Parallel Execution
When pipeline_parallel_size > 1, the execution pattern changes:
- The model is partitioned across pipeline stages, with each rank executing one stage.
- The PipelineScheduler manages the micro-batch scheduling across pipeline stages.
- Forward and backward passes are interleaved across micro-batches to improve pipeline utilization.
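The benefit of interleaving micro-batches can be quantified with a simplified, forward-only pipeline model (this is a back-of-the-envelope sketch, not onnxruntime's PipelineScheduler): with S stages and M micro-batches, each stage does M units of work over M + S - 1 clock ticks, so utilization rises as M grows.

```python
# Toy utilization model for a pipelined schedule (forward-only view).
# Assumes equal-cost stages and micro-batches; the real PipelineScheduler
# also interleaves backward passes.

def pipeline_utilization(num_stages, num_micro_batches):
    """Fraction of ticks each stage is busy: M units of work spread over
    the M + S - 1 ticks it takes micro-batches to drain the pipeline."""
    s, m = num_stages, num_micro_batches
    return m / (m + s - 1)
```

For example, 4 stages with a single micro-batch leave each stage busy only 25% of the time, while 16 micro-batches raise that to about 84%, which is why micro-batch interleaving matters.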
Periodic Operations
The training loop also performs periodic operations:
- Loss display: Logs the current training loss at display_loss_steps intervals.
- Evaluation: Runs the model on test data at evaluation_period intervals.
- Checkpointing: Saves model state at checkpoint_period intervals.
- TensorBoard logging: Writes scalar, histogram, and norm summaries if TensorBoard is configured.
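The periodic checks reduce to step-modulo tests against the configured intervals. A minimal sketch, using the parameter names from the list above as plain function arguments (the dispatch function itself is hypothetical):

```python
# Sketch of the periodic bookkeeping dispatch. The interval names mirror
# the configuration described above; the function itself is illustrative.

def periodic_actions(step, display_loss_steps, evaluation_period,
                     checkpoint_period):
    """Return which periodic operations fire at this (1-based) step."""
    actions = []
    if step % display_loss_steps == 0:
        actions.append("display_loss")
    if step % evaluation_period == 0:
        actions.append("evaluate")
    if step % checkpoint_period == 0:
        actions.append("checkpoint")
    return actions
```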
Session Modes
The loop operates in three session modes:
- ModelUpdateStep: Full forward/backward/update cycle.
- GradientAccumulateStep: Forward/backward only, accumulating gradients.
- EvaluateStep: Forward pass only for evaluation.
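The three modes differ only in which phases of the step they execute, which can be captured in a small dispatch table (an illustrative Python rendering; the enum values echo the mode names above but are not the C++ definitions):

```python
# Illustrative mapping from session mode to executed phases.
from enum import Enum

class SessionMode(Enum):
    MODEL_UPDATE_STEP = "ModelUpdateStep"
    GRADIENT_ACCUMULATE_STEP = "GradientAccumulateStep"
    EVALUATE_STEP = "EvaluateStep"

def phases(mode):
    """Which phases each mode executes, per the description above."""
    if mode is SessionMode.MODEL_UPDATE_STEP:
        return ["forward", "backward", "update"]
    if mode is SessionMode.GRADIENT_ACCUMULATE_STEP:
        return ["forward", "backward"]
    return ["forward"]
```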
Theoretical Basis
Data-parallel distributed training replicates the model on each GPU, computes gradients independently on different data, then averages gradients across all replicas before the parameter update. This maintains model consistency while increasing effective batch size.
The key theoretical properties:
- Gradient averaging: AllReduce computes the mean gradient across all replicas; for equal-sized shards of a mean-reduced loss, this equals the gradient computed on the union of all replicas' data.
- Linear scaling: The effective batch size scales linearly with the number of replicas: effective_batch = batch_size * data_parallel_size.
- Gradient accumulation: Accumulating gradients over K micro-batches before updating is equivalent to using a K-times larger batch, but with reduced memory requirements.
- Synchronous training: All replicas perform the same number of updates and see the same averaged gradients, maintaining model consistency.
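The gradient-averaging property can be checked numerically. The sketch below uses a one-parameter mean-squared-error loss (a toy setup chosen for illustration): averaging per-replica gradients over equal-sized shards gives exactly the gradient over the union of the data.

```python
# Numeric check of the gradient-averaging property for a mean-reduced
# squared-error loss with equal-sized data shards per replica.

def mse_grad(p, data):
    """d/dp of mean((p - x)^2) over data."""
    return sum(2 * (p - x) for x in data) / len(data)

p = 0.5
replica_data = [[1.0, 2.0], [3.0, 4.0]]          # equal-sized shards
avg_of_grads = sum(mse_grad(p, d) for d in replica_data) / len(replica_data)
grad_of_union = mse_grad(p, [x for shard in replica_data for x in shard])
# The two quantities agree exactly for equal shard sizes.
```

With unequal shard sizes the plain mean over replicas would weight shards unevenly, which is one reason data-parallel training keeps per-replica batch sizes uniform.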
Usage
The training loop is the core execution step:
- Initialize the TrainingRunner with parameters and environment.
- Create DataLoader instances for training and test data.
- Call TrainingRunner::Run() with the data loaders.
- The method returns once num_train_steps training steps have completed.
- The trained model parameters are available for export.