Principle: Microsoft ONNX Runtime Distributed Training Loop
| Field | Value |
|---|---|
| Principle Name | Distributed_Training_Loop |
| Overview | Orchestration of the distributed training loop across multiple GPUs and nodes with gradient synchronization. |
| Category | API Doc |
| Domains | Distributed_Training, Training_Infrastructure |
| Source Repository | microsoft/onnxruntime |
| Last Updated | 2026-02-10 |
Overview
Orchestration of the distributed training loop across multiple GPUs and nodes with gradient synchronization. TrainingRunner::Run() orchestrates the full distributed training pipeline: iterating over data batches, executing forward/backward passes, synchronizing gradients across processes via NCCL AllReduce, updating parameters, and periodically evaluating and checkpointing.
Description
TrainingRunner::Run() orchestrates the full distributed training loop through the following stages:
Per-Step Execution
For each training step, the loop performs:
- Data fetching: Retrieves the current batch from the DataLoader via PrepareFeedNamesAndFeeds().
- Forward pass: Computes model outputs and loss.
- Backward pass: Computes gradients for all trainable parameters.
- Gradient accumulation: If gradient_accumulation_steps > 1, gradients are accumulated across multiple micro-batches before the weight update.
- Gradient synchronization: NCCL AllReduce averages gradients across all data-parallel replicas.
- Parameter update: The optimizer (Adam/Lamb/SGD) applies the gradient update to model parameters.
- Learning rate scheduling: The learning rate is adjusted according to the configured schedule.
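The accumulate-then-synchronize-then-update pattern above can be sketched in Python. This is an illustrative model, not the actual C++ API: `train_step`, `all_reduce_mean`, and `grad_fn` are hypothetical names, and the AllReduce is simulated in-process rather than via NCCL.

```python
# Illustrative sketch of one training step with gradient accumulation and
# a simulated AllReduce. Not the real onnxruntime C++ implementation.

def all_reduce_mean(grads_per_replica):
    """Average gradients elementwise across data-parallel replicas
    (an in-process stand-in for NCCL AllReduce)."""
    n = len(grads_per_replica)
    return [sum(g) / n for g in zip(*grads_per_replica)]

def train_step(params, micro_batches, grad_fn, lr, accumulation_steps):
    """Accumulate gradients over micro-batches, then apply one SGD update.
    grad_fn(params, batch) plays the role of the forward + backward pass."""
    accum = [0.0] * len(params)
    for batch in micro_batches[:accumulation_steps]:
        grads = grad_fn(params, batch)               # forward + backward
        accum = [a + g for a, g in zip(accum, grads)]
    # In multi-replica training, all_reduce_mean would run here on `accum`
    # before the optimizer applies the update.
    return [p - lr * (a / accumulation_steps) for p, a in zip(params, accum)]
```

The real loop substitutes an optimizer such as Adam or Lamb for the plain SGD update shown in the last line.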
Pipeline Parallel Execution
When pipeline_parallel_size > 1, the execution pattern changes:
- The model is partitioned across pipeline stages, with each rank executing one stage.
- The PipelineScheduler manages the micro-batch scheduling across pipeline stages.
- Forward and backward passes are interleaved across micro-batches to improve pipeline utilization.
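The benefit of interleaving micro-batches can be quantified with a simplified, forward-only pipeline model (this is a back-of-the-envelope sketch, not onnxruntime's PipelineScheduler): with S stages and M micro-batches, each stage does M units of work over M + S - 1 clock ticks, so utilization rises as M grows.

```python
# Toy utilization model for a pipelined schedule (forward-only view).
# Assumes equal-cost stages and micro-batches; the real PipelineScheduler
# also interleaves backward passes.

def pipeline_utilization(num_stages, num_micro_batches):
    """Fraction of ticks each stage is busy: M units of work spread over
    the M + S - 1 ticks it takes micro-batches to drain the pipeline."""
    s, m = num_stages, num_micro_batches
    return m / (m + s - 1)
```

For example, 4 stages with a single micro-batch leave each stage busy only 25% of the time, while 16 micro-batches raise that to about 84%, which is why micro-batch interleaving matters.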
Periodic Operations
The training loop also performs periodic operations:
- Loss display: Logs the current training loss at display_loss_steps intervals.
- Evaluation: Runs the model on test data at evaluation_period intervals.
- Checkpointing: Saves model state at checkpoint_period intervals.
- TensorBoard logging: Writes scalar, histogram, and norm summaries if TensorBoard is configured.
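The periodic checks reduce to step-modulo tests against the configured intervals. A minimal sketch, using the parameter names from the list above as plain function arguments (the dispatch function itself is hypothetical):

```python
# Sketch of the periodic bookkeeping dispatch. The interval names mirror
# the configuration described above; the function itself is illustrative.

def periodic_actions(step, display_loss_steps, evaluation_period,
                     checkpoint_period):
    """Return which periodic operations fire at this (1-based) step."""
    actions = []
    if step % display_loss_steps == 0:
        actions.append("display_loss")
    if step % evaluation_period == 0:
        actions.append("evaluate")
    if step % checkpoint_period == 0:
        actions.append("checkpoint")
    return actions
```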
Session Modes
The loop operates in three session modes:
- ModelUpdateStep: Full forward/backward/update cycle.
- GradientAccumulateStep: Forward/backward only, accumulating gradients.
- EvaluateStep: Forward pass only for evaluation.
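The three modes differ only in which phases of the step they execute, which can be captured in a small dispatch table (an illustrative Python rendering; the enum values echo the mode names above but are not the C++ definitions):

```python
# Illustrative mapping from session mode to executed phases.
from enum import Enum

class SessionMode(Enum):
    MODEL_UPDATE_STEP = "ModelUpdateStep"
    GRADIENT_ACCUMULATE_STEP = "GradientAccumulateStep"
    EVALUATE_STEP = "EvaluateStep"

def phases(mode):
    """Which phases each mode executes, per the description above."""
    if mode is SessionMode.MODEL_UPDATE_STEP:
        return ["forward", "backward", "update"]
    if mode is SessionMode.GRADIENT_ACCUMULATE_STEP:
        return ["forward", "backward"]
    return ["forward"]
```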
Theoretical Basis
Data-parallel distributed training replicates the model on each GPU, computes gradients independently on different data, then averages gradients across all replicas before the parameter update. This maintains model consistency while increasing effective batch size.
The key theoretical properties:
- Gradient averaging: AllReduce computes the mean gradient across all replicas; for equal-sized shards of a mean-reduced loss, this equals the gradient computed on the union of all replicas' data.
- Linear scaling: The effective batch size scales linearly with the number of replicas: effective_batch = batch_size * data_parallel_size.
- Gradient accumulation: Accumulating gradients over K micro-batches before updating is equivalent to using a K-times larger batch, but with reduced memory requirements.
- Synchronous training: All replicas perform the same number of updates and see the same averaged gradients, maintaining model consistency.
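The gradient-averaging property can be checked numerically. The sketch below uses a one-parameter mean-squared-error loss (a toy setup chosen for illustration): averaging per-replica gradients over equal-sized shards gives exactly the gradient over the union of the data.

```python
# Numeric check of the gradient-averaging property for a mean-reduced
# squared-error loss with equal-sized data shards per replica.

def mse_grad(p, data):
    """d/dp of mean((p - x)^2) over data."""
    return sum(2 * (p - x) for x in data) / len(data)

p = 0.5
replica_data = [[1.0, 2.0], [3.0, 4.0]]          # equal-sized shards
avg_of_grads = sum(mse_grad(p, d) for d in replica_data) / len(replica_data)
grad_of_union = mse_grad(p, [x for shard in replica_data for x in shard])
# The two quantities agree exactly for equal shard sizes.
```

With unequal shard sizes the plain mean over replicas would weight shards unevenly, which is one reason data-parallel training keeps per-replica batch sizes uniform.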
Usage
The training loop is the core execution step:
- Initialize the TrainingRunner with parameters and environment.
- Create DataLoader instances for training and test data.
- Call TrainingRunner::Run() with the data loaders.
- The method returns once num_train_steps training steps have completed.
- The trained model parameters are available for export.