
Principle:Microsoft Onnxruntime Distributed Training Configuration

From Leeroopedia


  • Principle Name: Distributed_Training_Configuration
  • Overview: Configuration of training hyperparameters, model paths, optimizer selection, and distributed settings for large-scale training.
  • Category: API Doc
  • Domains: Distributed_Training, Training_Infrastructure
  • Source Repository: microsoft/onnxruntime
  • Last Updated: 2026-02-10

Overview

Configuration of training hyperparameters, model paths, optimizer selection, and distributed settings for large-scale training. The TrainingRunner::Parameters struct encapsulates the complete specification for a distributed training run, from model topology to parallelism strategy.

Description

The TrainingRunner::Parameters struct encapsulates all configuration for distributed training, organized into several categories:

Model and Data Paths

  • model_path: Path to the ONNX model file to train.
  • model_with_loss_func_path: Path to save the model after adding the loss function.
  • model_with_training_graph_path: Path to save the model after adding loss and backward graph.
  • train_data_dir: Directory containing training data shards in protobuf format.
  • test_data_dir: Directory containing evaluation/test data shards.
  • output_dir: Directory for training output files (trained model).

Training Hyperparameters

  • batch_size: Number of samples per training batch.
  • num_train_steps: Total number of training steps to execute.
  • learning_rate: Configured via lr_params (LearningRateParameters struct).
  • gradient_accumulation_steps: Number of forward/backward passes before a weight update.
  • display_loss_steps: Interval for logging loss values.
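A sketch of how these hyperparameters interact (the field names come from the lists on this page; the helper functions and the sample numbers are illustrative, not part of the API). Note that num_train_steps counts forward/backward passes, so the number of weight updates is smaller by a factor of gradient_accumulation_steps:

```cpp
#include <cstdint>

// Effective samples consumed per weight update: one micro-batch per
// forward/backward pass, accumulated over gradient_accumulation_steps,
// replicated across data_parallel_size workers.
int64_t EffectiveBatchSize(int64_t batch_size,
                           int64_t gradient_accumulation_steps,
                           int64_t data_parallel_size) {
  return batch_size * gradient_accumulation_steps * data_parallel_size;
}

// num_train_steps counts forward/backward passes, so the number of
// actual weight updates is num_train_steps / gradient_accumulation_steps.
int64_t NumWeightUpdates(int64_t num_train_steps,
                         int64_t gradient_accumulation_steps) {
  return num_train_steps / gradient_accumulation_steps;
}
```

For example, a micro-batch of 32 with 4 accumulation steps across 8 data-parallel replicas yields an effective batch of 1024 samples per update.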

Optimizer Configuration

  • training_optimizer_name: Optimizer algorithm to use; supported values are "Adam", "Lamb", and "SGDOptimizer" (default: "SGDOptimizer").
  • optimizer_attributes: Per-weight float attributes for the optimizer.
  • optimizer_int_attributes: Per-weight integer attributes for the optimizer.

Mixed Precision Settings

  • use_mixed_precision: Enable FP16 mixed precision training.
  • use_bfloat16: Use BF16 instead of FP16 for mixed precision.
  • loss_scale: Static loss scale value (0.0 enables dynamic loss scaling).
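The kind of dynamic loss scaling that loss_scale = 0.0 selects can be sketched as follows: halve the scale when a gradient overflow is detected, and double it after a run of stable steps. The starting scale, growth factor, and window size below are common textbook choices, not onnxruntime's exact defaults:

```cpp
#include <cstdint>

// Illustrative dynamic loss scaler. Constants are assumptions chosen
// for the sketch, not values read from the onnxruntime source.
class DynamicLossScaler {
 public:
  float Scale() const { return scale_; }
  void Update(bool overflow_detected) {
    if (overflow_detected) {
      scale_ /= 2.0f;  // back off when FP16 gradients overflow
      stable_steps_ = 0;
    } else if (++stable_steps_ >= window_) {
      scale_ *= 2.0f;  // grow again after a stable window
      stable_steps_ = 0;
    }
  }

 private:
  float scale_ = 65536.0f;   // common FP16 starting scale (2^16)
  int64_t stable_steps_ = 0;
  const int64_t window_ = 2000;
};
```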

Distributed Parallelism

  • data_parallel_size: Number of data-parallel replicas.
  • horizontal_parallel_size: Size of horizontal (tensor) model parallelism.
  • pipeline_parallel_size: Number of pipeline parallel stages (1 means disabled).
  • use_nccl: Enable NCCL for GPU-to-GPU communication.
  • enable_adasum: Use Adasum for allreduce operations.

Checkpoint Configuration

  • checkpoints_dir: Directory for saving/loading checkpoint files.
  • checkpoint_period: Interval in weight-update steps between checkpoints.
  • max_num_checkpoints: Maximum number of checkpoint files to retain.
  • checkpoint_to_load_path: Specific checkpoint to resume from.
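The interaction of checkpoint_period and max_num_checkpoints can be sketched as a rolling retention policy: save every checkpoint_period weight updates and evict the oldest file once the cap is exceeded. The eviction order here is illustrative; onnxruntime's actual checkpoint bookkeeping may differ:

```cpp
#include <cstdint>
#include <deque>

// Returns the weight-update steps whose checkpoints survive a rolling
// retention policy: one checkpoint every `checkpoint_period` updates,
// at most `max_num_checkpoints` kept, oldest evicted first.
std::deque<int64_t> RetainedCheckpoints(int64_t total_weight_updates,
                                        int64_t checkpoint_period,
                                        int64_t max_num_checkpoints) {
  std::deque<int64_t> retained;
  for (int64_t step = checkpoint_period; step <= total_weight_updates;
       step += checkpoint_period) {
    retained.push_back(step);
    if (static_cast<int64_t>(retained.size()) > max_num_checkpoints)
      retained.pop_front();  // evict the oldest checkpoint
  }
  return retained;
}
```

With 1000 weight updates, a period of 100, and a cap of 3, only the checkpoints at steps 800, 900, and 1000 remain.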

TensorBoard Configuration

  • log_dir: Path to write TensorBoard events.
  • scalar_names: Names of scalar values to log.
  • histogram_names: Names of histograms to log.

Theoretical Basis

Distributed training configuration must specify the parallelism strategy (data parallel, pipeline parallel, model parallel), optimizer algorithm, learning rate schedule, and fault tolerance settings (checkpointing). The configuration defines a complete training experiment:

  • Data parallelism replicates the model across GPUs, each processing different data, then synchronizes gradients.
  • Pipeline parallelism partitions the model across GPUs sequentially, enabling training of models too large for a single GPU's memory.
  • Horizontal (tensor) parallelism splits individual layers across GPUs for very wide models.

These three parallelism dimensions can be combined (3D parallelism) and their sizes must satisfy: data_parallel_size * horizontal_parallel_size * pipeline_parallel_size = world_size.
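The world-size constraint above translates directly into a validation check. The field names follow the Parameters struct; the helper itself is an illustrative sketch, not an onnxruntime API:

```cpp
#include <cstdint>

// The three parallelism sizes must each be at least 1 and must exactly
// factor the MPI world size: dp * hp * pp == world_size.
bool ValidParallelismConfig(int64_t data_parallel_size,
                            int64_t horizontal_parallel_size,
                            int64_t pipeline_parallel_size,
                            int64_t world_size) {
  if (data_parallel_size < 1 || horizontal_parallel_size < 1 ||
      pipeline_parallel_size < 1)
    return false;
  return data_parallel_size * horizontal_parallel_size *
             pipeline_parallel_size == world_size;
}
```

For example, 4-way data parallelism combined with 2-way tensor and 2-way pipeline parallelism requires exactly 16 ranks.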

The optimizer choice (Adam, Lamb, SGD) determines the update rule and memory requirements (Adam and Lamb maintain first and second moment estimates per parameter). Mixed precision configuration reduces memory and computation by using FP16/BF16 for forward/backward passes while maintaining FP32 master weights.
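The memory consequences of the optimizer and precision choices can be made concrete with textbook per-parameter accounting (these byte counts are the standard arithmetic, not measured onnxruntime numbers):

```cpp
#include <cstdint>
#include <string>

// Extra optimizer-state bytes per model parameter, assuming FP32 state.
// Adam and Lamb each track first and second moment estimates
// (2 x 4 bytes); plain SGD keeps no per-parameter state. Mixed
// precision adds a 4-byte FP32 master copy of each FP16/BF16 weight.
int64_t OptimizerStateBytesPerParam(const std::string& optimizer,
                                    bool use_mixed_precision) {
  int64_t bytes = 0;
  if (optimizer == "Adam" || optimizer == "Lamb")
    bytes += 2 * 4;  // first + second moment, FP32 each
  if (use_mixed_precision)
    bytes += 4;  // FP32 master weight alongside the half-precision copy
  return bytes;
}
```

So mixed-precision Adam carries 12 extra bytes per parameter of optimizer state, while plain FP32 SGD carries none.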

Usage

Configuration is the first step in setting up a distributed training pipeline:

  1. Populate a TrainingRunner::Parameters struct with model paths, hyperparameters, optimizer settings, and parallelism configuration.
  2. Validate that parallelism dimensions are consistent with the total MPI world size.
  3. Ensure num_train_steps is a multiple of gradient_accumulation_steps.
  4. Pass the parameters to the TrainingRunner constructor.
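Steps 1-3 can be sketched as follows. The struct below is a reduced stand-in containing only the fields discussed on this page; the real TrainingRunner::Parameters (declared in onnxruntime's training runner header) has more members and possibly different types, and the sample values are illustrative:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Reduced stand-in for TrainingRunner::Parameters, for illustration only.
struct Parameters {
  std::string model_path;
  std::string train_data_dir;
  std::string output_dir;
  int64_t batch_size = 1;
  int64_t num_train_steps = 1;
  int64_t gradient_accumulation_steps = 1;
  std::string training_optimizer_name = "SGDOptimizer";
  bool use_mixed_precision = false;
  int64_t data_parallel_size = 1;
  int64_t horizontal_parallel_size = 1;
  int64_t pipeline_parallel_size = 1;
};

// Populate the struct (step 1), then validate the parallelism product
// against the world size (step 2) and the step/accumulation ratio (step 3).
Parameters MakeParams(int64_t world_size) {
  Parameters p;
  p.model_path = "model.onnx";
  p.train_data_dir = "data/train";
  p.output_dir = "out";
  p.batch_size = 32;
  p.gradient_accumulation_steps = 4;
  p.num_train_steps = 100000;  // multiple of gradient_accumulation_steps
  p.training_optimizer_name = "Lamb";
  p.use_mixed_precision = true;
  p.data_parallel_size = world_size;  // pure data parallelism
  assert(p.data_parallel_size * p.horizontal_parallel_size *
             p.pipeline_parallel_size == world_size);
  assert(p.num_train_steps % p.gradient_accumulation_steps == 0);
  return p;
}
```

Step 4 would then pass the validated struct to the TrainingRunner constructor.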
