
Principle:Microsoft Onnxruntime Distributed Training Configuration

From Leeroopedia


  • Principle Name: Distributed_Training_Configuration
  • Overview: Configuration of training hyperparameters, model paths, optimizer selection, and distributed settings for large-scale training.
  • Category: API Doc
  • Domains: Distributed_Training, Training_Infrastructure
  • Source Repository: microsoft/onnxruntime
  • Last Updated: 2026-02-10

Overview

Configuration of training hyperparameters, model paths, optimizer selection, and distributed settings for large-scale training. The TrainingRunner::Parameters struct encapsulates the complete specification for a distributed training run, from model topology to parallelism strategy.

Description

The TrainingRunner::Parameters struct encapsulates all configuration for distributed training, organized into several categories:

Model and Data Paths

  • model_path: Path to the ONNX model file to train.
  • model_with_loss_func_path: Path to save the model after adding the loss function.
  • model_with_training_graph_path: Path to save the model after adding loss and backward graph.
  • train_data_dir: Directory containing training data shards in protobuf format.
  • test_data_dir: Directory containing evaluation/test data shards.
  • output_dir: Directory for training output files (trained model).

Training Hyperparameters

  • batch_size: Number of samples per training batch.
  • num_train_steps: Total number of training steps to execute.
  • learning_rate: Configured via lr_params (LearningRateParameters struct).
  • gradient_accumulation_steps: Number of forward/backward passes before a weight update.
  • display_loss_steps: Interval for logging loss values.
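A sketch of how these hyperparameters interact (the field names come from the lists on this page; the helper functions and the sample numbers are illustrative, not part of the API). Note that num_train_steps counts forward/backward passes, so the number of weight updates is smaller by a factor of gradient_accumulation_steps:

```cpp
#include <cstdint>

// Effective samples consumed per weight update: one micro-batch per
// forward/backward pass, accumulated over gradient_accumulation_steps,
// replicated across data_parallel_size workers.
int64_t EffectiveBatchSize(int64_t batch_size,
                           int64_t gradient_accumulation_steps,
                           int64_t data_parallel_size) {
  return batch_size * gradient_accumulation_steps * data_parallel_size;
}

// num_train_steps counts forward/backward passes, so the number of
// actual weight updates is num_train_steps / gradient_accumulation_steps.
int64_t NumWeightUpdates(int64_t num_train_steps,
                         int64_t gradient_accumulation_steps) {
  return num_train_steps / gradient_accumulation_steps;
}
```

For example, a micro-batch of 32 with 4 accumulation steps across 8 data-parallel replicas yields an effective batch of 1024 samples per update.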

Optimizer Configuration

  • training_optimizer_name: Optimizer algorithm to use; supported values are "Adam", "Lamb", and "SGDOptimizer" (default: "SGDOptimizer").
  • optimizer_attributes: Per-weight float attributes for the optimizer.
  • optimizer_int_attributes: Per-weight integer attributes for the optimizer.

Mixed Precision Settings

  • use_mixed_precision: Enable FP16 mixed precision training.
  • use_bfloat16: Use BF16 instead of FP16 for mixed precision.
  • loss_scale: Static loss scale value (0.0 enables dynamic loss scaling).
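The kind of dynamic loss scaling that loss_scale = 0.0 selects can be sketched as follows: halve the scale when a gradient overflow is detected, and double it after a run of stable steps. The starting scale, growth factor, and window size below are common textbook choices, not onnxruntime's exact defaults:

```cpp
#include <cstdint>

// Illustrative dynamic loss scaler. Constants are assumptions chosen
// for the sketch, not values read from the onnxruntime source.
class DynamicLossScaler {
 public:
  float Scale() const { return scale_; }
  void Update(bool overflow_detected) {
    if (overflow_detected) {
      scale_ /= 2.0f;  // back off when FP16 gradients overflow
      stable_steps_ = 0;
    } else if (++stable_steps_ >= window_) {
      scale_ *= 2.0f;  // grow again after a stable window
      stable_steps_ = 0;
    }
  }

 private:
  float scale_ = 65536.0f;   // common FP16 starting scale (2^16)
  int64_t stable_steps_ = 0;
  const int64_t window_ = 2000;
};
```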

Distributed Parallelism

  • data_parallel_size: Number of data-parallel replicas.
  • horizontal_parallel_size: Size of horizontal (tensor) model parallelism.
  • pipeline_parallel_size: Number of pipeline parallel stages (1 means disabled).
  • use_nccl: Enable NCCL for GPU-to-GPU communication.
  • enable_adasum: Use Adasum for allreduce operations.

Checkpoint Configuration

  • checkpoints_dir: Directory for saving/loading checkpoint files.
  • checkpoint_period: Interval in weight-update steps between checkpoints.
  • max_num_checkpoints: Maximum number of checkpoint files to retain.
  • checkpoint_to_load_path: Specific checkpoint to resume from.
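The interaction of checkpoint_period and max_num_checkpoints can be sketched as a rolling retention policy: save every checkpoint_period weight updates and evict the oldest file once the cap is exceeded. The eviction order here is illustrative; onnxruntime's actual checkpoint bookkeeping may differ:

```cpp
#include <cstdint>
#include <deque>

// Returns the weight-update steps whose checkpoints survive a rolling
// retention policy: one checkpoint every `checkpoint_period` updates,
// at most `max_num_checkpoints` kept, oldest evicted first.
std::deque<int64_t> RetainedCheckpoints(int64_t total_weight_updates,
                                        int64_t checkpoint_period,
                                        int64_t max_num_checkpoints) {
  std::deque<int64_t> retained;
  for (int64_t step = checkpoint_period; step <= total_weight_updates;
       step += checkpoint_period) {
    retained.push_back(step);
    if (static_cast<int64_t>(retained.size()) > max_num_checkpoints)
      retained.pop_front();  // evict the oldest checkpoint
  }
  return retained;
}
```

With 1000 weight updates, a period of 100, and a cap of 3, only the checkpoints at steps 800, 900, and 1000 remain.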

TensorBoard Configuration

  • log_dir: Path to write TensorBoard events.
  • scalar_names: Names of scalar values to log.
  • histogram_names: Names of histograms to log.

Theoretical Basis

Distributed training configuration must specify the parallelism strategy (data parallel, pipeline parallel, model parallel), optimizer algorithm, learning rate schedule, and fault tolerance settings (checkpointing). The configuration defines a complete training experiment:

  • Data parallelism replicates the model across GPUs, each processing different data, then synchronizes gradients.
  • Pipeline parallelism partitions the model across GPUs sequentially, enabling training of models too large for a single GPU's memory.
  • Horizontal (tensor) parallelism splits individual layers across GPUs for very wide models.

These three parallelism dimensions can be combined (3D parallelism) and their sizes must satisfy: data_parallel_size * horizontal_parallel_size * pipeline_parallel_size = world_size.
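The world-size constraint above translates directly into a validation check. The field names follow the Parameters struct; the helper itself is an illustrative sketch, not an onnxruntime API:

```cpp
#include <cstdint>

// The three parallelism sizes must each be at least 1 and must exactly
// factor the MPI world size: dp * hp * pp == world_size.
bool ValidParallelismConfig(int64_t data_parallel_size,
                            int64_t horizontal_parallel_size,
                            int64_t pipeline_parallel_size,
                            int64_t world_size) {
  if (data_parallel_size < 1 || horizontal_parallel_size < 1 ||
      pipeline_parallel_size < 1)
    return false;
  return data_parallel_size * horizontal_parallel_size *
             pipeline_parallel_size == world_size;
}
```

For example, 4-way data parallelism combined with 2-way tensor and 2-way pipeline parallelism requires exactly 16 ranks.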

The optimizer choice (Adam, Lamb, SGD) determines the update rule and memory requirements (Adam and Lamb maintain first and second moment estimates per parameter). Mixed precision configuration reduces memory and computation by using FP16/BF16 for forward/backward passes while maintaining FP32 master weights.
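The memory consequences of the optimizer and precision choices can be made concrete with textbook per-parameter accounting (these byte counts are the standard arithmetic, not measured onnxruntime numbers):

```cpp
#include <cstdint>
#include <string>

// Extra optimizer-state bytes per model parameter, assuming FP32 state.
// Adam and Lamb each track first and second moment estimates
// (2 x 4 bytes); plain SGD keeps no per-parameter state. Mixed
// precision adds a 4-byte FP32 master copy of each FP16/BF16 weight.
int64_t OptimizerStateBytesPerParam(const std::string& optimizer,
                                    bool use_mixed_precision) {
  int64_t bytes = 0;
  if (optimizer == "Adam" || optimizer == "Lamb")
    bytes += 2 * 4;  // first + second moment, FP32 each
  if (use_mixed_precision)
    bytes += 4;  // FP32 master weight alongside the half-precision copy
  return bytes;
}
```

So mixed-precision Adam carries 12 extra bytes per parameter of optimizer state, while plain FP32 SGD carries none.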

Usage

Configuration is the first step in setting up a distributed training pipeline:

  1. Populate a TrainingRunner::Parameters struct with model paths, hyperparameters, optimizer settings, and parallelism configuration.
  2. Validate that parallelism dimensions are consistent with the total MPI world size.
  3. Ensure num_train_steps is a multiple of gradient_accumulation_steps.
  4. Pass the parameters to the TrainingRunner constructor.
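Steps 1-3 can be sketched as follows. The struct below is a reduced stand-in containing only the fields discussed on this page; the real TrainingRunner::Parameters (declared in onnxruntime's training runner header) has more members and possibly different types, and the sample values are illustrative:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Reduced stand-in for TrainingRunner::Parameters, for illustration only.
struct Parameters {
  std::string model_path;
  std::string train_data_dir;
  std::string output_dir;
  int64_t batch_size = 1;
  int64_t num_train_steps = 1;
  int64_t gradient_accumulation_steps = 1;
  std::string training_optimizer_name = "SGDOptimizer";
  bool use_mixed_precision = false;
  int64_t data_parallel_size = 1;
  int64_t horizontal_parallel_size = 1;
  int64_t pipeline_parallel_size = 1;
};

// Populate the struct (step 1), then validate the parallelism product
// against the world size (step 2) and the step/accumulation ratio (step 3).
Parameters MakeParams(int64_t world_size) {
  Parameters p;
  p.model_path = "model.onnx";
  p.train_data_dir = "data/train";
  p.output_dir = "out";
  p.batch_size = 32;
  p.gradient_accumulation_steps = 4;
  p.num_train_steps = 100000;  // multiple of gradient_accumulation_steps
  p.training_optimizer_name = "Lamb";
  p.use_mixed_precision = true;
  p.data_parallel_size = world_size;  // pure data parallelism
  assert(p.data_parallel_size * p.horizontal_parallel_size *
             p.pipeline_parallel_size == world_size);
  assert(p.num_train_steps % p.gradient_accumulation_steps == 0);
  return p;
}
```

Step 4 would then pass the validated struct to the TrainingRunner constructor.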
