Principle: Microsoft ONNX Runtime Distributed Training Configuration
| Field | Value |
|---|---|
| Principle Name | Distributed_Training_Configuration |
| Overview | Configuration of training hyperparameters, model paths, optimizer selection, and distributed settings for large-scale training. |
| Category | API Doc |
| Domains | Distributed_Training, Training_Infrastructure |
| Source Repository | microsoft/onnxruntime |
| Last Updated | 2026-02-10 |
Overview
Configuration of training hyperparameters, model paths, optimizer selection, and distributed settings for large-scale training. The TrainingRunner::Parameters struct encapsulates the complete specification for a distributed training run, from model topology to parallelism strategy.
Description
The TrainingRunner::Parameters struct encapsulates all configuration for distributed training, organized into several categories:
Model and Data Paths
- model_path: Path to the ONNX model file to train.
- model_with_loss_func_path: Path to save the model after adding the loss function.
- model_with_training_graph_path: Path to save the model after adding loss and backward graph.
- train_data_dir: Directory containing training data shards in protobuf format.
- test_data_dir: Directory containing evaluation/test data shards.
- output_dir: Directory for training output files (trained model).
Training Hyperparameters
- batch_size: Number of samples per training batch.
- num_train_steps: Total number of training steps to execute.
- lr_params: Learning rate configuration (a LearningRateParameters struct).
- gradient_accumulation_steps: Number of forward/backward passes before a weight update.
- display_loss_steps: Interval for logging loss values.
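The interaction between batch_size and gradient_accumulation_steps determines the effective batch size per weight update. A minimal sketch of that arithmetic, assuming standard gradient-accumulation semantics (the helper name `EffectiveBatchSize` is illustrative, not part of onnxruntime):

```cpp
#include <cstdint>

// Each replica accumulates `gradient_accumulation_steps` micro-batches of
// `batch_size` samples before the optimizer applies one update, and gradients
// are averaged across `data_parallel_size` replicas.
int64_t EffectiveBatchSize(int64_t batch_size,
                           int64_t gradient_accumulation_steps,
                           int64_t data_parallel_size) {
  return batch_size * gradient_accumulation_steps * data_parallel_size;
}
```

For example, a per-GPU batch of 32 with 4 accumulation steps across 8 data-parallel replicas yields an effective batch of 1024 samples per update.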
Optimizer Configuration
- training_optimizer_name: Selects the optimizer algorithm; supported values are "Adam", "Lamb", and "SGDOptimizer" (default: "SGDOptimizer").
- optimizer_attributes: Per-weight float attributes for the optimizer.
- optimizer_int_attributes: Per-weight integer attributes for the optimizer.
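Since training_optimizer_name accepts only a fixed set of strings, a configuration layer typically rejects anything else before the training graph is built. A hedged sketch of such a check (the helper is illustrative; the real validation lives inside onnxruntime):

```cpp
#include <set>
#include <string>

// Returns true when `name` is one of the optimizer strings the document
// lists as supported: "Adam", "Lamb", or "SGDOptimizer".
bool IsSupportedOptimizer(const std::string& name) {
  static const std::set<std::string> kSupported = {"Adam", "Lamb",
                                                   "SGDOptimizer"};
  return kSupported.count(name) > 0;
}
```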
Mixed Precision Settings
- use_mixed_precision: Enable FP16 mixed precision training.
- use_bfloat16: Use BF16 instead of FP16 for mixed precision.
- loss_scale: Static loss scale value (0.0 enables dynamic loss scaling).
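Setting loss_scale to 0.0 selects dynamic loss scaling, which adapts the scale to gradient magnitudes at runtime. A minimal sketch of the usual policy, grow after a stable run, back off on overflow; the constants here are illustrative defaults, not onnxruntime's exact values:

```cpp
// Dynamic loss-scaling sketch: the scale multiplies the loss before the
// backward pass so small FP16 gradients do not underflow; it is halved
// whenever an overflow (inf/nan gradient) is detected.
struct DynamicLossScaler {
  float scale = 65536.0f;     // initial scale (illustrative)
  int stable_steps = 0;
  int growth_interval = 2000; // steps without overflow before growing

  void Update(bool found_overflow) {
    if (found_overflow) {
      scale /= 2.0f;          // back off; this step's update is skipped
      stable_steps = 0;
    } else if (++stable_steps >= growth_interval) {
      scale *= 2.0f;          // cautiously grow after a stable run
      stable_steps = 0;
    }
  }
};
```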
Distributed Parallelism
- data_parallel_size: Number of data-parallel replicas.
- horizontal_parallel_size: Size of horizontal (tensor) model parallelism.
- pipeline_parallel_size: Number of pipeline parallel stages (1 means disabled).
- use_nccl: Enable NCCL for GPU-to-GPU communication.
- enable_adasum: Use Adasum for allreduce operations.
Checkpoint Configuration
- checkpoints_dir: Directory for saving/loading checkpoint files.
- checkpoint_period: Interval in weight-update steps between checkpoints.
- max_num_checkpoints: Maximum number of checkpoint files to retain.
- checkpoint_to_load_path: Specific checkpoint to resume from.
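The retention policy implied by checkpoint_period and max_num_checkpoints can be sketched as: save every `period` weight-update steps and evict the oldest file once the cap is exceeded. The helpers below are illustrative, not onnxruntime's implementation:

```cpp
#include <cstdint>
#include <deque>
#include <string>

// True when the current weight-update step should produce a checkpoint.
bool ShouldCheckpoint(int64_t weight_update_step, int64_t period) {
  return period > 0 && weight_update_step % period == 0;
}

// Track saved checkpoint paths, keeping only the newest `max_to_keep`.
void RecordCheckpoint(std::deque<std::string>& kept, const std::string& path,
                      size_t max_to_keep) {
  kept.push_back(path);
  while (kept.size() > max_to_keep) kept.pop_front();  // evict oldest first
}
```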
TensorBoard Configuration
- log_dir: Path to write TensorBoard events.
- scalar_names: Names of scalar values to log.
- histogram_names: Names of histograms to log.
Theoretical Basis
Distributed training configuration must specify the parallelism strategy (data parallel, pipeline parallel, model parallel), optimizer algorithm, learning rate schedule, and fault tolerance settings (checkpointing). The configuration defines a complete training experiment:
- Data parallelism replicates the model across GPUs, each processing different data, then synchronizes gradients.
- Pipeline parallelism partitions the model across GPUs sequentially, enabling training of models too large for a single GPU's memory.
- Horizontal (tensor) parallelism splits individual layers across GPUs for very wide models.
These three parallelism dimensions can be combined (3D parallelism) and their sizes must satisfy: data_parallel_size * horizontal_parallel_size * pipeline_parallel_size = world_size.
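The 3D-parallelism constraint above can be checked before launch; a mismatch between the configured sizes and the MPI world size is a configuration error rather than something to discover mid-run. A one-line sketch of that check (helper name is illustrative):

```cpp
// The product of the three parallelism dimensions must equal the total
// number of ranks participating in training.
bool ValidParallelism(int data_parallel, int horizontal_parallel,
                      int pipeline_parallel, int world_size) {
  return data_parallel * horizontal_parallel * pipeline_parallel == world_size;
}
```

For instance, 4-way data parallelism with 2-way tensor parallelism and 2 pipeline stages requires exactly 16 ranks.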
The optimizer choice (Adam, Lamb, SGD) determines the update rule and memory requirements (Adam and Lamb maintain first and second moment estimates per parameter). Mixed precision configuration reduces memory and computation by using FP16/BF16 for forward/backward passes while maintaining FP32 master weights.
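The memory trade-off between optimizers can be made concrete with a back-of-the-envelope per-parameter estimate, assuming FP32 master weights, FP32 moment buffers, and an extra FP16 copy under mixed precision; the byte counts are illustrative, not measured from onnxruntime:

```cpp
#include <cstddef>
#include <string>

// Rough bytes of training state per model parameter under the stated
// assumptions (FP32 master weight, optional FP16 working copy, and two
// FP32 moment buffers for Adam/Lamb). Activations and gradients excluded.
size_t BytesPerParam(const std::string& optimizer, bool mixed_precision) {
  size_t bytes = 4;                 // FP32 master weight
  if (mixed_precision) bytes += 2;  // FP16 copy used in forward/backward
  if (optimizer == "Adam" || optimizer == "Lamb")
    bytes += 8;                     // first + second moment, FP32 each
  return bytes;
}
```

Under these assumptions, Adam with mixed precision costs 14 bytes per parameter versus 4 for plain FP32 SGD, which is why optimizer state dominates memory for large models.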
Usage
Configuration is the first step in setting up a distributed training pipeline:
- Populate a TrainingRunner::Parameters struct with model paths, hyperparameters, optimizer settings, and parallelism configuration.
- Validate that parallelism dimensions are consistent with the total MPI world size.
- Ensure num_train_steps is a multiple of gradient_accumulation_steps.
- Pass the parameters to the TrainingRunner constructor.
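The steps above can be sketched end to end. The struct below is an illustrative stand-in mirroring a small subset of TrainingRunner::Parameters fields (the real struct lives in onnxruntime's training runner and has many more), and `Validate` combines the two consistency checks the Usage steps call for:

```cpp
#include <cstdint>
#include <string>

// Illustrative subset of the configuration fields described in this document.
struct Params {
  std::string model_path;
  int64_t batch_size = 1;
  int64_t num_train_steps = 1;
  int64_t gradient_accumulation_steps = 1;
  std::string training_optimizer_name = "SGDOptimizer";
  int data_parallel_size = 1;
  int horizontal_parallel_size = 1;
  int pipeline_parallel_size = 1;
};

// Check the constraints from the Usage section: parallelism dimensions must
// multiply to the MPI world size, and num_train_steps must be a multiple of
// gradient_accumulation_steps.
bool Validate(const Params& p, int world_size) {
  bool parallel_ok = p.data_parallel_size * p.horizontal_parallel_size *
                         p.pipeline_parallel_size == world_size;
  bool steps_ok = p.num_train_steps % p.gradient_accumulation_steps == 0;
  return parallel_ok && steps_ok && !p.model_path.empty();
}
```

In the real pipeline, a struct populated and validated this way would then be handed to the TrainingRunner constructor.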