
Implementation:Microsoft Onnxruntime TrainingRunner Parameters

From Leeroopedia


Implementation Name: TrainingRunner_Parameters
Overview: Configuration struct encapsulating all training hyperparameters, model paths, optimizer selection, and distributed parallelism settings.
Type: API Doc
Language: C++
Domains: Distributed_Training, Training_Infrastructure
Source Repository: microsoft/onnxruntime
Last Updated: 2026-02-10

Overview

Configuration struct encapsulating all training hyperparameters, model paths, optimizer selection, and distributed parallelism settings. The TrainingRunner::Parameters struct is the single configuration object that drives the entire distributed training pipeline.

API

struct TrainingRunner::Parameters {
    // Model and data paths
    std::string model_name;
    PathString model_path;
    PathString train_data_dir;
    PathString test_data_dir;
    PathString output_dir;

    // Training hyperparameters
    size_t batch_size;
    size_t num_train_steps;
    LearningRateParameters lr_params;
    int gradient_accumulation_steps = 1;

    // Optimizer
    std::string training_optimizer_name = "SGDOptimizer";

    // Mixed precision
    bool use_mixed_precision = false;
    bool use_bfloat16 = false;
    float loss_scale = 1.0f;

    // Distributed parallelism
    int data_parallel_size = 1;
    int horizontal_parallel_size = 1;
    int pipeline_parallel_size = 1;
    bool use_nccl = false;

    // Checkpointing
    PathString checkpoints_dir;
    size_t checkpoint_period = 0;
    size_t max_num_checkpoints = 1;

    // TensorBoard
    PathString log_dir;
    VectorString scalar_names;
    VectorString histogram_names;
    // ... additional fields
};

Key Fields

model_path (PathString, required): Path to the ONNX model file
train_data_dir (PathString, required): Directory containing training data in .pb format
test_data_dir (PathString): Directory containing evaluation data
batch_size (size_t, required): Number of samples per training batch
num_train_steps (size_t, required): Total number of training steps
lr_params (LearningRateParameters): Learning rate configuration with feed name and schedule
training_optimizer_name (string, default "SGDOptimizer"): Optimizer: "Adam", "Lamb", or "SGDOptimizer"
gradient_accumulation_steps (int, default 1): Micro-batches per weight update
use_mixed_precision (bool, default false): Enable FP16/BF16 mixed precision training
use_bfloat16 (bool, default false): Use BF16 instead of FP16 for mixed precision
loss_scale (float, default 1.0f): Static loss scale; 0.0 enables dynamic scaling
data_parallel_size (int, default 1): Number of data-parallel replicas
horizontal_parallel_size (int, default 1): Tensor (horizontal) model parallelism size
pipeline_parallel_size (int, default 1): Number of pipeline stages (1 = disabled)
use_nccl (bool, default false): Enable NCCL for GPU collective communication
checkpoints_dir (PathString, default empty): Directory for checkpoint files (empty = no checkpointing)
checkpoint_period (size_t, default 0): Steps between checkpoints (0 = no saving)
max_num_checkpoints (size_t, default 1): Maximum number of retained checkpoint files
log_dir (PathString, default empty): TensorBoard log directory (empty = no TensorBoard)
gpu_mem_limit_in_gb (float, default -1.0f): GPU memory limit in GB (-1.0 = use all available)
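The accumulation and parallelism fields combine multiplicatively. The following helpers are illustrative only (they are not part of the onnxruntime API) and sketch that arithmetic:

```cpp
#include <cstddef>

// Illustrative helpers (not part of the onnxruntime API) showing how the
// fields above combine.

// Samples consumed per weight update across the whole job:
// micro-batch size x accumulation steps x data-parallel replicas.
std::size_t EffectiveGlobalBatchSize(std::size_t batch_size,
                                     int gradient_accumulation_steps,
                                     int data_parallel_size) {
  return batch_size * static_cast<std::size_t>(gradient_accumulation_steps) *
         static_cast<std::size_t>(data_parallel_size);
}

// Total number of ranks is the product of the three parallelism sizes.
int WorldSize(int data_parallel_size, int horizontal_parallel_size,
              int pipeline_parallel_size) {
  return data_parallel_size * horizontal_parallel_size *
         pipeline_parallel_size;
}
```

For example, with batch_size = 32, gradient_accumulation_steps = 4, and data_parallel_size = 4, each optimizer step consumes 512 samples.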

I/O Contract

Input: Configuration values (various types): model paths, hyperparameters, parallelism settings, checkpoint configuration
Output: Parameters struct (TrainingRunner::Parameters): fully configured struct passed to the TrainingRunner constructor

Usage Examples

Basic Configuration

TrainingRunner::Parameters params;
params.model_name = "gpt2";
params.model_path = ORT_TSTR("model.onnx");
params.train_data_dir = ORT_TSTR("/data/train/");
params.test_data_dir = ORT_TSTR("/data/test/");
params.output_dir = ORT_TSTR("/output/");

params.batch_size = 32;
params.num_train_steps = 10000;
params.training_optimizer_name = "Adam";
params.gradient_accumulation_steps = 4;

Distributed Configuration with NCCL

params.data_parallel_size = 4;
params.horizontal_parallel_size = 1;
params.pipeline_parallel_size = 1;
params.use_nccl = true;

Mixed Precision with Checkpointing

params.use_mixed_precision = true;
params.loss_scale = 0.0f;  // dynamic loss scaling

params.checkpoints_dir = ORT_TSTR("/checkpoints/");
params.checkpoint_period = 1000;
params.max_num_checkpoints = 5;
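Conversely, any nonzero loss_scale fixes the scale statically instead of adapting it; powers of two are customary so the scaling is exact in FP16:

```cpp
// Static loss scaling: any nonzero value disables dynamic scaling.
params.use_mixed_precision = true;
params.loss_scale = 1024.0f;
```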

TensorBoard Logging

params.log_dir = ORT_TSTR("/logs/tensorboard/");
params.scalar_names = {"loss", "learning_rate"};
params.histogram_names = {"weights", "gradients"};

Key Details

  • num_train_steps must be a multiple of gradient_accumulation_steps (enforced by a constructor assertion).
  • DeepSpeed ZeRO partitioning (deepspeed_zero.stage != 0) requires use_nccl = true.
  • The weights_to_train and weights_not_to_train sets are mutually exclusive.
  • EnableTensorboard() returns true only when log_dir is set, is_perf_test is false, and the current rank is 0.
  • UseCuda() checks whether a CUDA execution provider has been registered in the providers map.
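The first two constraints above can be encoded as a standalone check. This is a sketch: the struct and field names mirror TrainingRunner::Parameters, but ParamsView and ValidateParams are hypothetical, not part of the onnxruntime API, which enforces these rules internally.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical mirror of the relevant TrainingRunner::Parameters fields.
struct ParamsView {
  std::size_t num_train_steps;
  int gradient_accumulation_steps;
  int deepspeed_zero_stage;  // 0 = ZeRO partitioning disabled
  bool use_nccl;
};

// Illustrative validation of the constraints listed above; onnxruntime
// enforces them via constructor assertions, not via this function.
void ValidateParams(const ParamsView& p) {
  if (p.gradient_accumulation_steps < 1 ||
      p.num_train_steps %
              static_cast<std::size_t>(p.gradient_accumulation_steps) !=
          0) {
    throw std::invalid_argument(
        "num_train_steps must be a multiple of gradient_accumulation_steps");
  }
  if (p.deepspeed_zero_stage != 0 && !p.use_nccl) {
    throw std::invalid_argument(
        "DeepSpeed ZeRO partitioning requires use_nccl = true");
  }
}
```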
