Principle: Microsoft ONNX Runtime Distributed Runtime Initialization
| Field | Value |
|---|---|
| Principle Name | Distributed_Runtime_Initialization |
| Overview | Initialization of the distributed training runtime including MPI processes, NCCL communicators, and CUDA execution providers. |
| Category | API Doc |
| Domains | Distributed_Training, Training_Infrastructure |
| Source Repository | microsoft/onnxruntime |
| Last Updated | 2026-02-10 |
Overview
Initialization of the distributed training runtime including MPI processes, NCCL communicators, and CUDA execution providers. The TrainingRunner constructor and Initialize() method orchestrate the multi-process setup required for distributed model training across GPUs and nodes.
Description
Runtime initialization sets up the multi-process training environment through several coordinated steps:
- MPI context: The MPIContext singleton provides process coordination, exposing world rank, world size, local rank, and local size. MPI is used for process management and communication between nodes.
- NCCL communicators: When use_nccl is enabled, NCCL (NVIDIA Collective Communications Library) is configured for optimized GPU-to-GPU gradient synchronization operations like AllReduce.
- CUDA execution providers: GPU computation is enabled by registering CUDA execution providers with the training session. The providers map in the parameters struct controls which execution providers are activated.
- ORT Environment: The ONNX Runtime Environment provides logging configuration and thread management for the training process.
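The MPI context accessors described above can be sketched as a singleton. This is a hypothetical stand-in, not the actual onnxruntime class: the real MPIContext wraps MPI calls such as MPI_Comm_rank and MPI_Comm_size, whereas the values here are hard-coded for illustration.

```cpp
// Hypothetical sketch of an MPIContext-style singleton exposing the four
// rank/size accessors described above. In a real run these values come
// from MPI at process launch; they are fixed here for illustration only.
class MPIContext {
 public:
  static const MPIContext& GetInstance() {
    static MPIContext instance;  // constructed once, shared by all callers
    return instance;
  }
  int GetWorldRank() const { return world_rank_; }  // rank across all nodes
  int GetWorldSize() const { return world_size_; }  // total process count
  int GetLocalRank() const { return local_rank_; }  // rank within this node
  int GetLocalSize() const { return local_size_; }  // processes on this node
 private:
  MPIContext() = default;
  int world_rank_ = 0;
  int world_size_ = 4;
  int local_rank_ = 0;
  int local_size_ = 4;
};
```

The singleton pattern matters here: every component (NCCL setup, execution provider selection, pipeline stage loading) must observe the same rank assignment, so the context is created once and queried everywhere.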
The initialization sequence in the TrainingRunner is:
- Constructor: Validates parameters (model path, optimizer name, gradient accumulation steps). Enforces that DeepSpeed ZeRO requires NCCL.
- Initialize(): Loads the model (either full model or pre-partitioned pipeline stage based on MPI rank), configures the TrainingSession with distributed settings, optimizer, mixed precision, loss function, TensorBoard, and graph transformers. Registers execution providers. Initializes the session. Sets up checkpointing and loads any existing checkpoint.
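The constructor-time validation can be sketched as follows. Field names mirror the prose; the real onnxruntime code reports failures through a Status type rather than the bool used here, and `deepspeed_zero_stage` is a simplified stand-in for the actual ZeRO configuration.

```cpp
#include <string>

// Hypothetical simplification of the TrainingRunner parameter struct.
struct Parameters {
  std::string model_path;
  std::string optimizer_name;
  int gradient_accumulation_steps = 1;
  bool use_nccl = false;
  int deepspeed_zero_stage = 0;  // 0 = ZeRO disabled
};

// Sketch of the constructor checks described above; returns true if the
// parameters are valid (the real code returns a Status instead).
bool ValidateParameters(const Parameters& p) {
  if (p.model_path.empty() || p.optimizer_name.empty()) return false;
  if (p.gradient_accumulation_steps < 1) return false;
  // DeepSpeed ZeRO partitions optimizer state across ranks and relies on
  // NCCL collectives, so ZeRO without NCCL is rejected up front.
  if (p.deepspeed_zero_stage > 0 && !p.use_nccl) return false;
  return true;
}
```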
The distributed configuration within Initialize() explicitly maps MPI context values:
- world_rank and world_size from MPIContext::GetInstance()
- data_parallel_size, horizontal_parallel_size, and pipeline_parallel_size from parameters
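The mapping above can be sketched with a hypothetical config struct; the field names follow the prose, not necessarily the exact onnxruntime struct, and the factoring check reflects the common constraint that the three parallelism degrees multiply to the world size.

```cpp
// Hypothetical distributed-run configuration assembled inside Initialize().
struct DistributedRunConfig {
  int world_rank;
  int world_size;
  int data_parallel_size;
  int horizontal_parallel_size;
  int pipeline_parallel_size;
};

// world_rank/world_size would come from MPIContext::GetInstance();
// the three parallelism sizes come from the user-supplied parameters.
// Returns a zeroed config if the sizes do not factor the world size.
DistributedRunConfig BuildDistributedConfig(int world_rank, int world_size,
                                            int dp, int hp, int pp) {
  if (dp * hp * pp != world_size) return {};  // inconsistent partitioning
  return {world_rank, world_size, dp, hp, pp};
}
```

For example, 4 processes could be split as data_parallel_size = 2 and pipeline_parallel_size = 2 with horizontal_parallel_size = 1.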
Theoretical Basis
Distributed training requires coordinated initialization across processes (ranks). MPI provides the process management layer, NCCL provides optimized GPU collective operations, and CUDA provides the GPU execution context.
The initialization order is critical:
- MPI must initialize first to establish the process group and assign ranks.
- Model loading must happen before graph transformations (the pipeline partitioned model may differ per rank).
- Session configuration (optimizer, loss function, mixed precision) must happen before session initialization because these modify the graph.
- Execution provider registration must happen before session initialization because it determines where operators will run.
- Checkpoint loading must happen after session initialization because the session graph must be finalized before loading parameter values.
This ordering ensures that all processes have a consistent view of the training graph and can begin training from the same state.
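The ordering constraints above can be expressed as a small checked state machine. The stage names come from the list; the enforcement mechanism itself is a hypothetical sketch, not onnxruntime code.

```cpp
// Stages in their required order, as listed above.
enum class Stage {
  Start = 0,
  MpiInit,              // process group established, ranks assigned
  ModelLoaded,          // full model or per-rank pipeline stage
  SessionConfigured,    // optimizer, loss, mixed precision (graph-modifying)
  ProvidersRegistered,  // decides where operators run
  SessionInitialized,   // graph finalized
  CheckpointLoaded      // parameter values restored into finalized graph
};

class InitSequence {
 public:
  // Returns false (and stays put) if `next` is not the immediate
  // successor of the current stage, i.e. a step was skipped or reordered.
  bool Advance(Stage next) {
    if (static_cast<int>(next) != static_cast<int>(stage_) + 1) return false;
    stage_ = next;
    return true;
  }
  Stage Current() const { return stage_; }
 private:
  Stage stage_ = Stage::Start;
};
```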
Usage
Runtime initialization proceeds in the following sequence, after configuration is complete:
- Set up MPI (typically via mpirun or mpiexec at launch time).
- Create an ONNX Runtime Environment with logging configuration.
- Construct a TrainingRunner with the Parameters struct and Environment.
- Call Initialize() to complete the setup.
- The runner is now ready to execute the training loop.
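The steps above can be sketched end to end with minimal stand-ins. All types here are hypothetical simplifications shaped like the API described in this document, not the actual onnxruntime training classes.

```cpp
#include <string>
#include <utility>

// Stand-in for the ONNX Runtime Environment (logging, thread management).
struct Environment {
  std::string log_severity = "WARNING";
};

// Stand-in for the TrainingRunner parameter struct.
struct Parameters {
  std::string model_path = "model.onnx";
  std::string optimizer_name = "AdamOptimizer";
};

// Stand-in runner: construct with parameters and environment, then call
// Initialize() before running the training loop.
class TrainingRunner {
 public:
  TrainingRunner(Parameters params, const Environment& env)
      : params_(std::move(params)), env_(env) {}
  // In the real runner this loads the model, configures and initializes
  // the session, registers providers, and restores checkpoints.
  void Initialize() { initialized_ = true; }
  bool Ready() const { return initialized_; }
 private:
  Parameters params_;
  Environment env_;
  bool initialized_ = false;
};
```

At launch time the binary itself would be started under mpirun/mpiexec so that each rank executes this same sequence.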