Principle: Microsoft ONNX Runtime Distributed Runtime Initialization
| Field | Value |
|---|---|
| Principle Name | Distributed_Runtime_Initialization |
| Overview | Initialization of the distributed training runtime including MPI processes, NCCL communicators, and CUDA execution providers. |
| Category | API Doc |
| Domains | Distributed_Training, Training_Infrastructure |
| Source Repository | microsoft/onnxruntime |
| Last Updated | 2026-02-10 |
Overview
Initialization of the distributed training runtime including MPI processes, NCCL communicators, and CUDA execution providers. The TrainingRunner constructor and Initialize() method orchestrate the multi-process setup required for distributed model training across GPUs and nodes.
Description
Runtime initialization sets up the multi-process training environment through several coordinated steps:
- MPI context: The MPIContext singleton provides process coordination, exposing world rank, world size, local rank, and local size. MPI is used for process management and communication between nodes.
- NCCL communicators: When use_nccl is enabled, NCCL (NVIDIA Collective Communications Library) is configured for optimized GPU-to-GPU gradient synchronization operations like AllReduce.
- CUDA execution providers: GPU computation is enabled by registering CUDA execution providers with the training session. The providers map in the parameters struct controls which execution providers are activated.
- ORT Environment: The ONNX Runtime Environment provides logging configuration and thread management for the training process.
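The MPI context accessors described above can be sketched as a singleton. This is a hypothetical stand-in, not the actual onnxruntime class: the real MPIContext wraps MPI calls such as MPI_Comm_rank and MPI_Comm_size, whereas the values here are hard-coded for illustration.

```cpp
// Hypothetical sketch of an MPIContext-style singleton exposing the four
// rank/size accessors described above. In a real run these values come
// from MPI at process launch; they are fixed here for illustration only.
class MPIContext {
 public:
  static const MPIContext& GetInstance() {
    static MPIContext instance;  // constructed once, shared by all callers
    return instance;
  }
  int GetWorldRank() const { return world_rank_; }  // rank across all nodes
  int GetWorldSize() const { return world_size_; }  // total process count
  int GetLocalRank() const { return local_rank_; }  // rank within this node
  int GetLocalSize() const { return local_size_; }  // processes on this node
 private:
  MPIContext() = default;
  int world_rank_ = 0;
  int world_size_ = 4;
  int local_rank_ = 0;
  int local_size_ = 4;
};
```

The singleton pattern matters here: every component (NCCL setup, execution provider selection, pipeline stage loading) must observe the same rank assignment, so the context is created once and queried everywhere.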
The initialization sequence in the TrainingRunner is:
- Constructor: Validates parameters (model path, optimizer name, gradient accumulation steps). Enforces that DeepSpeed ZeRO requires NCCL.
- Initialize(): Loads the model (either full model or pre-partitioned pipeline stage based on MPI rank), configures the TrainingSession with distributed settings, optimizer, mixed precision, loss function, TensorBoard, and graph transformers. Registers execution providers. Initializes the session. Sets up checkpointing and loads any existing checkpoint.
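The constructor-time validation can be sketched as follows. Field names mirror the prose; the real onnxruntime code reports failures through a Status type rather than the bool used here, and `deepspeed_zero_stage` is a simplified stand-in for the actual ZeRO configuration.

```cpp
#include <string>

// Hypothetical simplification of the TrainingRunner parameter struct.
struct Parameters {
  std::string model_path;
  std::string optimizer_name;
  int gradient_accumulation_steps = 1;
  bool use_nccl = false;
  int deepspeed_zero_stage = 0;  // 0 = ZeRO disabled
};

// Sketch of the constructor checks described above; returns true if the
// parameters are valid (the real code returns a Status instead).
bool ValidateParameters(const Parameters& p) {
  if (p.model_path.empty() || p.optimizer_name.empty()) return false;
  if (p.gradient_accumulation_steps < 1) return false;
  // DeepSpeed ZeRO partitions optimizer state across ranks and relies on
  // NCCL collectives, so ZeRO without NCCL is rejected up front.
  if (p.deepspeed_zero_stage > 0 && !p.use_nccl) return false;
  return true;
}
```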
The distributed configuration within Initialize() explicitly maps MPI context values:
- world_rank and world_size from MPIContext::GetInstance()
- data_parallel_size, horizontal_parallel_size, and pipeline_parallel_size from parameters
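The mapping above can be sketched with a hypothetical config struct; the field names follow the prose, not necessarily the exact onnxruntime struct, and the factoring check reflects the common constraint that the three parallelism degrees multiply to the world size.

```cpp
// Hypothetical distributed-run configuration assembled inside Initialize().
struct DistributedRunConfig {
  int world_rank;
  int world_size;
  int data_parallel_size;
  int horizontal_parallel_size;
  int pipeline_parallel_size;
};

// world_rank/world_size would come from MPIContext::GetInstance();
// the three parallelism sizes come from the user-supplied parameters.
// Returns a zeroed config if the sizes do not factor the world size.
DistributedRunConfig BuildDistributedConfig(int world_rank, int world_size,
                                            int dp, int hp, int pp) {
  if (dp * hp * pp != world_size) return {};  // inconsistent partitioning
  return {world_rank, world_size, dp, hp, pp};
}
```

For example, 4 processes could be split as data_parallel_size = 2 and pipeline_parallel_size = 2 with horizontal_parallel_size = 1.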
Theoretical Basis
Distributed training requires coordinated initialization across processes (ranks). MPI provides the process management layer, NCCL provides optimized GPU collective operations, and CUDA provides the GPU execution context.
The initialization order is critical:
- MPI must initialize first to establish the process group and assign ranks.
- Model loading must happen before graph transformations (the pipeline partitioned model may differ per rank).
- Session configuration (optimizer, loss function, mixed precision) must happen before session initialization because these modify the graph.
- Execution provider registration must happen before session initialization because it determines where operators will run.
- Checkpoint loading must happen after session initialization because the session graph must be finalized before loading parameter values.
This ordering ensures that all processes have a consistent view of the training graph and can begin training from the same state.
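The ordering constraints above can be expressed as a small checked state machine. The stage names come from the list; the enforcement mechanism itself is a hypothetical sketch, not onnxruntime code.

```cpp
// Stages in their required order, as listed above.
enum class Stage {
  Start = 0,
  MpiInit,              // process group established, ranks assigned
  ModelLoaded,          // full model or per-rank pipeline stage
  SessionConfigured,    // optimizer, loss, mixed precision (graph-modifying)
  ProvidersRegistered,  // decides where operators run
  SessionInitialized,   // graph finalized
  CheckpointLoaded      // parameter values restored into finalized graph
};

class InitSequence {
 public:
  // Returns false (and stays put) if `next` is not the immediate
  // successor of the current stage, i.e. a step was skipped or reordered.
  bool Advance(Stage next) {
    if (static_cast<int>(next) != static_cast<int>(stage_) + 1) return false;
    stage_ = next;
    return true;
  }
  Stage Current() const { return stage_; }
 private:
  Stage stage_ = Stage::Start;
};
```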
Usage
Runtime initialization proceeds in the following sequence, after configuration is complete:
- Set up MPI (typically via mpirun or mpiexec at launch time).
- Create an ONNX Runtime Environment with logging configuration.
- Construct a TrainingRunner with the Parameters struct and Environment.
- Call Initialize() to complete the setup.
- The runner is now ready to execute the training loop.
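The steps above can be sketched end to end with minimal stand-ins. All types here are hypothetical simplifications shaped like the API described in this document, not the actual onnxruntime training classes.

```cpp
#include <string>
#include <utility>

// Stand-in for the ONNX Runtime Environment (logging, thread management).
struct Environment {
  std::string log_severity = "WARNING";
};

// Stand-in for the TrainingRunner parameter struct.
struct Parameters {
  std::string model_path = "model.onnx";
  std::string optimizer_name = "AdamOptimizer";
};

// Stand-in runner: construct with parameters and environment, then call
// Initialize() before running the training loop.
class TrainingRunner {
 public:
  TrainingRunner(Parameters params, const Environment& env)
      : params_(std::move(params)), env_(env) {}
  // In the real runner this loads the model, configures and initializes
  // the session, registers providers, and restores checkpoints.
  void Initialize() { initialized_ = true; }
  bool Ready() const { return initialized_; }
 private:
  Parameters params_;
  Environment env_;
  bool initialized_ = false;
};
```

At launch time the binary itself would be started under mpirun/mpiexec so that each rank executes this same sequence.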