Environment:Microsoft Onnxruntime Distributed Training Environment
| Field | Value |
|---|---|
| sources | setup.py, requirements-training.txt, orttraining/orttraining/python/training/ortmodule, onnxruntime/core/providers/nccl, docs/ORTModule_Training_Guidelines.md |
| domains | distributed-training, multi-gpu, mpi, nccl, cuda |
| last_updated | 2026-02-10 |
Overview
Multi-GPU distributed training environment using MPI and NCCL for data-parallel and model-parallel training with the onnxruntime-training package on NVIDIA GPUs.
Description
The Distributed Training Environment extends the CUDA GPU Environment with multi-node, multi-GPU training capabilities. It relies on MPI (Message Passing Interface) for process coordination and NCCL (NVIDIA Collective Communications Library) for high-bandwidth GPU-to-GPU communication. The NCCL integration is implemented through NcclContext in nccl_common.cc and collective operations (AllReduce, AllGather, ReduceScatter) in nccl_kernels.cc. Horovod support is also available through horovod_kernels.h for users who prefer the Horovod distributed training framework. Building from source requires the --use_mpi flag along with CUDA, cuDNN, and NCCL installation paths. The training package is distributed under the name onnxruntime-training and includes additional Python dependencies such as cerberus for configuration validation, h5py for checkpoint I/O, and onnx for graph manipulation. The environment variable ORT_DISABLE_PYTHON_PACKAGE_LOCAL_VERSION can be set to control the package versioning scheme.
Usage
Use this environment whenever you need to:
- Train large models across multiple NVIDIA GPUs on a single node or across multiple nodes.
- Perform data-parallel distributed training with gradient synchronization via NCCL.
- Use Horovod-based distributed training with ONNX Runtime as the backend.
- Fine-tune large language models or vision models that do not fit in a single GPU's memory.
- Save and restore distributed training checkpoints.
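In data-parallel training, every rank holds a full model replica and averages gradients after each step, which keeps all replicas identical. The semantics can be sketched in a few lines of plain Python (a single-process toy with a least-squares loss; `local_gradient` and the model `y = w * x` are illustrative, not ONNX Runtime API):

```python
# Toy simulation of data-parallel gradient averaging (the job NCCL
# AllReduce does across GPUs). Four "ranks" each hold a data shard,
# compute a local gradient for w in the model y = w * x, and apply the
# averaged gradient identically on every replica.

def local_gradient(w, shard):
    # d/dw of mean squared error over this rank's shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def train_step(weights, shards, lr=0.01):
    grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
    avg = sum(grads) / len(grads)           # AllReduce(sum) / world_size
    return [w - lr * avg for w in weights]  # same update on every rank

# Ground truth y = 3x, split round-robin across 4 ranks
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]
weights = [0.0] * 4                          # replicas start in sync
for _ in range(200):
    weights = train_step(weights, shards)
```

Because every rank applies the same averaged gradient, the replicas never drift apart; that invariant is exactly what the collective-communication layer guarantees in the real system.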
System Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| NVIDIA GPU | 2x GPUs, Compute Capability 7.0+ | 8x A100 or H100 GPUs |
| CUDA Toolkit | 11.8 | 12.x |
| cuDNN | 8.x | 9.x |
| NCCL | 2.10+ | 2.18+ |
| MPI | OpenMPI 4.0+ or MPICH | OpenMPI 4.1+ |
| Python | 3.10 | 3.12 |
| Operating System | Linux x86_64 | Ubuntu 22.04+ x86_64 |
| RAM | 32 GB | 256 GB+ |
| GPU VRAM | 16 GB per GPU | 40-80 GB per GPU |
| Network | 10 GbE (multi-node) | InfiniBand HDR/NDR (multi-node) |
Dependencies
System Packages
| Package | Version | Purpose |
|---|---|---|
| CUDA Toolkit | 11.8+ | GPU compute runtime |
| cuDNN | 8.x or 9.x | Neural network acceleration |
| NCCL | 2.10+ | GPU collective communications |
| OpenMPI or MPICH | 4.0+ | Process management and inter-node communication |
| NVIDIA Driver | 470+ | GPU kernel driver |
Python Packages (requirements-training.txt)
| Package | Version Constraint | Purpose |
|---|---|---|
| onnxruntime-training | 1.25.0 | Distributed training runtime (setup.py L687) |
| cerberus | (latest) | Configuration schema validation |
| flatbuffers | (latest) | Model serialization |
| h5py | (latest) | HDF5 checkpoint read/write |
| numpy | >= 1.16.6 | Tensor operations |
| onnx | (latest) | ONNX graph construction and manipulation |
| packaging | (latest) | Version utilities |
| protobuf | (latest) | Protocol buffer support |
| sympy | (latest) | Symbolic math for shape inference |
| setuptools | >= 61.0.0 | Build system |
Credentials
| Variable | Purpose | Required |
|---|---|---|
| CUDA_HOME | Path to CUDA toolkit installation | Yes (build from source) |
| CUDNN_HOME | Path to cuDNN installation | Yes (build from source) |
| CUDACXX | Path to CUDA C++ compiler (nvcc) | Yes (build from source) |
| nccl_home | Path to NCCL installation (build argument) | Yes (build from source) |
| ORT_DISABLE_PYTHON_PACKAGE_LOCAL_VERSION | Disables local version suffix in package version string (setup.py L689) | No |
| ORT_CUDA_UNAVAILABLE | Suppresses CUDA provider registration | No |
Quick Install
Pre-built wheel:
pip install onnxruntime-training==1.25.0
Install training dependencies:
pip install -r requirements-training.txt
Build from source with MPI and CUDA (ORTModule_Training_Guidelines.md L14):
export CUDA_HOME=/usr/local/cuda-11.8
export CUDNN_HOME=/usr/local/cuda-11.8
export CUDACXX=$CUDA_HOME/bin/nvcc
./build.sh --config RelWithDebInfo \
  --use_cuda \
  --enable_training \
  --build_wheel \
  --skip_tests \
  --cuda_version=11.8 \
  --parallel 8 \
  --use_mpi
Launch distributed training (example with 4 GPUs):
mpirun -np 4 python train_distributed.py
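Under mpirun, each process must select its own GPU before creating a CUDA context. OpenMPI exposes rank information through environment variables (OMPI_COMM_WORLD_RANK, OMPI_COMM_WORLD_LOCAL_RANK, OMPI_COMM_WORLD_SIZE); a small helper can derive the device index from them. This is a sketch, and the variable names are OpenMPI-specific:

```python
import os

def mpi_rank_info():
    """Read OpenMPI's rank environment variables (set by mpirun).

    Falls back to single-process defaults when not launched via mpirun.
    Other launchers use different names (e.g. MPICH sets PMI_RANK).
    """
    rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", 0))
    local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", 0))
    world_size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", 1))
    return rank, local_rank, world_size

rank, local_rank, world_size = mpi_rank_info()
# One process per GPU: bind this rank to its node-local device, e.g.
# by restricting visibility before any CUDA initialization happens.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", str(local_rank))
```

The local rank (position within the node) picks the GPU; the global rank identifies the process across all nodes.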
Code Evidence
NCCL context initialization (nccl_common.cc)
# onnxruntime/core/providers/nccl/nccl_common.cc
// NcclContext initializes a NCCL communicator for the given set of GPUs.
// It manages the lifecycle of ncclComm_t and coordinates with MPI ranks
// to establish communication channels between GPUs.
NcclContext::NcclContext() {
// Initialize NCCL communicator using MPI rank information
}
The NcclContext class manages NCCL communicator creation and teardown, using MPI rank information to assign each process to its corresponding GPU.
NCCL collective kernels (nccl_kernels.cc)
# onnxruntime/core/providers/nccl/nccl_kernels.cc
// Implements AllReduce, AllGather, and ReduceScatter operations
// using NCCL for efficient multi-GPU gradient synchronization.
// All operations are registered as CUDA-only kernel implementations.
The NCCL kernels file registers collective communication operations (AllReduce, AllGather, ReduceScatter) that are essential for gradient synchronization during distributed training.
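Semantically, the three collectives can be modeled in a few lines of plain Python. This is a single-process model of the math NCCL performs; the real kernels operate on device buffers using ring/tree algorithms:

```python
# Single-process models of NCCL collective semantics. Each element of
# `per_rank` is the buffer held by one rank before the collective.

def all_reduce_sum(per_rank):
    # Every rank ends up with the elementwise sum of all buffers.
    total = [sum(vals) for vals in zip(*per_rank)]
    return [total[:] for _ in per_rank]

def all_gather(per_rank):
    # Every rank ends up with the concatenation of all buffers.
    gathered = [x for buf in per_rank for x in buf]
    return [gathered[:] for _ in per_rank]

def reduce_scatter_sum(per_rank):
    # Buffers are summed, then each rank keeps one 1/world_size slice.
    total = [sum(vals) for vals in zip(*per_rank)]
    chunk = len(total) // len(per_rank)
    return [total[i * chunk:(i + 1) * chunk] for i in range(len(per_rank))]

grads = [[1.0, 2.0], [3.0, 4.0]]  # two ranks, two gradient elements each
```

Note that ReduceScatter followed by AllGather on the slices is equivalent to AllReduce, which is how many optimizer-state-sharding schemes split the work.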
Horovod integration (horovod_kernels.h)
# onnxruntime/core/providers/horovod/horovod_kernels.h
// Horovod kernel declarations for distributed training.
// Provides an alternative to raw NCCL for users of the Horovod framework.
Training package name (setup.py:687)
# setup.py:687
name="onnxruntime-training",
Common Errors
| Error | Cause | Solution |
|---|---|---|
| NCCL error: unhandled system error | NCCL version incompatible with driver or network configuration | Upgrade NCCL to match the CUDA version; check nvidia-smi and NCCL logs |
| MPI_Init failed | MPI not installed or misconfigured | Install OpenMPI: apt-get install libopenmpi-dev and verify with mpirun --version |
| NCCL WARN: Connect to ... failed | Network connectivity issue between nodes | Check firewall rules; ensure all nodes can reach each other on NCCL ports |
| RuntimeError: CUDA out of memory | Model or batch too large for available VRAM | Reduce per-GPU batch size, enable gradient checkpointing, or use more GPUs |
| ImportError: No module named 'onnxruntime.training' | Installed onnxruntime instead of onnxruntime-training | Uninstall onnxruntime, then pip install onnxruntime-training |
| All processes must call NCCL collectively | Rank mismatch or deadlock in a collective operation | Ensure all MPI ranks execute the same training loop and issue collective calls in the same order |
Compatibility Notes
- MPI implementations: OpenMPI and MPICH are both supported. Intel MPI may work but is not officially tested.
- NCCL versions: NCCL 2.10+ is required. For CUDA 12.x, use NCCL 2.18 or newer for best performance and stability.
- Multi-node training: Requires passwordless SSH between nodes, shared filesystem for checkpoints, and low-latency networking (InfiniBand recommended).
- Horovod: Horovod support is an alternative to native NCCL-based distribution. Install Horovod separately: pip install horovod.
- Linux only: Distributed training with MPI and NCCL is supported only on Linux. Windows and macOS are not supported for multi-GPU training.
- Checkpoint compatibility: Checkpoints saved with distributed training contain sharded state. Use the provided checkpoint utilities to merge or reshape for different GPU counts.
- Package conflict: The onnxruntime and onnxruntime-training packages cannot be installed simultaneously. Uninstall one before installing the other.
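The sharded-checkpoint note above can be illustrated with a toy merge: each rank saves only its partition of the state, and a merge step stitches the shards back into one state dict. This sketch uses a hypothetical shard layout (name -> list slice, partitioned in rank order); the real utilities in onnxruntime.training use their own format:

```python
# Toy merge of per-rank checkpoint shards into one full state dict.
# Hypothetical layout for illustration only: each shard maps a parameter
# name to its contiguous slice of that parameter's values.

def merge_shards(shards):
    """Concatenate per-rank slices back into full parameters (rank order)."""
    merged = {}
    for shard in shards:  # shards must be ordered by rank
        for name, part in shard.items():
            merged.setdefault(name, []).extend(part)
    return merged

rank0 = {"fc.weight": [0.1, 0.2], "fc.bias": [0.0]}
rank1 = {"fc.weight": [0.3, 0.4], "fc.bias": [0.5]}
full = merge_shards([rank0, rank1])
```

Reshaping for a different GPU count is the inverse: re-partition the merged state into as many slices as the new world size.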
Related Pages
- Implementation:Microsoft_Onnxruntime_TrainingRunner_Parameters
- Implementation:Microsoft_Onnxruntime_TrainingRunner_Initialize
- Implementation:Microsoft_Onnxruntime_DataLoader_Init
- Implementation:Microsoft_Onnxruntime_TrainingRunner_Run
- Implementation:Microsoft_Onnxruntime_Checkpoint_Save_Load
- Implementation:Microsoft_Onnxruntime_Summary_Ops