Environment:Microsoft Onnxruntime Distributed Training Environment
| Field | Value |
|---|---|
| sources | setup.py, requirements-training.txt, orttraining/orttraining/python/training/ortmodule, onnxruntime/core/providers/nccl, docs/ORTModule_Training_Guidelines.md |
| domains | distributed-training, multi-gpu, mpi, nccl, cuda |
| last_updated | 2026-02-10 |
Overview
Multi-GPU distributed training environment using MPI and NCCL for data-parallel and model-parallel training with the onnxruntime-training package on NVIDIA GPUs.
Description
The Distributed Training Environment extends the CUDA GPU Environment with multi-node, multi-GPU training capabilities. It relies on MPI (Message Passing Interface) for process coordination and NCCL (NVIDIA Collective Communications Library) for high-bandwidth GPU-to-GPU communication. The NCCL integration is implemented through NcclContext in nccl_common.cc and collective operations (AllReduce, AllGather, ReduceScatter) in nccl_kernels.cc. Horovod support is also available through horovod_kernels.h for users who prefer the Horovod distributed training framework. Building from source requires the --use_mpi flag along with CUDA, cuDNN, and NCCL installation paths. The training package is distributed under the name onnxruntime-training and includes additional Python dependencies such as cerberus for configuration validation, h5py for checkpoint I/O, and onnx for graph manipulation. The environment variable ORT_DISABLE_PYTHON_PACKAGE_LOCAL_VERSION can be set to control the package versioning scheme.
Usage
Use this environment whenever you need to:
- Train large models across multiple NVIDIA GPUs on a single node or across multiple nodes.
- Perform data-parallel distributed training with gradient synchronization via NCCL.
- Use Horovod-based distributed training with ONNX Runtime as the backend.
- Fine-tune large language models or vision models that do not fit in a single GPU's memory.
- Save and restore distributed training checkpoints.
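In data-parallel training, every rank holds a full model replica and averages gradients after each step, which keeps all replicas identical. The semantics can be sketched in a few lines of plain Python (a single-process toy with a least-squares loss; `local_gradient` and the model `y = w * x` are illustrative, not ONNX Runtime API):

```python
# Toy simulation of data-parallel gradient averaging (the job NCCL
# AllReduce does across GPUs). Four "ranks" each hold a data shard,
# compute a local gradient for w in the model y = w * x, and apply the
# averaged gradient identically on every replica.

def local_gradient(w, shard):
    # d/dw of mean squared error over this rank's shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def train_step(weights, shards, lr=0.01):
    grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
    avg = sum(grads) / len(grads)           # AllReduce(sum) / world_size
    return [w - lr * avg for w in weights]  # same update on every rank

# Ground truth y = 3x, split round-robin across 4 ranks
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]
weights = [0.0] * 4                          # replicas start in sync
for _ in range(200):
    weights = train_step(weights, shards)
```

Because every rank applies the same averaged gradient, the replicas never drift apart; that invariant is exactly what the collective-communication layer guarantees in the real system.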
System Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| NVIDIA GPU | 2x GPUs, Compute Capability 7.0+ | 8x A100 or H100 GPUs |
| CUDA Toolkit | 11.8 | 12.x |
| cuDNN | 8.x | 9.x |
| NCCL | 2.10+ | 2.18+ |
| MPI | OpenMPI 4.0+ or MPICH | OpenMPI 4.1+ |
| Python | 3.10 | 3.12 |
| Operating System | Linux x86_64 | Ubuntu 22.04+ x86_64 |
| RAM | 32 GB | 256 GB+ |
| GPU VRAM | 16 GB per GPU | 40-80 GB per GPU |
| Network | 10 GbE (multi-node) | InfiniBand HDR/NDR (multi-node) |
Dependencies
System Packages
| Package | Version | Purpose |
|---|---|---|
| CUDA Toolkit | 11.8+ | GPU compute runtime |
| cuDNN | 8.x or 9.x | Neural network acceleration |
| NCCL | 2.10+ | GPU collective communications |
| OpenMPI or MPICH | 4.0+ | Process management and inter-node communication |
| NVIDIA Driver | 470+ | GPU kernel driver |
Python Packages (requirements-training.txt)
| Package | Version Constraint | Purpose |
|---|---|---|
| onnxruntime-training | 1.25.0 | Distributed training runtime (setup.py L687) |
| cerberus | (latest) | Configuration schema validation |
| flatbuffers | (latest) | Model serialization |
| h5py | (latest) | HDF5 checkpoint read/write |
| numpy | >= 1.16.6 | Tensor operations |
| onnx | (latest) | ONNX graph construction and manipulation |
| packaging | (latest) | Version utilities |
| protobuf | (latest) | Protocol buffer support |
| sympy | (latest) | Symbolic math for shape inference |
| setuptools | >= 61.0.0 | Build system |
Credentials
| Variable | Purpose | Required |
|---|---|---|
| CUDA_HOME | Path to CUDA toolkit installation | Yes (build from source) |
| CUDNN_HOME | Path to cuDNN installation | Yes (build from source) |
| CUDACXX | Path to CUDA C++ compiler (nvcc) | Yes (build from source) |
| nccl_home | Path to NCCL installation (build argument) | Yes (build from source) |
| ORT_DISABLE_PYTHON_PACKAGE_LOCAL_VERSION | Disables local version suffix in package version string (setup.py L689) | No |
| ORT_CUDA_UNAVAILABLE | Suppresses CUDA provider registration | No |
Quick Install
Pre-built wheel:
pip install onnxruntime-training==1.25.0
Install training dependencies:
pip install -r requirements-training.txt
Build from source with MPI and CUDA (ORTModule_Training_Guidelines.md L14):
export CUDA_HOME=/usr/local/cuda-11.8
export CUDNN_HOME=/usr/local/cuda-11.8
export CUDACXX=$CUDA_HOME/bin/nvcc
./build.sh --config RelWithDebInfo \
  --use_cuda \
  --enable_training \
  --build_wheel \
  --skip_tests \
  --cuda_version=11.8 \
  --parallel 8 \
  --use_mpi
Launch distributed training (example with 4 GPUs):
mpirun -np 4 python train_distributed.py
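Under mpirun, each process must select its own GPU before creating a CUDA context. OpenMPI exposes rank information through environment variables (OMPI_COMM_WORLD_RANK, OMPI_COMM_WORLD_LOCAL_RANK, OMPI_COMM_WORLD_SIZE); a small helper can derive the device index from them. This is a sketch, and the variable names are OpenMPI-specific:

```python
import os

def mpi_rank_info():
    """Read OpenMPI's rank environment variables (set by mpirun).

    Falls back to single-process defaults when not launched via mpirun.
    Other launchers use different names (e.g. MPICH sets PMI_RANK).
    """
    rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", 0))
    local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", 0))
    world_size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", 1))
    return rank, local_rank, world_size

rank, local_rank, world_size = mpi_rank_info()
# One process per GPU: bind this rank to its node-local device, e.g.
# by restricting visibility before any CUDA initialization happens.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", str(local_rank))
```

The local rank (position within the node) picks the GPU; the global rank identifies the process across all nodes.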
Code Evidence
NCCL context initialization (nccl_common.cc)
# onnxruntime/core/providers/nccl/nccl_common.cc
// NcclContext initializes a NCCL communicator for the given set of GPUs.
// It manages the lifecycle of ncclComm_t and coordinates with MPI ranks
// to establish communication channels between GPUs.
NcclContext::NcclContext() {
// Initialize NCCL communicator using MPI rank information
}
The NcclContext class manages NCCL communicator creation and teardown, using MPI rank information to assign each process to its corresponding GPU.
NCCL collective kernels (nccl_kernels.cc)
# onnxruntime/core/providers/nccl/nccl_kernels.cc
// Implements AllReduce, AllGather, and ReduceScatter operations
// using NCCL for efficient multi-GPU gradient synchronization.
// All operations are registered as CUDA-only kernel implementations.
The NCCL kernels file registers collective communication operations (AllReduce, AllGather, ReduceScatter) that are essential for gradient synchronization during distributed training.
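Semantically, the three collectives can be modeled in a few lines of plain Python. This is a single-process model of the math NCCL performs; the real kernels operate on device buffers using ring/tree algorithms:

```python
# Single-process models of NCCL collective semantics. Each element of
# `per_rank` is the buffer held by one rank before the collective.

def all_reduce_sum(per_rank):
    # Every rank ends up with the elementwise sum of all buffers.
    total = [sum(vals) for vals in zip(*per_rank)]
    return [total[:] for _ in per_rank]

def all_gather(per_rank):
    # Every rank ends up with the concatenation of all buffers.
    gathered = [x for buf in per_rank for x in buf]
    return [gathered[:] for _ in per_rank]

def reduce_scatter_sum(per_rank):
    # Buffers are summed, then each rank keeps one 1/world_size slice.
    total = [sum(vals) for vals in zip(*per_rank)]
    chunk = len(total) // len(per_rank)
    return [total[i * chunk:(i + 1) * chunk] for i in range(len(per_rank))]

grads = [[1.0, 2.0], [3.0, 4.0]]  # two ranks, two gradient elements each
```

Note that ReduceScatter followed by AllGather on the slices is equivalent to AllReduce, which is how many optimizer-state-sharding schemes split the work.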
Horovod integration (horovod_kernels.h)
# onnxruntime/core/providers/horovod/horovod_kernels.h
// Horovod kernel declarations for distributed training.
// Provides an alternative to raw NCCL for users of the Horovod framework.
Training package name (setup.py:687)
# setup.py:687
name="onnxruntime-training",
Common Errors
| Error | Cause | Solution |
|---|---|---|
| NCCL error: unhandled system error | NCCL version incompatible with driver or network configuration | Upgrade NCCL to match the CUDA version; check nvidia-smi and NCCL logs |
| MPI_Init failed | MPI not installed or misconfigured | Install OpenMPI: apt-get install libopenmpi-dev and verify with mpirun --version |
| NCCL WARN: Connect to ... failed | Network connectivity issue between nodes | Check firewall rules; ensure all nodes can reach each other on NCCL ports |
| RuntimeError: CUDA out of memory | Model or batch too large for available VRAM | Reduce per-GPU batch size, enable gradient checkpointing, or use more GPUs |
| ImportError: No module named 'onnxruntime.training' | Installed onnxruntime instead of onnxruntime-training | Uninstall onnxruntime, then pip install onnxruntime-training |
| All processes must call NCCL collectively | Rank mismatch or deadlock in a collective operation | Ensure all MPI ranks execute the same training loop and issue collective calls in the same order |
Compatibility Notes
- MPI implementations: OpenMPI and MPICH are both supported. Intel MPI may work but is not officially tested.
- NCCL versions: NCCL 2.10+ is required. For CUDA 12.x, use NCCL 2.18 or newer for best performance and stability.
- Multi-node training: Requires passwordless SSH between nodes, shared filesystem for checkpoints, and low-latency networking (InfiniBand recommended).
- Horovod: Horovod support is an alternative to native NCCL-based distribution. Install Horovod separately: pip install horovod.
- Linux only: Distributed training with MPI and NCCL is supported only on Linux. Windows and macOS are not supported for multi-GPU training.
- Checkpoint compatibility: Checkpoints saved with distributed training contain sharded state. Use the provided checkpoint utilities to merge or reshape for different GPU counts.
- Package conflict: The onnxruntime and onnxruntime-training packages cannot be installed simultaneously. Uninstall one before installing the other.
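The sharded-checkpoint note above can be illustrated with a toy merge: each rank saves only its partition of the state, and a merge step stitches the shards back into one state dict. This sketch uses a hypothetical shard layout (name -> list slice, partitioned in rank order); the real utilities in onnxruntime.training use their own format:

```python
# Toy merge of per-rank checkpoint shards into one full state dict.
# Hypothetical layout for illustration only: each shard maps a parameter
# name to its contiguous slice of that parameter's values.

def merge_shards(shards):
    """Concatenate per-rank slices back into full parameters (rank order)."""
    merged = {}
    for shard in shards:  # shards must be ordered by rank
        for name, part in shard.items():
            merged.setdefault(name, []).extend(part)
    return merged

rank0 = {"fc.weight": [0.1, 0.2], "fc.bias": [0.0]}
rank1 = {"fc.weight": [0.3, 0.4], "fc.bias": [0.5]}
full = merge_shards([rank0, rank1])
```

Reshaping for a different GPU count is the inverse: re-partition the merged state into as many slices as the new world size.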
Related Pages
- Implementation:Microsoft_Onnxruntime_TrainingRunner_Parameters
- Implementation:Microsoft_Onnxruntime_TrainingRunner_Initialize
- Implementation:Microsoft_Onnxruntime_DataLoader_Init
- Implementation:Microsoft_Onnxruntime_TrainingRunner_Run
- Implementation:Microsoft_Onnxruntime_Checkpoint_Save_Load
- Implementation:Microsoft_Onnxruntime_Summary_Ops