
Environment:Microsoft Onnxruntime Distributed Training Environment

Field        | Value
sources      | setup.py, requirements-training.txt, orttraining/orttraining/python/training/ortmodule, onnxruntime/core/providers/nccl, docs/ORTModule_Training_Guidelines.md
domains      | distributed-training, multi-gpu, mpi, nccl, cuda
last_updated | 2026-02-10

Overview

A multi-GPU distributed training environment that uses MPI and NCCL for data-parallel and model-parallel training with the onnxruntime-training package on NVIDIA GPUs.

Description

The Distributed Training Environment extends the CUDA GPU Environment with multi-node, multi-GPU training capabilities. It relies on MPI (Message Passing Interface) for process coordination and NCCL (NVIDIA Collective Communications Library) for high-bandwidth GPU-to-GPU communication. The NCCL integration is implemented through NcclContext in nccl_common.cc, with the collective operations (AllReduce, AllGather, ReduceScatter) in nccl_kernels.cc. Horovod support is also available through horovod_kernels.h for users who prefer the Horovod distributed training framework.

Building from source requires the --use_mpi flag along with the CUDA, cuDNN, and NCCL installation paths. The training package is distributed under the name onnxruntime-training and includes additional Python dependencies such as cerberus for configuration validation, h5py for checkpoint I/O, and onnx for graph manipulation. The environment variable ORT_DISABLE_PYTHON_PACKAGE_LOCAL_VERSION can be set to control the package versioning scheme.
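
A quick sanity check that the training build, rather than the inference-only package, is installed (a minimal sketch; ORTModule is only importable from onnxruntime-training, and get_device() reports whether the build sees a GPU):

# Verify the training package and CUDA visibility.
import onnxruntime as ort
from onnxruntime.training import ORTModule  # fails on plain onnxruntime

print(ort.__version__)   # e.g. 1.25.0
print(ort.get_device())  # expect "GPU" on a CUDA-enabled build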

Usage

Use this environment whenever you need to:

  • Train large models across multiple NVIDIA GPUs on a single node or across multiple nodes.
  • Perform data-parallel distributed training with gradient synchronization via NCCL (a runnable sketch follows this list).
  • Use Horovod-based distributed training with ONNX Runtime as the backend.
  • Fine-tune large language models or vision models that do not fit in a single GPU's memory.
  • Save and restore distributed training checkpoints.
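
A minimal sketch of the data-parallel pattern, not code from the repository: it assumes one GPU per MPI rank on a single node, an OpenMPI launcher (which exports OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE), and PyTorch with the NCCL backend; ORTModule is the documented model wrapper from onnxruntime-training.

# Hypothetical data-parallel step: each rank trains on its own GPU and
# gradients are averaged with an NCCL AllReduce after backward().
import os

import torch
import torch.distributed as dist
import torch.nn.functional as F
from onnxruntime.training import ORTModule

rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
world_size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", "1"))
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")    # single-node default
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("nccl", rank=rank, world_size=world_size)
torch.cuda.set_device(rank)                          # one GPU per rank

model = ORTModule(torch.nn.Linear(1024, 10).cuda())  # any nn.Module works
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

inputs = torch.randn(32, 1024, device="cuda")        # stand-in for a data shard
labels = torch.randint(0, 10, (32,), device="cuda")

loss = F.cross_entropy(model(inputs), labels)
loss.backward()
for p in model.parameters():                         # gradient synchronization
    if p.grad is not None:
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world_size
optimizer.step()

Launched with mpirun, each rank computes gradients on its own shard and the explicit all_reduce averages them; this is the role the NCCL collective kernels play when the synchronization happens inside the ONNX Runtime graph.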

System Requirements

Requirement      | Minimum                            | Recommended
NVIDIA GPU       | 2x GPUs, Compute Capability 7.0+   | 8x A100 or H100 GPUs
CUDA Toolkit     | 11.8                               | 12.x
cuDNN            | 8.x                                | 9.x
NCCL             | 2.10+                              | 2.18+
MPI              | OpenMPI 4.0+ or MPICH              | OpenMPI 4.1+
Python           | 3.10                               | 3.12
Operating System | Linux x86_64                       | Ubuntu 22.04+ x86_64
RAM              | 32 GB                              | 256 GB+
GPU VRAM         | 16 GB per GPU                      | 40-80 GB per GPU
Network          | 10 GbE (multi-node)                | InfiniBand HDR/NDR (multi-node)

Dependencies

System Packages

Package          | Version    | Purpose
CUDA Toolkit     | 11.8+      | GPU compute runtime
cuDNN            | 8.x or 9.x | Neural network acceleration
NCCL             | 2.10+      | GPU collective communications
OpenMPI or MPICH | 4.0+       | Process management and inter-node communication
NVIDIA Driver    | 470+       | GPU kernel driver

Python Packages (requirements-training.txt)

Package              | Version Constraint | Purpose
onnxruntime-training | 1.25.0             | Distributed training runtime (setup.py L687)
cerberus             | (latest)           | Configuration schema validation
flatbuffers          | (latest)           | Model serialization
h5py                 | (latest)           | HDF5 checkpoint read/write
numpy                | >= 1.16.6          | Tensor operations
onnx                 | (latest)           | ONNX graph construction and manipulation
packaging            | (latest)           | Version utilities
protobuf             | (latest)           | Protocol buffer support
sympy                | (latest)           | Symbolic math for shape inference
setuptools           | >= 61.0.0          | Build system

Environment Variables

Variable                                 | Purpose                                                            | Required
CUDA_HOME                                | Path to CUDA toolkit installation                                  | Yes (build from source)
CUDNN_HOME                               | Path to cuDNN installation                                         | Yes (build from source)
CUDACXX                                  | Path to CUDA C++ compiler (nvcc)                                   | Yes (build from source)
nccl_home                                | Path to NCCL installation (build argument)                         | Yes (build from source)
ORT_DISABLE_PYTHON_PACKAGE_LOCAL_VERSION | Disables local version suffix in package version string (setup.py L689) | No
ORT_CUDA_UNAVAILABLE                     | Suppresses CUDA provider registration                              | No

Quick Install

Pre-built wheel:

pip install onnxruntime-training==1.25.0

Install training dependencies:

pip install -r requirements-training.txt

Build from source with MPI and CUDA (ORTModule_Training_Guidelines.md L14):

export CUDA_HOME=/usr/local/cuda-11.8
export CUDNN_HOME=/usr/local/cuda-11.8
export CUDACXX=$CUDA_HOME/bin/nvcc

./build.sh --config RelWithDebInfo \
  --use_cuda \
  --enable_training \
  --build_wheel \
  --skip_tests \
  --cuda_version=11.8 \
  --parallel 8 \
  --use_mpi

Launch distributed training (example with 4 GPUs):

mpirun -np 4 python train_distributed.py
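
train_distributed.py is a placeholder name from the launch command above. A typical preamble for such a script (a sketch assuming mpi4py is installed) pins each MPI rank to a local GPU before any CUDA work happens:

# Map each MPI rank to a local GPU; ranks on the same node share its devices.
from mpi4py import MPI
import torch

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
world_size = comm.Get_size()
local_rank = rank % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
print(f"rank {rank}/{world_size} -> cuda:{local_rank}")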

Code Evidence

NCCL context initialization (nccl_common.cc)

// onnxruntime/core/providers/nccl/nccl_common.cc
// NcclContext initializes a NCCL communicator for the given set of GPUs.
// It manages the lifecycle of ncclComm_t and coordinates with MPI ranks
// to establish communication channels between GPUs.
NcclContext::NcclContext() {
  // Initialize NCCL communicator using MPI rank information
}

The NcclContext class manages NCCL communicator creation and teardown, using MPI rank information to assign each process to its corresponding GPU.

NCCL collective kernels (nccl_kernels.cc)

// onnxruntime/core/providers/nccl/nccl_kernels.cc
// Implements AllReduce, AllGather, and ReduceScatter operations
// using NCCL for efficient multi-GPU gradient synchronization.
// All operations are registered as CUDA-only kernel implementations.

The NCCL kernels file registers collective communication operations (AllReduce, AllGather, ReduceScatter) that are essential for gradient synchronization during distributed training.
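
The semantics of the three collectives can be illustrated without GPUs (a NumPy sketch of what each rank ends up holding; the real kernels run these reductions on-device via NCCL):

# Two ranks, each holding a local gradient buffer.
import numpy as np

ranks = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]

allreduce = ranks[0] + ranks[1]          # every rank gets [4., 6.]
allgather = np.concatenate(ranks)        # every rank gets [1., 2., 3., 4.]
reduced = ranks[0] + ranks[1]            # ReduceScatter: reduce, then shard;
shards = np.split(reduced, len(ranks))   # rank 0 keeps [4.], rank 1 keeps [6.]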

Horovod integration (horovod_kernels.h)

// onnxruntime/core/providers/horovod/horovod_kernels.h
// Horovod kernel declarations for distributed training.
// Provides an alternative to raw NCCL for users of the Horovod framework.

Training package name (setup.py:687)

# setup.py:687
name="onnxruntime-training",

Common Errors

Error | Cause | Solution
NCCL error: unhandled system error | NCCL version incompatible with driver or network configuration | Upgrade NCCL to match the CUDA version; check nvidia-smi and the NCCL logs
MPI_Init failed | MPI not installed or misconfigured | Install OpenMPI (apt-get install libopenmpi-dev) and verify with mpirun --version
NCCL WARN: Connect to ... failed | Network connectivity issue between nodes | Check firewall rules; ensure all nodes can reach each other on NCCL ports
RuntimeError: CUDA out of memory | Model or batch too large for available VRAM | Reduce per-GPU batch size, enable gradient checkpointing, or use more GPUs
ImportError: No module named 'onnxruntime.training' | Installed onnxruntime instead of onnxruntime-training | Uninstall onnxruntime and run pip install onnxruntime-training
All processes must call NCCL collectively | Rank mismatch or deadlock in a collective operation | Ensure all MPI ranks execute the same training loop and issue collective calls in the same order
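
For the NCCL rows above, NCCL's own logging is usually the fastest diagnostic. NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables; set them before the first collective runs (sketch below) and initialization and transport details are printed to stderr:

# Turn on NCCL's built-in logging before any communicator is created.
import os

os.environ["NCCL_DEBUG"] = "INFO"             # WARN is quieter; INFO shows setup
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"  # focus on init and networking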

Compatibility Notes

  • MPI implementations: OpenMPI and MPICH are both supported. Intel MPI may work but is not officially tested.
  • NCCL versions: NCCL 2.10+ is required. For CUDA 12.x, use NCCL 2.18 or newer for best performance and stability.
  • Multi-node training: Requires passwordless SSH between nodes, shared filesystem for checkpoints, and low-latency networking (InfiniBand recommended).
  • Horovod: Horovod support is an alternative to native NCCL-based distribution. Install Horovod separately: pip install horovod.
  • Linux only: Distributed training with MPI and NCCL is only supported on Linux. Windows and macOS are not supported for multi-GPU training.
  • Checkpoint compatibility: Checkpoints saved with distributed training contain sharded state. Use the provided checkpoint utilities to merge or reshape for different GPU counts.
  • Package conflict: The onnxruntime and onnxruntime-training packages cannot be installed simultaneously. Uninstall one before installing the other.
