Implementation:Microsoft Onnxruntime CUDA NcclKernels
| Knowledge Sources | |
|---|---|
| Domains | Training, CUDA_Kernels |
| Last Updated | 2026-02-10 04:00 GMT |
Overview
Concrete tool for NCCL AllReduce, AllGather, and ReduceScatter collective operations in the ONNX Runtime CUDA training framework.
Description
Implements three NCCL collective operators for CUDA:
- NcclAllReduce performs a sum all-reduce over all inputs, treating them as a single contiguous buffer. It computes the byte range from the start of the first tensor to the end of the last tensor (including any padding gaps between tensors) and calls ncclAllReduce once on that entire range.
- NcclAllGather gathers data from all ranks. It pads the total element count so that it aligns with both 32 bytes and the world size, copies each rank's slice into a fusion buffer, calls ncclAllGather, then copies the results to the output tensors.
- NcclReduceScatter reduces data and scatters it across ranks. It pads for alignment in the same way, copies all inputs into a fusion buffer, calls ncclReduceScatter, and copies the relevant slice to the outputs.

All three operators are registered with variadic aliasing and contiguous input allocation, and support all IEEE float tensor types.
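The alignment step described for NcclAllGather and NcclReduceScatter can be illustrated with a small host-side sketch. The helper names `PaddedCount` and `PerRankCount` are hypothetical and not part of ONNX Runtime, and the sketch assumes the alignment unit is `world_size * 32` elements; the exact rounding rule in nccl_kernels.cc may differ.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical helper (not an ONNX Runtime API): rounds total_count up to the
// next multiple of world_size * 32 elements. The padded buffer then splits
// into world_size equal slices, each a multiple of 32 elements and therefore
// a multiple of 32 bytes for any element size.
inline int64_t PaddedCount(int64_t total_count, int64_t world_size) {
  if (total_count < 0 || world_size <= 0) {
    throw std::invalid_argument("total_count must be >= 0 and world_size > 0");
  }
  const int64_t alignment = world_size * 32;  // assumed alignment unit
  return (total_count + alignment - 1) / alignment * alignment;
}

// Number of elements each rank owns in the padded fusion buffer.
inline int64_t PerRankCount(int64_t total_count, int64_t world_size) {
  return PaddedCount(total_count, world_size) / world_size;
}
```

With 100 elements and a world size of 4, this sketch pads to 128 elements so that each rank owns a 32-element slice.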
Usage
Used during distributed training for gradient synchronization (AllReduce), parameter gathering (AllGather in ZeRO), and gradient scattering (ReduceScatter in ZeRO).
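To make the roles of the three collectives concrete, here is a host-only simulation of their semantics, with one `std::vector<float>` standing in for each rank's buffer. It is a sketch for illustration only: the `Sim*` functions and the `RankBuffers` alias are hypothetical names, no NCCL or CUDA calls are involved, and the reduce-scatter version assumes the buffer length is divisible by the number of ranks.

```cpp
#include <cstddef>
#include <vector>

// One buffer per simulated rank (hypothetical type for this sketch).
using RankBuffers = std::vector<std::vector<float>>;

// AllReduce (sum): every rank ends up with the element-wise sum of all buffers.
inline RankBuffers SimAllReduceSum(const RankBuffers& in) {
  std::vector<float> sum(in.at(0).size(), 0.0f);
  for (const auto& buf : in) {
    for (std::size_t i = 0; i < sum.size(); ++i) sum[i] += buf.at(i);
  }
  return RankBuffers(in.size(), sum);
}

// AllGather: every rank ends up with all buffers concatenated in rank order.
inline RankBuffers SimAllGather(const RankBuffers& in) {
  std::vector<float> all;
  for (const auto& buf : in) all.insert(all.end(), buf.begin(), buf.end());
  return RankBuffers(in.size(), all);
}

// ReduceScatter (sum): rank r keeps only the r-th equal slice of the sum.
// Assumes the buffer length is divisible by the number of ranks.
inline RankBuffers SimReduceScatterSum(const RankBuffers& in) {
  const std::size_t world = in.size();
  const std::vector<float> sum = SimAllReduceSum(in).at(0);
  const std::size_t slice = sum.size() / world;
  RankBuffers out;
  for (std::size_t r = 0; r < world; ++r) {
    const auto first = sum.begin() + static_cast<std::ptrdiff_t>(r * slice);
    const auto last = first + static_cast<std::ptrdiff_t>(slice);
    out.emplace_back(first, last);
  }
  return out;
}
```

In this picture, gradient synchronization leaves every rank with the full summed gradient, whereas the ZeRO-style pairing leaves each rank with only its own slice of the sum (ReduceScatter) and later reassembles full parameters from per-rank slices (AllGather).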
Code Reference
Source Location
- Repository: Microsoft_Onnxruntime
- File: orttraining/orttraining/training_ops/cuda/collective/nccl_kernels.cc
- Lines: 1-249
Signature
class NcclAllReduce : public NcclKernel {
 public:
  NcclAllReduce(const OpKernelInfo& info);
  Status ComputeInternal(OpKernelContext* context) const;
};
class NcclAllGather : public NcclKernel {
 public:
  NcclAllGather(const OpKernelInfo& info);
  Status ComputeInternal(OpKernelContext* context) const;
};
class NcclReduceScatter : public NcclKernel {
 public:
  NcclReduceScatter(const OpKernelInfo& info);
  Status ComputeInternal(OpKernelContext* context) const;
};
Import
#include "orttraining/training_ops/cuda/collective/nccl_kernels.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| tensors | Tensor(T)... | Yes | Variadic input tensors (IEEE float types, contiguously allocated) |
Outputs
| Name | Type | Description |
|---|---|---|
| output_tensors | Tensor(T)... | Reduced, gathered, or scattered output tensors (one per input, each aliased to its corresponding input) |
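Because the inputs are allocated contiguously, NcclAllReduce can treat them as one buffer that runs from the first tensor to the last. The sketch below shows how such a byte range could be computed on the host. `TensorSpan`, `ContiguousByteRange`, and `ElementCount` are hypothetical names introduced for illustration, and plain integer addresses stand in for device pointers; this is not the kernel's actual code.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for one tensor's allocation.
struct TensorSpan {
  std::uintptr_t begin;    // address of the first byte
  std::size_t size_bytes;  // payload size in bytes
};

// Bytes from the start of the first tensor to the end of the last tensor,
// including any padding the allocator left between neighbouring tensors.
inline std::size_t ContiguousByteRange(const std::vector<TensorSpan>& tensors) {
  if (tensors.empty()) throw std::invalid_argument("need at least one tensor");
  for (std::size_t i = 1; i < tensors.size(); ++i) {
    // Tensors must be laid out in order without overlapping.
    if (tensors[i].begin < tensors[i - 1].begin + tensors[i - 1].size_bytes) {
      throw std::invalid_argument("tensors are not laid out in order");
    }
  }
  const TensorSpan& first = tensors.front();
  const TensorSpan& last = tensors.back();
  return static_cast<std::size_t>(last.begin + last.size_bytes - first.begin);
}

// Element count for a single collective call covering the whole range.
inline std::size_t ElementCount(std::size_t range_bytes, std::size_t element_size) {
  if (element_size == 0 || range_bytes % element_size != 0) {
    throw std::invalid_argument("range is not a whole number of elements");
  }
  return range_bytes / element_size;
}
```

For example, a 100-byte tensor at address 0x1000 followed by a 60-byte tensor at 0x1100 spans 316 bytes, padding gap included, which corresponds to 79 four-byte elements.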
Usage Examples
ONNX_OPERATOR_KERNEL_EX(
    NcclAllReduce, kMSDomain, 1, kCudaExecutionProvider,
    (*KernelDefBuilder::Create())
        .VariadicAlias(0, 0)
        .AllocateInputsContiguously()
        .TypeConstraint("T", DataTypeImpl::AllIEEEFloatTensorTypes()),
    NcclAllReduce);