Implementation:Microsoft Onnxruntime CUDA NcclKernels

Knowledge Sources	Microsoft_Onnxruntime
Domains	Training, CUDA_Kernels
Last Updated	2026-02-10 04:00 GMT

Overview

Concrete implementation of the NCCL AllReduce, AllGather, and ReduceScatter collective operators in the ONNX Runtime CUDA training framework.

Description

Implements three NCCL collective operators for CUDA:

(1) NcclAllReduce sum-reduces all inputs across ranks, treating them as a single contiguous buffer. It computes the byte range from the start of the first tensor to the end of the last tensor (including any padding gaps between tensors) and calls ncclAllReduce once on that entire range.

(2) NcclAllGather gathers data from all ranks. It pads the total element count so that it aligns with both 32 bytes and the world size, copies each rank's slice into a fusion buffer, calls ncclAllGather, then copies the results to the output tensors.

(3) NcclReduceScatter reduces data across ranks and scatters the result. It pads for alignment in the same way, copies all inputs into a fusion buffer, calls ncclReduceScatter, and copies the relevant slice to the outputs.

All three operators are registered with variadic aliasing and contiguous input allocation, and support all IEEE float tensor types.
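The two size calculations mentioned above can be illustrated with a minimal, host-only sketch. It is not taken from nccl_kernels.cc: the TensorSpan struct and both function names are hypothetical, and the choice of rounding the element count up to a multiple of world_size * 32 is an assumption that is merely one way to satisfy "aligned with 32 bytes and world size" (each rank's slice is then a multiple of 32 elements, hence of 32 bytes for any element size).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of one tensor inside a contiguous allocation.
struct TensorSpan {
  std::uintptr_t address;  // start address of the tensor data
  std::size_t size_bytes;  // size of the tensor data in bytes
};

// Byte length of the range from the start of the first tensor to the end of
// the last tensor, including any padding gaps between tensors.
// Assumes the spans are non-empty and ordered by address.
inline std::size_t ContiguousByteRange(const std::vector<TensorSpan>& spans) {
  assert(!spans.empty());
  const TensorSpan& first = spans.front();
  const TensorSpan& last = spans.back();
  assert(last.address >= first.address);
  return static_cast<std::size_t>(last.address - first.address) + last.size_bytes;
}

// Rounds total_count up to the next multiple of (world_size * 32) elements.
// ASSUMPTION: this alignment rule is illustrative, not the kernel's exact rule.
inline std::int64_t PaddedElementCount(std::int64_t total_count,
                                       std::int64_t world_size) {
  assert(total_count >= 0 && world_size > 0);
  const std::int64_t alignment = world_size * 32;
  return (total_count + alignment - 1) / alignment * alignment;
}
```

For example, two tensors of 64 and 32 bytes separated by a 24-byte gap span 120 bytes in total, and 100 elements on 4 ranks would be padded to 128 elements under the assumed rule, so each rank handles a 32-element slice.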

Usage

Used during distributed training for gradient synchronization (AllReduce), parameter gathering (AllGather in ZeRO), and gradient scattering (ReduceScatter in ZeRO).
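To make the roles of the three collectives concrete, the following CPU-only sketch simulates their semantics on plain vectors, with one vector per rank. It does not call NCCL or ONNX Runtime; the Buffer alias and the Sim* function names are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Buffer = std::vector<float>;

// All-reduce (sum): every rank ends up with the element-wise sum of all ranks.
inline std::vector<Buffer> SimAllReduceSum(const std::vector<Buffer>& per_rank) {
  assert(!per_rank.empty());
  Buffer sum(per_rank[0].size(), 0.0f);
  for (const Buffer& b : per_rank) {
    assert(b.size() == sum.size());
    for (std::size_t i = 0; i < b.size(); ++i) sum[i] += b[i];
  }
  return std::vector<Buffer>(per_rank.size(), sum);
}

// All-gather: every rank ends up with all ranks' slices concatenated in rank order.
inline std::vector<Buffer> SimAllGather(const std::vector<Buffer>& per_rank) {
  assert(!per_rank.empty());
  Buffer gathered;
  for (const Buffer& b : per_rank) {
    assert(b.size() == per_rank[0].size());
    gathered.insert(gathered.end(), b.begin(), b.end());
  }
  return std::vector<Buffer>(per_rank.size(), gathered);
}

// Reduce-scatter (sum): rank r ends up with the r-th equal slice of the sum.
inline std::vector<Buffer> SimReduceScatterSum(const std::vector<Buffer>& per_rank) {
  assert(!per_rank.empty());
  const std::size_t world = per_rank.size();
  const std::size_t n = per_rank[0].size();
  assert(n % world == 0);  // equal slices require a count divisible by world size
  const Buffer sum = SimAllReduceSum(per_rank)[0];
  const std::ptrdiff_t slice = static_cast<std::ptrdiff_t>(n / world);
  std::vector<Buffer> out;
  for (std::size_t r = 0; r < world; ++r) {
    const std::ptrdiff_t begin = static_cast<std::ptrdiff_t>(r) * slice;
    out.emplace_back(sum.begin() + begin, sum.begin() + begin + slice);
  }
  return out;
}
```

The divisibility assertion in SimReduceScatterSum mirrors why the real kernels pad the element count: every rank must send and receive the same amount of data.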

Code Reference

Source Location

Repository: Microsoft_Onnxruntime
File: orttraining/orttraining/training_ops/cuda/collective/nccl_kernels.cc
Lines: 1-249

Signature

class NcclAllReduce : public NcclKernel {
 public:
  NcclAllReduce(const OpKernelInfo& info);
  Status ComputeInternal(OpKernelContext* context) const override;
};

class NcclAllGather : public NcclKernel {
 public:
  NcclAllGather(const OpKernelInfo& info);
  Status ComputeInternal(OpKernelContext* context) const override;
};

class NcclReduceScatter : public NcclKernel {
 public:
  NcclReduceScatter(const OpKernelInfo& info);
  Status ComputeInternal(OpKernelContext* context) const override;
};

Import

#include "orttraining/training_ops/cuda/collective/nccl_kernels.h"

I/O Contract

Inputs

Name	Type	Required	Description
tensors	Tensor(T)...	Yes	Variadic input tensors (IEEE float types, contiguously allocated)

Outputs

Name	Type	Description
output_tensors	Tensor(T)...	Reduced/gathered/scattered output tensors (one per input; each output aliases its corresponding input)

Usage Examples

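// Registers NcclAllReduce as a CUDA kernel in the Microsoft (kMSDomain) domain, opset version 1.
ONNX_OPERATOR_KERNEL_EX(
    NcclAllReduce, kMSDomain, 1, kCudaExecutionProvider,
    (*KernelDefBuilder::Create())
        .VariadicAlias(0, 0)           // each output aliases the matching input
        .AllocateInputsContiguously()  // inputs share one contiguous allocation
        .TypeConstraint("T", DataTypeImpl::AllIEEEFloatTensorTypes()),
    NcclAllReduce);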

Related Pages

Environment:Microsoft_Onnxruntime_CUDA_GPU_Environment