Implementation:Microsoft Onnxruntime CUDA NcclKernels

From Leeroopedia
Revision as of 15:45, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Microsoft_Onnxruntime_CUDA_NcclKernels.md)


Knowledge Sources
Domains Training, CUDA_Kernels
Last Updated 2026-02-10 04:00 GMT

Overview

CUDA kernels that implement the NCCL AllReduce, AllGather, and ReduceScatter collective operations for ONNX Runtime training.

Description

Implements three NCCL collective operators for CUDA:

(1) NcclAllReduce performs an all-reduce sum over all inputs, treating them as a single contiguous buffer. It computes the total byte range from the first tensor to the last (including padding gaps) and calls ncclAllReduce on the entire buffer.

(2) NcclAllGather gathers data from all ranks. It pads the total element count so that the buffer is aligned to 32 bytes and to the world size, copies each rank's slice into a fusion buffer, calls ncclAllGather, then copies the results to the output tensors.

(3) NcclReduceScatter reduces data across ranks and scatters the result. It pads for alignment in the same way, copies all inputs into a fusion buffer, calls ncclReduceScatter, and copies the relevant slice to the outputs.

All three operators use variadic aliasing and contiguous input allocation, and support all IEEE float types.
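The padding step used by NcclAllGather and NcclReduceScatter can be sketched as below. This is a minimal illustration, not the ONNX Runtime code: the helper names are hypothetical, and rounding up to a multiple of `world_size * 32` elements is an assumption chosen so that the padded count divides evenly across ranks and each rank's slice stays 32-byte aligned for any element size.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: round `total_count` elements up to the next multiple of
// `world_size * 32`. The result divides evenly across ranks, and each rank's
// slice is a multiple of 32 elements (so a multiple of 32 bytes for any
// element size). The choice of multiple is an assumption for illustration.
inline int64_t PaddedElementCount(int64_t total_count, int64_t world_size) {
  assert(total_count >= 0 && world_size > 0);
  const int64_t alignment = world_size * 32;
  return (total_count + alignment - 1) / alignment * alignment;
}

// Hypothetical helper: number of elements each rank owns in the padded buffer.
inline int64_t ElementsPerRank(int64_t total_count, int64_t world_size) {
  return PaddedElementCount(total_count, world_size) / world_size;
}

// Hypothetical helper: element offset of `rank`'s slice inside the fusion buffer.
inline int64_t RankSliceOffset(int64_t total_count, int64_t world_size, int64_t rank) {
  assert(rank >= 0 && rank < world_size);
  return rank * ElementsPerRank(total_count, world_size);
}
```

With 1000 elements and 4 ranks, this sketch pads to 1024 elements, so each rank owns a 256-element slice and rank 3 starts at offset 768.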

Usage

Used during distributed training for gradient synchronization (AllReduce), parameter gathering (AllGather in ZeRO), and gradient scattering (ReduceScatter in ZeRO).
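To make the three usage patterns concrete, the following CPU-only sketch simulates what each collective leaves on every rank. It does not call NCCL; the `Simulate*` functions and the `Buffer` alias are hypothetical names introduced for illustration, and each inner vector stands in for one rank's buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Buffer = std::vector<float>;  // one rank's data (hypothetical alias)

// AllReduce (sum): every rank ends up with the element-wise sum of all buffers.
inline std::vector<Buffer> SimulateAllReduceSum(const std::vector<Buffer>& per_rank) {
  assert(!per_rank.empty());
  Buffer sum(per_rank[0].size(), 0.0f);
  for (const Buffer& b : per_rank) {
    assert(b.size() == sum.size());
    for (std::size_t i = 0; i < b.size(); ++i) sum[i] += b[i];
  }
  return std::vector<Buffer>(per_rank.size(), sum);
}

// AllGather: every rank ends up with all ranks' slices concatenated in rank order.
inline std::vector<Buffer> SimulateAllGather(const std::vector<Buffer>& per_rank) {
  Buffer gathered;
  for (const Buffer& b : per_rank) gathered.insert(gathered.end(), b.begin(), b.end());
  return std::vector<Buffer>(per_rank.size(), gathered);
}

// ReduceScatter (sum): rank r ends up with the r-th equal slice of the sum.
inline std::vector<Buffer> SimulateReduceScatterSum(const std::vector<Buffer>& per_rank) {
  const std::size_t world_size = per_rank.size();
  const Buffer sum = SimulateAllReduceSum(per_rank)[0];
  assert(sum.size() % world_size == 0);  // padding guarantees this in practice
  const std::size_t slice = sum.size() / world_size;
  std::vector<Buffer> out;
  for (std::size_t r = 0; r < world_size; ++r) {
    out.emplace_back(sum.begin() + r * slice, sum.begin() + (r + 1) * slice);
  }
  return out;
}
```

In ZeRO-style training, the ReduceScatter pattern leaves each rank with only its own shard of the summed gradients, and the AllGather pattern later rebuilds the full parameter set from the per-rank shards.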

Code Reference

Source Location

Signature

class NcclAllReduce : public NcclKernel {
 public:
  NcclAllReduce(const OpKernelInfo& info);
  Status ComputeInternal(OpKernelContext* context) const override;
};

class NcclAllGather : public NcclKernel {
 public:
  NcclAllGather(const OpKernelInfo& info);
  Status ComputeInternal(OpKernelContext* context) const override;
};

class NcclReduceScatter : public NcclKernel {
 public:
  NcclReduceScatter(const OpKernelInfo& info);
  Status ComputeInternal(OpKernelContext* context) const override;
};

Import

#include "orttraining/training_ops/cuda/collective/nccl_kernels.h"

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| tensors | Tensor(T)... | Yes | Variadic input tensors (IEEE float types, contiguously allocated) |

Outputs

| Name | Type | Description |
|------|------|-------------|
| output_tensors | Tensor(T)... | Reduced/gathered/scattered output tensors (one per input, aliased) |
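Because the inputs are allocated contiguously, NcclAllReduce can treat them as one buffer spanning from the first tensor to the end of the last. The sketch below shows that byte-range computation under stated assumptions: `TensorSpan` and `ContiguousByteRange` are hypothetical names, addresses are modeled as plain integers, and the tensors are assumed to be ordered by ascending address.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a tensor's allocation: start address and byte size.
struct TensorSpan {
  std::uintptr_t address;
  std::size_t size_in_bytes;
};

// Bytes from the start of the first tensor to the end of the last one,
// including any padding gaps the allocator left between tensors.
// Assumes `tensors` is non-empty and ordered by ascending address.
inline std::size_t ContiguousByteRange(const std::vector<TensorSpan>& tensors) {
  assert(!tensors.empty());
  const TensorSpan& first = tensors.front();
  const TensorSpan& last = tensors.back();
  assert(last.address >= first.address);
  return static_cast<std::size_t>(last.address - first.address) + last.size_in_bytes;
}
```

For three tensors at 0x1000 (100 bytes), 0x1100 (40 bytes), and 0x1200 (16 bytes), the range is 528 bytes, which is more than the 156 bytes of payload because the gaps between tensors are included in the reduction.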

Usage Examples

ONNX_OPERATOR_KERNEL_EX(
    NcclAllReduce, kMSDomain, 1, kCudaExecutionProvider,
    (*KernelDefBuilder::Create())
        .VariadicAlias(0, 0)
        .AllocateInputsContiguously()
        .TypeConstraint("T", DataTypeImpl::AllIEEEFloatTensorTypes()),
    NcclAllReduce);

Related Pages
