Implementation:Microsoft Onnxruntime CUDA Send

Knowledge Sources	Microsoft_Onnxruntime
Domains	Training, CUDA_Kernels
Last Updated	2026-02-10 04:00 GMT

Overview

Concrete tool for sending tensors to a remote process via NCCL or MPI in the ONNX Runtime CUDA training framework.

Description

Implements the CUDA Send operator, which transmits one or more tensors to a specified destination rank during distributed training. The implementation first packs the input tensors into a single aligned buffer. With NCCL P2P, the data is copied device-to-device into a scratch buffer and sent via NcclService; with MPI, it is copied from the GPU into pinned CPU memory and sent via MPI_Send. When the receiver cannot statically infer the tensor shapes, shape metadata is sent first in a separate MPI call. Sending to the process's own rank is explicitly prevented. NVTX profiling annotations mark the pre-send preparation, memory copy, and send phases.
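
The aggregation step can be illustrated with a minimal sketch of how per-tensor byte offsets and the total aligned buffer size might be computed. The names `AlignUp`, `PackedLayout`, and `ComputePackedLayout`, and the 256-byte alignment, are assumptions made for illustration and are not taken from send.cc. The results play the role of the `tensor_offsets_in_bytes` and `aggregated_aligned_tensor_bytes` arguments of `SendData`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: round `offset` up to the next multiple of `alignment`.
inline std::size_t AlignUp(std::size_t offset, std::size_t alignment) {
  return (offset + alignment - 1) / alignment * alignment;
}

// Hypothetical result type: where each tensor starts in the packed buffer,
// and how large the whole buffer must be.
struct PackedLayout {
  std::vector<std::size_t> offsets;
  std::size_t total_bytes;
};

// Lay tensors out back to back, starting each one on an aligned boundary.
// The 256-byte default is an assumed value, not the one used by the kernel.
inline PackedLayout ComputePackedLayout(const std::vector<std::size_t>& sizes_in_bytes,
                                        std::size_t alignment = 256) {
  PackedLayout layout{{}, 0};
  layout.offsets.reserve(sizes_in_bytes.size());
  std::size_t cursor = 0;
  for (std::size_t size : sizes_in_bytes) {
    cursor = AlignUp(cursor, alignment);  // start of this tensor
    layout.offsets.push_back(cursor);
    cursor += size;                       // end of this tensor
  }
  layout.total_bytes = AlignUp(cursor, alignment);
  return layout;
}
```

Packing everything into one buffer means a single transfer per Send invocation regardless of how many tensors are passed, at the cost of one extra copy per tensor.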

Usage

Invoked during distributed training with pipeline parallelism or model parallelism, when a GPU worker needs to send activation tensors or gradients to another worker.

Code Reference

Source Location

Repository: Microsoft_Onnxruntime
File: orttraining/orttraining/training_ops/cuda/communication/send.cc
Lines: 1-229

Signature

class Send : public CudaKernel {
  void SendData(OpKernelContext* ctx, const int dst, const int num_tensors,
                size_t aggregated_aligned_tensor_bytes,
                std::vector<size_t> tensor_offsets_in_bytes,
                std::vector<size_t> tensor_sizes_in_bytes) const;
  Status ComputeInternal(OpKernelContext* ctx) const;
};

Import

#include "orttraining/training_ops/cuda/communication/send.h"

I/O Contract

Inputs

Name	Type	Required	Description
input_signal	Tensor(bool)	Yes	Control signal that must be true to proceed (CPU memory)
remote_rank	Tensor(int64_t)	Yes	Rank of the destination process (CPU memory)
tensors	Tensor(V)...	Yes	One or more tensors to send (GPU memory)

Outputs

Name	Type	Description
output_signal	Tensor(bool)	Set to true after send completes (CPU memory)
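
The contract above can be summarized as a small pre-flight check. This is a sketch only: `CheckSendRequest` and its parameters are hypothetical names, and the way the real kernel reports violations is not shown on this page. It encodes three conditions stated here: the control signal must be true, the destination must differ from the local rank, and at least one tensor must be supplied.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical validation helper mirroring the I/O contract.
// Returns an empty string when the request is acceptable, otherwise a
// human-readable reason. The real kernel's error handling may differ.
inline std::string CheckSendRequest(bool input_signal,
                                    std::int64_t remote_rank,
                                    std::int64_t local_rank,
                                    std::size_t num_tensors) {
  if (!input_signal) {
    return "input_signal must be true";
  }
  if (remote_rank == local_rank) {
    return "cannot send to the local rank";
  }
  if (num_tensors == 0) {
    return "at least one tensor is required";
  }
  return "";
}
```

A caller would run such a check before packing and sending, and set `output_signal` to true only after the send has completed.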

Usage Examples

// Kernel registration
ONNX_OPERATOR_KERNEL_EX(
    Send, kMSDomain, 1, kCudaExecutionProvider,
    (*KernelDefBuilder::Create())
        .InputMemoryType(OrtMemTypeCPUInput, 0)
        .InputMemoryType(OrtMemTypeCPUInput, 1)
        .OutputMemoryType(OrtMemTypeCPUOutput, 0)
        .TypeConstraint("TBool", DataTypeImpl::GetTensorType<bool>())
        .TypeConstraint("TInt64", DataTypeImpl::GetTensorType<int64_t>())
        .TypeConstraint("V", DataTypeImpl::AllFixedSizeTensorTypes()),
    Send);

Related Pages

Environment:Microsoft_Onnxruntime_CUDA_GPU_Environment
