Implementation:Microsoft Onnxruntime CUDA Send
| Knowledge Sources | |
|---|---|
| Domains | Training, CUDA_Kernels |
| Last Updated | 2026-02-10 04:00 GMT |
Overview
Concrete implementation of the Send operator, which sends tensors to a remote process via NCCL or MPI in the ONNX Runtime CUDA training framework.
Description
Implements the Send operator for CUDA, which transmits one or more tensors to a specified destination rank during distributed training. The implementation first aggregates the input tensors into a single aligned buffer. On the NCCL P2P path, data is copied device-to-device into a scratch buffer and sent via NcclService; on the MPI path, data is copied from GPU memory to pinned CPU memory and sent via MPI_Send. When the receiver cannot statically infer the tensor shapes, shape metadata is sent first in a separate MPI call. Sending to the sender's own rank is explicitly rejected. NVTX profiling annotations mark the pre-send preparation, memory copy, and send phases.
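The aggregation step described above can be illustrated with a small host-side sketch. It is not the code from send.cc: the names PackedLayout, AlignUp, ComputeLayout, and Pack are hypothetical, the alignment is a caller-supplied assumption, and std::memcpy stands in for the device-to-device or device-to-host copy the kernel would perform.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper types and names for illustration; not the ONNX Runtime API.
struct PackedLayout {
  std::vector<std::size_t> offsets;  // start of each tensor within the buffer
  std::size_t total_bytes = 0;       // size of the aggregated aligned buffer
};

// Round `value` up to the next multiple of `alignment` (alignment must be > 0).
inline std::size_t AlignUp(std::size_t value, std::size_t alignment) {
  return (value + alignment - 1) / alignment * alignment;
}

// Assign each tensor an aligned offset, in input order.
inline PackedLayout ComputeLayout(const std::vector<std::size_t>& sizes_in_bytes,
                                  std::size_t alignment) {
  PackedLayout layout;
  for (std::size_t size : sizes_in_bytes) {
    layout.offsets.push_back(layout.total_bytes);
    layout.total_bytes = AlignUp(layout.total_bytes + size, alignment);
  }
  return layout;
}

// Copy every tensor into one contiguous buffer at its assigned offset.
// Padding bytes between tensors are left zero-initialized.
inline std::vector<unsigned char> Pack(const std::vector<const void*>& tensors,
                                       const std::vector<std::size_t>& sizes_in_bytes,
                                       const PackedLayout& layout) {
  std::vector<unsigned char> buffer(layout.total_bytes, 0);
  for (std::size_t i = 0; i < tensors.size(); ++i) {
    if (sizes_in_bytes[i] == 0) continue;  // nothing to copy for empty tensors
    std::memcpy(buffer.data() + layout.offsets[i], tensors[i], sizes_in_bytes[i]);
  }
  return buffer;
}
```

With an assumed alignment of 256 bytes, tensors of 100, 300, and 8 bytes would be placed at offsets 0, 256, and 768 in a 1024-byte buffer. Packing everything into one buffer means the transport only has to issue a single send per Send invocation, regardless of how many tensors are passed.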
Usage
Invoked during distributed training with pipeline parallelism or model parallelism, when a GPU worker needs to send activation tensors or gradients to another worker.
Code Reference
Source Location
- Repository: Microsoft_Onnxruntime
- File: orttraining/orttraining/training_ops/cuda/communication/send.cc
- Lines: 1-229
Signature
class Send : public CudaKernel {
  void SendData(OpKernelContext* ctx, const int dst, const int num_tensors,
                size_t aggregated_aligned_tensor_bytes,
                std::vector<size_t> tensor_offsets_in_bytes,
                std::vector<size_t> tensor_sizes_in_bytes) const;
  Status ComputeInternal(OpKernelContext* ctx) const;
};
Import
#include "orttraining/training_ops/cuda/communication/send.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input_signal | Tensor(bool) | Yes | Control signal that must be true to proceed (CPU memory) |
| remote_rank | Tensor(int64_t) | Yes | Rank of the destination process (CPU memory) |
| tensors | Tensor(V)... | Yes | One or more tensors to send (GPU memory) |
Outputs
| Name | Type | Description |
|---|---|---|
| output_signal | Tensor(bool) | Set to true after send completes (CPU memory) |
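The preconditions implied by this contract (a true control signal and a destination rank that differs from the sender's own rank) can be sketched as a standalone check. This is an illustrative sketch only: CheckResult and CheckSendPreconditions are hypothetical names, and the real kernel reports failures through ONNX Runtime's own status and enforcement mechanisms rather than this struct.

```cpp
#include <cstdint>
#include <string>

// Hypothetical result type for illustration; ONNX Runtime uses its own Status class.
struct CheckResult {
  bool ok;
  std::string message;
};

// Validate the two CPU-side inputs before any tensor data is touched.
// `local_rank` is an assumed parameter holding the sender's own rank.
inline CheckResult CheckSendPreconditions(bool input_signal,
                                          std::int64_t remote_rank,
                                          std::int64_t local_rank) {
  if (!input_signal) {
    return {false, "input_signal must be true"};
  }
  if (remote_rank == local_rank) {
    return {false, "cannot send to the local rank"};
  }
  return {true, ""};
}
```

Because input_signal and remote_rank live in CPU memory, a check of this kind can run on the host without a device synchronization.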
Usage Examples
// Kernel registration
ONNX_OPERATOR_KERNEL_EX(
    Send, kMSDomain, 1, kCudaExecutionProvider,
    (*KernelDefBuilder::Create())
        .InputMemoryType(OrtMemTypeCPUInput, 0)
        .InputMemoryType(OrtMemTypeCPUInput, 1)
        .OutputMemoryType(OrtMemTypeCPUOutput, 0)
        .TypeConstraint("TBool", DataTypeImpl::GetTensorType<bool>())
        .TypeConstraint("TInt64", DataTypeImpl::GetTensorType<int64_t>())
        .TypeConstraint("V", DataTypeImpl::AllFixedSizeTensorTypes()),
    Send);