Implementation:Microsoft Onnxruntime CUDA Recv
| Knowledge Sources | |
|---|---|
| Domains | Training, CUDA_Kernels |
| Last Updated | 2026-02-10 04:00 GMT |
Overview
Concrete tool for receiving tensors from a remote process via NCCL or MPI in the ONNX Runtime CUDA training framework.
Description
Implements the CUDA Recv operator, which receives one or more tensors from a specified source rank during distributed training. The implementation supports two communication backends: NCCL P2P (device-to-device) and MPI (host-mediated). With NCCL P2P, data is received directly into a GPU scratch buffer via the NcclService; with MPI, data is received into pinned CPU memory and then copied to the GPU. The operator handles both statically inferred and dynamically received tensor shapes: when output shapes cannot be inferred ahead of time, shape metadata is first received from the source process in a separate MPI call, before the main data transfer. Tensor data arrives in a single aggregated, aligned buffer and is then copied out to the individual output tensors. NVTX profiling ranges mark the receive steps for performance analysis.
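The aggregated, aligned buffer described above can be illustrated with a small host-only sketch. This is not the kernel's code: the names `kAlignment`, `AlignUp`, `PackedLayout`, `ComputeLayout`, and `Scatter` are hypothetical, the 256-byte alignment is an assumed value, and `std::memcpy` stands in for whatever device copy the real operator performs.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical alignment; the value used by the real kernel is not stated here.
constexpr std::size_t kAlignment = 256;

// Round n up to the next multiple of kAlignment.
inline std::size_t AlignUp(std::size_t n) {
  return (n + kAlignment - 1) / kAlignment * kAlignment;
}

struct PackedLayout {
  std::vector<std::size_t> offsets;  // start of each tensor inside the buffer
  std::size_t total_bytes = 0;       // size of the aggregated buffer
};

// Compute where each tensor starts when tensors are packed back to back,
// with every tensor padded out to an aligned size.
inline PackedLayout ComputeLayout(const std::vector<std::size_t>& tensor_bytes) {
  PackedLayout layout;
  layout.offsets.reserve(tensor_bytes.size());
  for (std::size_t bytes : tensor_bytes) {
    layout.offsets.push_back(layout.total_bytes);
    layout.total_bytes += AlignUp(bytes);
  }
  return layout;
}

// Copy each tensor's payload out of the aggregated buffer into its own output.
// Host memcpy is a stand-in for a device-side copy.
inline void Scatter(const char* buffer, const PackedLayout& layout,
                    const std::vector<std::size_t>& tensor_bytes,
                    const std::vector<char*>& outputs) {
  for (std::size_t i = 0; i < outputs.size(); ++i) {
    std::memcpy(outputs[i], buffer + layout.offsets[i], tensor_bytes[i]);
  }
}
```

Under these assumptions, one receive call can fill a buffer of `total_bytes`, after which `Scatter` splits it per tensor. Sender and receiver would need to agree on the same alignment and tensor order for the offsets to match.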
Usage
Invoked during distributed training with pipeline parallelism or model parallelism, when a GPU worker needs to receive activation tensors or gradients from another worker.
Code Reference
Source Location
- Repository: Microsoft_Onnxruntime
- File: orttraining/orttraining/training_ops/cuda/communication/recv.cc
- Lines: 1-278
Signature
class Recv : public CudaKernel {
  void ReceiveData(const int num_tensors, std::vector<Tensor*> received_tensors,
                   const int src, const size_t aggregated_aligned_tensor_bytes,
                   OpKernelContext* context, IAllocatorUniquePtr<char>& buffer) const;
  Status ComputeInternal(OpKernelContext* ctx) const;
};
Import
#include "orttraining/training_ops/cuda/communication/recv.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input_signal | Tensor(bool) | Yes | Control signal that must be true to proceed (CPU memory) |
| remote_rank | Tensor(int64_t) | Yes | Rank of the source process to receive from (CPU memory) |
Outputs
| Name | Type | Description |
|---|---|---|
| output_signal | Tensor(bool) | Set to true after receive completes (CPU memory) |
| received_tensors | Tensor(V)... | One or more received tensors on GPU |
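The control inputs in the contract above can be modeled with a short host-side sketch. It only illustrates the stated rules (the signal must be true, and the rank identifies the source process). `RecvControl`, `CheckedSourceRank`, and the `world_size` bound check are hypothetical names and assumptions, not part of the operator's API.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical host-side view of the two CPU-resident inputs.
struct RecvControl {
  bool input_signal;         // must be true before receiving
  std::int64_t remote_rank;  // rank of the sending process
};

// Validate the control inputs and return the source rank to receive from.
// world_size is an assumed parameter used only for this sketch's range check.
inline std::int64_t CheckedSourceRank(const RecvControl& control,
                                      std::int64_t world_size) {
  if (!control.input_signal) {
    throw std::runtime_error("Recv: input_signal must be true");
  }
  if (control.remote_rank < 0 || control.remote_rank >= world_size) {
    throw std::out_of_range("Recv: remote_rank is outside [0, world_size)");
  }
  return control.remote_rank;
}
```

In this model, the caller would set output_signal to true only after the receive for the returned rank has completed, mirroring the output contract.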
Usage Examples
// Kernel registration
ONNX_OPERATOR_KERNEL_EX(
    Recv, kMSDomain, 1, kCudaExecutionProvider,
    (*KernelDefBuilder::Create())
        .InputMemoryType(OrtMemTypeCPUInput, 0)    // input_signal lives in CPU memory
        .InputMemoryType(OrtMemTypeCPUInput, 1)    // remote_rank lives in CPU memory
        .OutputMemoryType(OrtMemTypeCPUOutput, 0)  // output_signal lives in CPU memory
        .TypeConstraint("TBool", DataTypeImpl::GetTensorType<bool>())
        .TypeConstraint("TInt64", DataTypeImpl::GetTensorType<int64_t>())
        .TypeConstraint("V", DataTypeImpl::AllFixedSizeTensorTypes()),
    Recv);