Implementation:Microsoft Onnxruntime CUDA Recv

Knowledge Sources	Microsoft_Onnxruntime
Domains	Training, CUDA_Kernels
Last Updated	2026-02-10 04:00 GMT

Overview

Concrete tool for receiving tensors from a remote process via NCCL or MPI in the ONNX Runtime CUDA training framework.

Description

Implements the Recv operator for CUDA, which receives one or more tensors from a specified source rank during distributed training. The implementation supports two communication backends: NCCL P2P (device-to-device) and MPI (host-mediated). With NCCL P2P, data is received directly into a GPU scratch buffer via the NcclService; with MPI, data is received into pinned CPU memory and then copied to the GPU. The operator handles both statically inferred and dynamically received tensor shapes: when output shapes cannot be inferred, shape metadata is received from the source process in a separate MPI call before the main data transfer. All tensors arrive packed in a single aligned buffer and are then copied out to the individual output tensors. NVTX profiling ranges mark the receive steps for performance analysis.
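
The aggregate-then-distribute step described above can be illustrated with a small host-only sketch. It is not the code from recv.cc: the names AlignUp, PackedLayout, ComputePackedLayout and ScatterFromBuffer are hypothetical, the alignment value is left as a caller-supplied parameter, and std::memcpy stands in for the device copy the real kernel would perform. It also assumes that no padding is added after the last tensor, which the actual implementation may handle differently.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper (not taken from recv.cc): round `offset` up to the
// next multiple of `alignment`. `alignment` must be non-zero.
inline std::size_t AlignUp(std::size_t offset, std::size_t alignment) {
  return (offset + alignment - 1) / alignment * alignment;
}

// Where each tensor starts inside the aggregated buffer, plus the total size.
struct PackedLayout {
  std::vector<std::size_t> offsets;  // start offset of tensor i, in bytes
  std::size_t total_bytes = 0;       // bytes needed for the whole buffer
};

// Place tensors back to back, starting each one on an aligned boundary.
// Assumption: no trailing padding is added after the last tensor.
inline PackedLayout ComputePackedLayout(const std::vector<std::size_t>& tensor_bytes,
                                        std::size_t alignment) {
  PackedLayout layout;
  for (std::size_t bytes : tensor_bytes) {
    const std::size_t start = AlignUp(layout.total_bytes, alignment);
    layout.offsets.push_back(start);
    layout.total_bytes = start + bytes;
  }
  return layout;
}

// Copy each tensor's bytes out of the aggregated buffer into its own output.
// Host-side stand-in: the real kernel would issue device-side copies instead.
inline void ScatterFromBuffer(const char* buffer, const PackedLayout& layout,
                              const std::vector<std::size_t>& tensor_bytes,
                              const std::vector<char*>& outputs) {
  for (std::size_t i = 0; i < outputs.size(); ++i) {
    std::memcpy(outputs[i], buffer + layout.offsets[i], tensor_bytes[i]);
  }
}
```

With tensor sizes of 100, 300 and 8 bytes and an assumed alignment of 256, this layout places the tensors at offsets 0, 256 and 768, for a total of 776 bytes. Packing everything into one buffer lets the receiver issue a single transfer per Recv call rather than one per tensor.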

Usage

Invoked during distributed training with pipeline parallelism or model parallelism, when a GPU worker needs to receive activation tensors or gradients from another worker.

Code Reference

Source Location

Repository: Microsoft_Onnxruntime
File: orttraining/orttraining/training_ops/cuda/communication/recv.cc
Lines: 1-278

Signature

class Recv : public CudaKernel {
  void ReceiveData(const int num_tensors, std::vector<Tensor*> received_tensors,
                   const int src, const size_t aggregated_aligned_tensor_bytes,
                   OpKernelContext* context, IAllocatorUniquePtr<char>& buffer) const;
  Status ComputeInternal(OpKernelContext* ctx) const;
};

Import

#include "orttraining/training_ops/cuda/communication/recv.h"

I/O Contract

Inputs

Name	Type	Required	Description
input_signal	Tensor(bool)	Yes	Control signal that must be true to proceed (CPU memory)
remote_rank	Tensor(int64_t)	Yes	Rank of the source process to receive from (CPU memory)

Outputs

Name	Type	Description
output_signal	Tensor(bool)	Set to true after receive completes (CPU memory)
received_tensors	Tensor(V)...	One or more received tensors on GPU

Usage Examples

// Kernel registration
ONNX_OPERATOR_KERNEL_EX(
    Recv, kMSDomain, 1, kCudaExecutionProvider,
    (*KernelDefBuilder::Create())
        .InputMemoryType(OrtMemTypeCPUInput, 0)
        .InputMemoryType(OrtMemTypeCPUInput, 1)
        .OutputMemoryType(OrtMemTypeCPUOutput, 0)
        .TypeConstraint("TBool", DataTypeImpl::GetTensorType<bool>())
        .TypeConstraint("TInt64", DataTypeImpl::GetTensorType<int64_t>())
        .TypeConstraint("V", DataTypeImpl::AllFixedSizeTensorTypes()),
    Recv);

Related Pages

Environment:Microsoft_Onnxruntime_CUDA_GPU_Environment
