Implementation:Microsoft Onnxruntime CUDA Recv

From Leeroopedia
Revision as of 15:45, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Microsoft_Onnxruntime_CUDA_Recv.md)


Knowledge Sources
Domains Training, CUDA_Kernels
Last Updated 2026-02-10 04:00 GMT

Overview

Concrete tool for receiving tensors from a remote process via NCCL or MPI in the ONNX Runtime CUDA training framework.

Description

Implements the CUDA Recv operator, which receives one or more tensors from a specified source rank during distributed training. Two communication backends are supported: NCCL P2P (device-to-device) and MPI (host-mediated). With NCCL P2P, data is received directly into GPU scratch buffers through the NcclService; with MPI, data is received into pinned CPU memory and then copied to the GPU. The operator handles both statically inferred and dynamically received tensor shapes: when the shapes cannot be inferred, shape metadata is received from the source process in a separate MPI call before the main data transfer. The received tensors arrive aggregated in a single aligned buffer and are then copied out to the individual output tensors. NVTX profiling ranges mark the operation for performance analysis.
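The aggregate-then-distribute step can be illustrated with a small host-only sketch. It is not the kernel's code: the 256-byte alignment and the names `kAlignment`, `AlignUp`, `PackedLayout`, `ComputeLayout`, and `Scatter` are assumptions made for illustration, and plain `std::vector<char>` stands in for the GPU buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical alignment; the value used by the real kernel is not stated on this page.
constexpr std::size_t kAlignment = 256;

// Round `bytes` up to the next multiple of `alignment`.
inline std::size_t AlignUp(std::size_t bytes, std::size_t alignment = kAlignment) {
  assert(alignment > 0);
  return (bytes + alignment - 1) / alignment * alignment;
}

struct PackedLayout {
  std::vector<std::size_t> offsets;  // start of each tensor inside the aggregated buffer
  std::size_t total_bytes = 0;       // size of the whole aggregated buffer
};

// Place tensors back to back, each one starting on an aligned boundary.
inline PackedLayout ComputeLayout(const std::vector<std::size_t>& tensor_bytes) {
  PackedLayout layout;
  layout.offsets.reserve(tensor_bytes.size());
  std::size_t cursor = 0;
  for (std::size_t bytes : tensor_bytes) {
    layout.offsets.push_back(cursor);
    cursor += AlignUp(bytes);
  }
  layout.total_bytes = cursor;
  return layout;
}

// Copy each tensor's bytes out of the aggregated buffer into its own output.
// In the real operator this would be a device-side copy; memcpy is a stand-in.
inline void Scatter(const std::vector<char>& buffer, const PackedLayout& layout,
                    const std::vector<std::size_t>& tensor_bytes,
                    std::vector<std::vector<char>>& outputs) {
  assert(buffer.size() >= layout.total_bytes);
  outputs.resize(tensor_bytes.size());
  for (std::size_t i = 0; i < tensor_bytes.size(); ++i) {
    outputs[i].resize(tensor_bytes[i]);
    if (tensor_bytes[i] > 0) {
      std::memcpy(outputs[i].data(), buffer.data() + layout.offsets[i], tensor_bytes[i]);
    }
  }
}
```

Under these assumptions, tensors of 100, 256, and 300 bytes would start at offsets 0, 256, and 512 in a 1024-byte buffer. Receiving one aggregated buffer lets the transfer happen in a single call, at the cost of some padding between tensors.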

Usage

Invoked during distributed training with pipeline parallelism or model parallelism, when a GPU worker needs to receive activation tensors or gradients from another worker.

Code Reference

Source Location

Signature

class Recv : public CudaKernel {
  void ReceiveData(const int num_tensors, std::vector<Tensor*> received_tensors,
                   const int src, const size_t aggregated_aligned_tensor_bytes,
                   OpKernelContext* context, IAllocatorUniquePtr<char>& buffer) const;
  Status ComputeInternal(OpKernelContext* ctx) const;
};

Import

#include "orttraining/training_ops/cuda/communication/recv.h"

I/O Contract

Inputs

| Name | Type | Required | Description |
| input_signal | Tensor(bool) | Yes | Control signal that must be true to proceed (CPU memory) |
| remote_rank | Tensor(int64_t) | Yes | Rank of the source process to receive from (CPU memory) |

Outputs

| Name | Type | Description |
| output_signal | Tensor(bool) | Set to true after receive completes (CPU memory) |
| received_tensors | Tensor(V)... | One or more received tensors on GPU |
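Because received_tensors is variadic and its shapes may only be known once the sender reports them, the receiver has to turn the shape metadata into per-tensor shapes before it can size the outputs. The sketch below shows one way to do that. The wire layout (a cumulative dimension count per tensor plus one flat list of dimensions) and the names `DecodeShapes` and `ElementCount` are hypothetical; the page does not specify the actual metadata format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical metadata layout (an assumption, not the documented format):
//   prefix_sizes[i] = total number of dimensions of tensors 0..i
//   flat_dims       = all dimensions of all tensors, concatenated
inline std::vector<std::vector<int64_t>> DecodeShapes(
    const std::vector<std::size_t>& prefix_sizes,
    const std::vector<int64_t>& flat_dims) {
  std::vector<std::vector<int64_t>> shapes;
  shapes.reserve(prefix_sizes.size());
  std::size_t begin = 0;
  for (std::size_t end : prefix_sizes) {
    assert(begin <= end && end <= flat_dims.size());
    shapes.emplace_back(flat_dims.begin() + static_cast<std::ptrdiff_t>(begin),
                        flat_dims.begin() + static_cast<std::ptrdiff_t>(end));
    begin = end;
  }
  return shapes;
}

// Number of elements in a tensor of the given shape; a rank-0 shape yields 1.
inline int64_t ElementCount(const std::vector<int64_t>& shape) {
  int64_t count = 1;
  for (int64_t dim : shape) {
    count *= dim;
  }
  return count;
}
```

Multiplying each element count by the element size gives the byte sizes needed to lay out the aggregated receive buffer.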

Usage Examples

// Kernel registration
ONNX_OPERATOR_KERNEL_EX(
    Recv, kMSDomain, 1, kCudaExecutionProvider,
    (*KernelDefBuilder::Create())
        .InputMemoryType(OrtMemTypeCPUInput, 0)
        .InputMemoryType(OrtMemTypeCPUInput, 1)
        .OutputMemoryType(OrtMemTypeCPUOutput, 0)
        .TypeConstraint("TBool", DataTypeImpl::GetTensorType<bool>())
        .TypeConstraint("TInt64", DataTypeImpl::GetTensorType<int64_t>())
        .TypeConstraint("V", DataTypeImpl::AllFixedSizeTensorTypes()),
    Recv);

Related Pages
