Implementation:Microsoft Onnxruntime CUDA Send

From Leeroopedia


Knowledge Sources
Domains: Training, CUDA_Kernels
Last Updated: 2026-02-10 04:00 GMT

Overview

Concrete tool for sending tensors to a remote process via NCCL or MPI in the ONNX Runtime CUDA training framework.

Description

Implements the Send operator for CUDA that transmits one or more tensors to a specified destination rank during distributed training. The implementation aggregates input tensors into a single aligned buffer before sending. For NCCL P2P, data is copied device-to-device into a scratch buffer and sent via NcclService; for MPI, data is copied from GPU to pinned CPU memory and sent via MPI_Send. When tensor shapes cannot be statically inferred by the receiver, shape metadata is sent first via a separate MPI call. Same-rank communication is explicitly prevented. NVTX profiling annotations track pre-send preparation, memory copy, and send phases.
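The aggregation step described above can be sketched as a small layout computation. This is a minimal, host-only illustration and not the kernel's actual code: the `AggregatedLayout` struct and the `AlignUp` and `ComputeAggregatedLayout` helpers are hypothetical, the field names simply mirror the `SendData` parameters, and the alignment value is left to the caller because the page does not state which one the kernel uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record; field names mirror the SendData parameters.
struct AggregatedLayout {
  std::vector<std::size_t> tensor_offsets_in_bytes;
  std::vector<std::size_t> tensor_sizes_in_bytes;
  std::size_t aggregated_aligned_tensor_bytes = 0;
};

// Round `bytes` up to the next multiple of `alignment` (must be a power of two).
inline std::size_t AlignUp(std::size_t bytes, std::size_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (bytes + alignment - 1) & ~(alignment - 1);
}

// Assign each tensor an aligned offset inside one contiguous buffer.
// `alignment` is an assumption supplied by the caller.
inline AggregatedLayout ComputeAggregatedLayout(
    const std::vector<std::size_t>& sizes_in_bytes, std::size_t alignment) {
  AggregatedLayout layout;
  std::size_t cursor = 0;
  for (std::size_t size : sizes_in_bytes) {
    // Pad the running total so this tensor starts on an aligned boundary.
    cursor = AlignUp(cursor, alignment);
    layout.tensor_offsets_in_bytes.push_back(cursor);
    layout.tensor_sizes_in_bytes.push_back(size);
    cursor += size;
  }
  // Pad the tail so the total buffer size is itself a multiple of `alignment`.
  layout.aggregated_aligned_tensor_bytes = AlignUp(cursor, alignment);
  return layout;
}
```

Computing the offsets once up front lets the sender allocate a single buffer and issue one transfer per destination instead of one per tensor.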

Usage

Invoked during distributed training with pipeline parallelism or model parallelism, when a GPU worker needs to send activation tensors or gradients to another worker.

Code Reference

Source Location

Signature

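class Send : public CudaKernel {
  // Copies the input tensors into one aggregated, aligned buffer and
  // transmits it to rank `dst` (NCCL P2P or MPI, as described above).
  void SendData(OpKernelContext* ctx, const int dst, const int num_tensors,
                size_t aggregated_aligned_tensor_bytes,
                std::vector<size_t> tensor_offsets_in_bytes,
                std::vector<size_t> tensor_sizes_in_bytes) const;
  // Kernel entry point.
  Status ComputeInternal(OpKernelContext* ctx) const;
};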
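To make the `SendData` parameters concrete, the following host-only sketch packs several source buffers into one staging buffer at the given offsets. It is an illustration under stated assumptions, not the kernel's implementation: `PackTensors` is a hypothetical helper, and `std::memcpy` stands in for the device-to-device or GPU-to-pinned-CPU copies that the real operator performs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: copy each tensor's bytes to its offset in one buffer.
// std::memcpy is a stand-in for the CUDA copies used by the real kernel.
inline std::vector<unsigned char> PackTensors(
    const std::vector<const void*>& tensor_data,
    std::size_t aggregated_aligned_tensor_bytes,
    const std::vector<std::size_t>& tensor_offsets_in_bytes,
    const std::vector<std::size_t>& tensor_sizes_in_bytes) {
  assert(tensor_data.size() == tensor_offsets_in_bytes.size());
  assert(tensor_data.size() == tensor_sizes_in_bytes.size());

  // Zero-initialised so that padding bytes between tensors are deterministic.
  std::vector<unsigned char> staging(aggregated_aligned_tensor_bytes, 0);
  for (std::size_t i = 0; i < tensor_data.size(); ++i) {
    const std::size_t offset = tensor_offsets_in_bytes[i];
    const std::size_t size = tensor_sizes_in_bytes[i];
    if (size == 0) continue;  // nothing to copy for empty tensors
    assert(offset + size <= staging.size());
    std::memcpy(staging.data() + offset, tensor_data[i], size);
  }
  return staging;
}
```

A receiver that knows the same offsets and sizes can unpack the buffer by reversing the copies.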

Import

#include "orttraining/training_ops/cuda/communication/send.h"

I/O Contract

Inputs

Name | Type | Required | Description
input_signal | Tensor(bool) | Yes | Control signal that must be true to proceed (CPU memory)
remote_rank | Tensor(int64_t) | Yes | Rank of the destination process (CPU memory)
tensors | Tensor(V)... | Yes | One or more tensors to send (GPU memory)

Outputs

Name | Type | Description
output_signal | Tensor(bool) | Set to true after send completes (CPU memory)
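The contract above can be modeled on the host as a guard around the actual transfer. This is a hedged sketch only: `GuardedSend` and its `transport` callback are hypothetical, and throwing `std::invalid_argument` is an assumption made for illustration (the real kernel's `ComputeInternal` returns a `Status`, and how it reports these failures is not shown on this page).

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <stdexcept>

// Hypothetical host-side model of the Send I/O contract; not the real kernel.
// `transport` stands in for the NCCL or MPI send and receives the destination rank.
// Returns the value that would be written to output_signal.
inline bool GuardedSend(bool input_signal,
                        std::int64_t remote_rank,
                        std::int64_t local_rank,
                        const std::function<void(std::int64_t)>& transport) {
  assert(static_cast<bool>(transport));  // a transport callback must be provided
  if (!input_signal) {
    // Assumption: a false control signal is treated as an error.
    throw std::invalid_argument("input_signal must be true");
  }
  if (remote_rank == local_rank) {
    // Same-rank communication is not allowed.
    throw std::invalid_argument("cannot send to own rank");
  }
  transport(remote_rank);
  return true;  // output_signal: set only after the send has completed
}
```

Because `input_signal`, `remote_rank`, and `output_signal` live in CPU memory, checks of this kind can run on the host without a device synchronization.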

Usage Examples

// Kernel registration
ONNX_OPERATOR_KERNEL_EX(
    Send, kMSDomain, 1, kCudaExecutionProvider,
    (*KernelDefBuilder::Create())
        .InputMemoryType(OrtMemTypeCPUInput, 0)
        .InputMemoryType(OrtMemTypeCPUInput, 1)
        .OutputMemoryType(OrtMemTypeCPUOutput, 0)
        .TypeConstraint("TBool", DataTypeImpl::GetTensorType<bool>())
        .TypeConstraint("TInt64", DataTypeImpl::GetTensorType<int64_t>())
        .TypeConstraint("V", DataTypeImpl::AllFixedSizeTensorTypes()),
    Send);
