Implementation:Microsoft Onnxruntime CUDA NcclService
| Knowledge Sources | |
|---|---|
| Domains | Training, CUDA_Kernels |
| Last Updated | 2026-02-10 04:00 GMT |
Overview
Concrete tool for orchestrating NCCL peer-to-peer communication as a background service in the ONNX Runtime CUDA training framework.
Description
Implements the NcclService singleton class, which runs scheduled NCCL point-to-point (Send/Recv) operations on a background worker thread. The service works in two phases. In the planning phase, communication groups and their tasks are registered via PlanSend/PlanRecv. In the launch phase, a dedicated thread executes each group's tasks as one batched NCCL group call.
Key classes:
- NcclTask: a single Send or Recv operation, described by its peer, buffer pointer, and size.
- NcclTaskGroup: a batch of tasks executed inside a single ncclGroupStart/ncclGroupEnd block.
- NcclService: the singleton service itself.
The service initializes its NCCL communicators using MPI for rank discovery and for broadcasting the NCCL unique ID. Mutexes and condition variables coordinate the compute threads that submit tasks with the background NCCL execution thread. The Reset method makes the planned schedule reusable across gradient accumulation iterations.
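The plan/launch/submit handshake described above can be illustrated with a small, NCCL-free sketch. Everything in it (ToyTask, ToyService, RunToyDemo, the log strings) is a hypothetical stand-in rather than ONNX Runtime code, and it assumes the worker starts a group only once every task planned for that group has been submitted.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for one planned Send or Recv. No NCCL call is made.
struct ToyTask {
  ToyTask(bool send, int peer_rank) : is_send(send), peer(peer_rank) {}
  bool is_send;
  int peer;
  bool enqueued = false;
  bool finished = false;
};

class ToyService {
 public:
  // Planning phase: each call registers one group that runs as a batch.
  void PlanGroup(std::vector<ToyTask> group) { schedule_.push_back(std::move(group)); }

  // Launch phase: start the background worker thread.
  void Launch() { worker_ = std::thread([this] { Run(); }); }

  // Called by compute threads: claim the matching planned task in the
  // current group, then block until the worker marks it finished.
  void SubmitAndWait(bool is_send, int peer) {
    std::unique_lock<std::mutex> lock(mutex_);
    ToyTask* task = nullptr;
    cv_.wait(lock, [&] {
      if (time_ >= schedule_.size()) return false;
      for (ToyTask& t : schedule_[time_]) {
        if (!t.enqueued && t.is_send == is_send && t.peer == peer) {
          task = &t;
          return true;
        }
      }
      return false;
    });
    assert(task != nullptr);
    task->enqueued = true;
    cv_.notify_all();
    cv_.wait(lock, [&] { return task->finished; });
  }

  // Join the worker once every planned group has run.
  void Terminate() {
    if (worker_.joinable()) worker_.join();
  }

  const std::vector<std::string>& log() const { return log_; }

 private:
  void Run() {
    std::unique_lock<std::mutex> lock(mutex_);
    while (time_ < schedule_.size()) {
      std::vector<ToyTask>& group = schedule_[time_];
      // Sleep until every task of the current group has been submitted.
      cv_.wait(lock, [&] {
        for (const ToyTask& t : group) {
          if (!t.enqueued) return false;
        }
        return true;
      });
      // A real service would issue the batched NCCL calls here.
      for (ToyTask& t : group) {
        log_.push_back(std::string(t.is_send ? "send->" : "recv<-") + std::to_string(t.peer));
        t.finished = true;
      }
      ++time_;
      cv_.notify_all();
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::vector<std::vector<ToyTask>> schedule_;
  std::size_t time_ = 0;  // Index of the group currently being processed.
  std::vector<std::string> log_;
  std::thread worker_;
};

// Plans two groups, submits from two compute threads, returns the execution log.
std::vector<std::string> RunToyDemo() {
  ToyService service;
  service.PlanGroup({ToyTask(true, 1), ToyTask(false, 1)});
  service.PlanGroup({ToyTask(true, 2)});
  service.Launch();
  std::thread a([&] {
    service.SubmitAndWait(true, 1);
    service.SubmitAndWait(true, 2);
  });
  std::thread b([&] { service.SubmitAndWait(false, 1); });
  a.join();
  b.join();
  service.Terminate();
  return service.log();
}
```

In this sketch the log order follows the planned order, not the order in which threads happen to submit, because the worker walks each group's planned task list. A submission that matches no planned task would block forever here; a production service would need to report that case instead.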
Usage
Used in distributed training with NCCL P2P support enabled. Compute threads call SubmitSendAndWait or SubmitRecvAndWait to enqueue communication operations that are executed by the background service thread.
Code Reference
Source Location
- Repository: Microsoft_Onnxruntime
- File: orttraining/orttraining/training_ops/cuda/communication/nccl_service.cc
- Lines: 1-384
Signature
```cpp
INcclService& GetINcclService();

struct NcclTask {
  enum class Type { SEND, RECV };
  bool Compare(const NcclTask& other) const;
  void ResetTask();
};

struct NcclTaskGroup {
  void PlanTask(const NcclTask::Type type, const std::vector<int> peers);
  const NcclTask* EqueueTask(const NcclTask::Type type, const std::vector<int> peers,
                             void* ptr, const size_t size, const std::string info);
  bool IsAllTasksEqueued() const;
  bool IsAllTasksFinished() const;
  void ResetAllTasks();
};

class NcclService {
  void PlanStart();
  void PlanEnd();
  void PlanNewGroupStart();
  void PlanNewGroupEnd();
  void PlanSend(const int dst);
  void PlanRecv(const int src);
  void SubmitSendAndWait(void* ptr, size_t size, int peer);
  void SubmitRecvAndWait(void* ptr, size_t size, int peer);
  void Initialize();
  void Launch();
  void Reset();
  void Terminate();
};
```
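To make the task-group bookkeeping more concrete, here is a minimal sketch of how planned slots might be filled and reset. The types ToyNcclTask and ToyTaskGroup, their fields, and the choice to return nullptr when nothing matches are assumptions made for illustration; the actual members and error handling in nccl_service.h may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

enum class ToyType { SEND, RECV };

// Hypothetical mirror of a planned task: direction, peers, buffer, and state flags.
struct ToyNcclTask {
  ToyNcclTask(ToyType t, std::vector<int> p) : type(t), peers(std::move(p)) {}

  // A submission matches a planned slot with the same direction and peers.
  bool Matches(ToyType other_type, const std::vector<int>& other_peers) const {
    return type == other_type && peers == other_peers;
  }

  // Clear per-iteration state but keep the plan (type and peers).
  void Reset() {
    ptr = nullptr;
    size = 0;
    is_enqueued = false;
    is_finished = false;
  }

  ToyType type;
  std::vector<int> peers;
  void* ptr = nullptr;
  std::size_t size = 0;
  bool is_enqueued = false;
  bool is_finished = false;
};

// Hypothetical mirror of a group: tasks that would share one batched NCCL call.
struct ToyTaskGroup {
  // Planning: reserve a slot. Planning is assumed to finish before any enqueue,
  // so pointers returned by EnqueueTask stay valid.
  void PlanTask(ToyType type, const std::vector<int>& peers) {
    tasks.emplace_back(type, peers);
  }

  // Execution: fill the first free matching slot; nullptr if none matches
  // (an assumption of this sketch).
  const ToyNcclTask* EnqueueTask(ToyType type, const std::vector<int>& peers,
                                 void* ptr, std::size_t size) {
    for (ToyNcclTask& t : tasks) {
      if (!t.is_enqueued && t.Matches(type, peers)) {
        t.ptr = ptr;
        t.size = size;
        t.is_enqueued = true;
        return &t;
      }
    }
    return nullptr;
  }

  bool AllEnqueued() const {
    for (const ToyNcclTask& t : tasks) {
      if (!t.is_enqueued) return false;
    }
    return true;
  }

  bool AllFinished() const {
    for (const ToyNcclTask& t : tasks) {
      if (!t.is_finished) return false;
    }
    return true;
  }

  // Make the same plan reusable for the next iteration.
  void ResetAll() {
    for (ToyNcclTask& t : tasks) t.Reset();
  }

  std::vector<ToyNcclTask> tasks;
};
```

Keeping the plan separate from the per-iteration state is what lets a reset clear only the flags and buffers, so the same schedule can be replayed on the next gradient accumulation step.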
Import
```cpp
#include "orttraining/training_ops/cuda/communication/nccl_service.h"
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| ptr | void* | Yes | Pointer to device memory buffer for send/receive data |
| size | size_t | Yes | Size in bytes of the data to send or receive |
| peer | int | Yes | Rank of the remote peer process |
Outputs
| Name | Type | Description |
|---|---|---|
| (blocking) | void | SubmitSendAndWait and SubmitRecvAndWait block until the NCCL operation completes |
Usage Examples
```cpp
// Get the singleton NCCL service instance.
auto& nccl_service = NcclService::GetInstance();

// Planning phase: register one group with a send to and a receive from rank 1.
nccl_service.PlanStart();
nccl_service.PlanNewGroupStart();
nccl_service.PlanSend(/*dst=*/1);
nccl_service.PlanRecv(/*src=*/1);
nccl_service.PlanNewGroupEnd();
nccl_service.PlanEnd();

// Launch the background service thread.
nccl_service.Launch();

// Execution phase, from compute threads. Each call blocks until its operation
// completes, so calls that belong to the same planned group are presumably
// issued from different threads.
nccl_service.SubmitSendAndWait(device_ptr, num_bytes, /*peer=*/1);
nccl_service.SubmitRecvAndWait(recv_ptr, num_bytes, /*peer=*/1);

// Reset the service for the next iteration.
nccl_service.Reset();
```