Implementation:Microsoft Onnxruntime CUDA NcclService

Knowledge Sources	Microsoft_Onnxruntime
Domains	Training, CUDA_Kernels
Last Updated	2026-02-10 04:00 GMT

Overview

Concrete tool for orchestrating NCCL peer-to-peer communication as a background service in the ONNX Runtime CUDA training framework.

Description

Implements the NcclService singleton class that manages scheduled NCCL point-to-point (Send/Recv) operations via a background worker thread. The service uses a planning phase where communication groups and tasks are registered via PlanSend/PlanRecv, followed by a launch phase where a dedicated thread processes tasks in batched NCCL group calls. Key classes include NcclTask (represents a single Send or Recv operation with peer, pointer, and size), NcclTaskGroup (a batch of tasks executed in a single ncclGroupStart/End block), and NcclService (the singleton service). The service initializes NCCL communicators using MPI for rank discovery and unique ID broadcast. Thread synchronization uses mutexes and condition variables to coordinate between compute threads submitting tasks and the background NCCL execution thread. The Reset method enables reuse across gradient accumulation iterations.

Usage

Used in distributed training with NCCL P2P support enabled. Compute threads call SubmitSendAndWait or SubmitRecvAndWait to enqueue communication operations that are executed by the background service thread.

Code Reference

Source Location

Repository: Microsoft_Onnxruntime
File: orttraining/orttraining/training_ops/cuda/communication/nccl_service.cc
Lines: 1-384

Signature

INcclService& GetINcclService();

struct NcclTask {
  enum class Type { SEND, RECV };
  bool Compare(const NcclTask& other) const;
  void ResetTask();
};

struct NcclTaskGroup {
  void PlanTask(const NcclTask::Type type, const std::vector<int> peers);
  const NcclTask* EqueueTask(const NcclTask::Type type, const std::vector<int> peers,
                              void* ptr, const size_t size, const std::string info);
  bool IsAllTasksEqueued() const;
  bool IsAllTasksFinished() const;
  void ResetAllTasks();
};

class NcclService {
  void PlanStart();
  void PlanEnd();
  void PlanNewGroupStart();
  void PlanNewGroupEnd();
  void PlanSend(const int dst);
  void PlanRecv(const int src);
  void SubmitSendAndWait(void* ptr, size_t size, int peer);
  void SubmitRecvAndWait(void* ptr, size_t size, int peer);
  void Initialize();
  void Launch();
  void Reset();
  void Terminate();
};

Import

#include "orttraining/training_ops/cuda/communication/nccl_service.h"

I/O Contract

Inputs

Name	Type	Required	Description
ptr	void*	Yes	Pointer to device memory buffer for send/receive data
size	size_t	Yes	Size in bytes of the data to send or receive
peer	int	Yes	Rank of the remote peer process

Outputs

Name	Type	Description
(blocking)	void	SubmitSendAndWait and SubmitRecvAndWait block until the NCCL operation completes

Usage Examples

// Get the singleton NCCL service instance
auto& nccl_service = NcclService::GetInstance();

// During graph planning phase
nccl_service.PlanStart();
nccl_service.PlanNewGroupStart();
nccl_service.PlanSend(/*dst=*/1);
nccl_service.PlanRecv(/*src=*/1);
nccl_service.PlanNewGroupEnd();
nccl_service.PlanEnd();

// Launch the background service thread
nccl_service.Launch();

// During execution, from compute threads
nccl_service.SubmitSendAndWait(device_ptr, num_bytes, /*peer=*/1);
nccl_service.SubmitRecvAndWait(recv_ptr, num_bytes, /*peer=*/1);

// Reset for next iteration
nccl_service.Reset();

Related Pages

Environment:Microsoft_Onnxruntime_CUDA_GPU_Environment

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment