Implementation:Microsoft Onnxruntime CUDA NcclService

From Leeroopedia


Knowledge Sources
Domains Training, CUDA_Kernels
Last Updated 2026-02-10 04:00 GMT

Overview

Concrete tool for orchestrating NCCL peer-to-peer communication as a background service in the ONNX Runtime CUDA training framework.

Description

Implements NcclService, a singleton class that runs scheduled NCCL point-to-point (Send/Recv) operations on a background worker thread. The service works in two phases. In the planning phase, communication groups and their tasks are registered through PlanSend and PlanRecv. In the launch phase, a dedicated thread processes the tasks in batched NCCL group calls.

Three classes carry the design. NcclTask represents a single Send or Recv operation with its peer, buffer pointer, and size. NcclTaskGroup is a batch of tasks executed inside a single ncclGroupStart/ncclGroupEnd block. NcclService is the singleton that owns the groups and the worker thread.

The service initializes its NCCL communicators using MPI for rank discovery and for broadcasting the NCCL unique ID. Mutexes and condition variables coordinate the compute threads that submit tasks with the background thread that executes them. The Reset method clears task state so the same plan can be reused across gradient accumulation iterations.
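The relationship between a planned task and a submitted task can be illustrated with a small stand-alone sketch. It uses only the standard library and does not call NCCL. The names `Task`, `TaskGroup`, `Matches`, `Plan`, `Enqueue`, `AllEnqueued`, and `ResetAll`, as well as the `is_enqueued` and `is_finished` flags, are hypothetical stand-ins chosen for illustration. They are not the ONNX Runtime definitions, and the matching rule shown here (same direction and same peers) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-ins for NcclTask / NcclTaskGroup (names and fields are assumptions).
enum class TaskType { SEND, RECV };

struct Task {
  TaskType type;
  std::vector<int> peers;
  void* ptr = nullptr;
  std::size_t size = 0;
  bool is_enqueued = false;
  bool is_finished = false;

  // Assumed matching rule: same direction and same peer list.
  bool Matches(TaskType other_type, const std::vector<int>& other_peers) const {
    return type == other_type && peers == other_peers;
  }

  // Clear per-iteration state but keep the planned shape (type and peers).
  void Reset() {
    ptr = nullptr;
    size = 0;
    is_enqueued = false;
    is_finished = false;
  }
};

struct TaskGroup {
  std::vector<Task> tasks;

  // Planning phase: reserve a slot for a task of this shape.
  void Plan(TaskType type, std::vector<int> peers) {
    Task t{};
    t.type = type;
    t.peers = std::move(peers);
    tasks.push_back(std::move(t));
  }

  // Execution phase: bind a buffer to the first free slot with a matching shape.
  // Returns nullptr when no planned slot matches.
  const Task* Enqueue(TaskType type, const std::vector<int>& peers,
                      void* ptr, std::size_t size) {
    assert(ptr != nullptr);
    for (Task& t : tasks) {
      if (!t.is_enqueued && t.Matches(type, peers)) {
        t.ptr = ptr;
        t.size = size;
        t.is_enqueued = true;
        return &t;
      }
    }
    return nullptr;
  }

  // The batch is ready to run once every planned slot has a buffer.
  bool AllEnqueued() const {
    for (const Task& t : tasks) {
      if (!t.is_enqueued) return false;
    }
    return true;
  }

  // Prepare the same plan for the next iteration.
  void ResetAll() {
    for (Task& t : tasks) t.Reset();
  }
};
```

In this sketch the plan fixes how many tasks a group holds and what shape each has, so the executing side only needs to check `AllEnqueued()` before issuing the batch.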

Usage

Use this service in distributed training with NCCL P2P support enabled. Compute threads call SubmitSendAndWait or SubmitRecvAndWait to enqueue a communication operation, and each call blocks until the background service thread has executed it.
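The submit-and-wait handshake between compute threads and a background thread can be sketched with a mutex and a condition variable. This is a minimal illustration under stated assumptions, not the ONNX Runtime implementation: `MiniService`, `SubmitAndWait`, and its members are hypothetical, each operation is a plain callable standing in for an NCCL call, and the worker is assumed to run a batch only once all planned operations have been submitted.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical miniature of the submit-and-wait pattern (not the ONNX Runtime class).
class MiniService {
 public:
  // `planned` is the number of operations that make up one batch.
  explicit MiniService(std::size_t planned) : planned_(planned) {
    assert(planned > 0);
    worker_ = std::thread([this] { Run(); });
  }

  ~MiniService() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stop_ = true;
    }
    cv_.notify_all();
    worker_.join();
  }

  // Enqueue one operation, then block until the batch containing it has run.
  void SubmitAndWait(std::function<void()> op) {
    std::unique_lock<std::mutex> lock(mutex_);
    // Wait for a free slot in the batch that is currently being filled.
    cv_.wait(lock, [this] { return pending_.size() < planned_; });
    const std::size_t my_batch = filling_batch_;
    pending_.push_back(std::move(op));
    cv_.notify_all();
    // Wait until the worker reports that this batch has finished.
    cv_.wait(lock, [&] { return batches_done_ > my_batch; });
  }

 private:
  void Run() {
    for (;;) {
      std::vector<std::function<void()>> batch;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return stop_ || pending_.size() == planned_; });
        if (stop_) return;
        batch.swap(pending_);  // pending_ is now empty and accepts the next batch
        ++filling_batch_;
      }
      cv_.notify_all();
      // In a real service this is where the grouped communication calls would go.
      for (auto& op : batch) op();
      {
        std::lock_guard<std::mutex> lock(mutex_);
        ++batches_done_;
      }
      cv_.notify_all();
    }
  }

  const std::size_t planned_;
  std::mutex mutex_;
  std::condition_variable cv_;
  std::vector<std::function<void()>> pending_;
  std::size_t filling_batch_ = 0;  // index of the batch currently accepting operations
  std::size_t batches_done_ = 0;   // number of batches fully executed
  bool stop_ = false;
  std::thread worker_;  // declared last so all other members exist before it starts
};
```

Because the sketch runs a batch only when it is complete, the operations of one batch have to be submitted from different threads. A single thread that submits two operations of the same batch back to back would block on the first call.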

Code Reference

Source Location

Signature

INcclService& GetINcclService();

struct NcclTask {
  enum class Type { SEND, RECV };
  bool Compare(const NcclTask& other) const;
  void ResetTask();
};

struct NcclTaskGroup {
  void PlanTask(const NcclTask::Type type, const std::vector<int> peers);
  const NcclTask* EqueueTask(const NcclTask::Type type, const std::vector<int> peers,
                              void* ptr, const size_t size, const std::string info);
  bool IsAllTasksEqueued() const;
  bool IsAllTasksFinished() const;
  void ResetAllTasks();
};

class NcclService {
  void PlanStart();
  void PlanEnd();
  void PlanNewGroupStart();
  void PlanNewGroupEnd();
  void PlanSend(const int dst);
  void PlanRecv(const int src);
  void SubmitSendAndWait(void* ptr, size_t size, int peer);
  void SubmitRecvAndWait(void* ptr, size_t size, int peer);
  void Initialize();
  void Launch();
  void Reset();
  void Terminate();
};

Import

#include "orttraining/training_ops/cuda/communication/nccl_service.h"

I/O Contract

Inputs

Name | Type | Required | Description
ptr | void* | Yes | Pointer to device memory buffer for send/receive data
size | size_t | Yes | Size in bytes of the data to send or receive
peer | int | Yes | Rank of the remote peer process

Outputs

Name | Type | Description
(blocking) | void | SubmitSendAndWait and SubmitRecvAndWait block until the NCCL operation completes
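Since the size argument is a byte count and not an element count, callers holding typed buffers need a conversion. The helper below is a hypothetical convenience, not part of the service's API. It multiplies the element count by the element size and asserts that the product fits in size_t.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical helper: byte size of a buffer holding `num_elements` values of type T.
template <typename T>
std::size_t BufferBytes(std::size_t num_elements) {
  // Guard against overflow in the multiplication below.
  assert(num_elements <= std::numeric_limits<std::size_t>::max() / sizeof(T));
  return num_elements * sizeof(T);
}
```

For example, a buffer of 1024 float values would be passed with `BufferBytes<float>(1024)` as the size argument.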

Usage Examples

// Get the singleton NCCL service instance
auto& nccl_service = NcclService::GetInstance();

// During graph planning phase
nccl_service.PlanStart();
nccl_service.PlanNewGroupStart();
nccl_service.PlanSend(/*dst=*/1);
nccl_service.PlanRecv(/*src=*/1);
nccl_service.PlanNewGroupEnd();
nccl_service.PlanEnd();

// Launch the background service thread
nccl_service.Launch();

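// During execution, from compute threads.
// device_ptr, recv_ptr, and num_bytes are placeholders for the caller's
// device buffers and their size in bytes.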
nccl_service.SubmitSendAndWait(device_ptr, num_bytes, /*peer=*/1);
nccl_service.SubmitRecvAndWait(recv_ptr, num_bytes, /*peer=*/1);

// Reset for next iteration
nccl_service.Reset();
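To make the bracketing of the planning calls concrete, the following sketch records what a sequence of Plan* calls describes. `PlanRecorder` and `PlannedOp` are hypothetical and mimic only the call order. The assertions encode an assumed rule that PlanSend and PlanRecv must appear between PlanNewGroupStart and PlanNewGroupEnd, and that groups must appear between PlanStart and PlanEnd. Whether the real service enforces this is not stated in the source.

```cpp
#include <cassert>
#include <vector>

// Hypothetical record of one planned operation.
struct PlannedOp {
  bool is_send;
  int peer;
};

// Hypothetical recorder that mimics only the bracketing of the Plan* calls.
class PlanRecorder {
 public:
  void PlanStart() {
    assert(!planning_);
    planning_ = true;
    groups_.clear();
  }
  void PlanNewGroupStart() {
    assert(planning_ && !in_group_);
    in_group_ = true;
    groups_.emplace_back();  // open a new, empty group
  }
  void PlanSend(int dst) {
    assert(in_group_);
    groups_.back().push_back(PlannedOp{true, dst});
  }
  void PlanRecv(int src) {
    assert(in_group_);
    groups_.back().push_back(PlannedOp{false, src});
  }
  void PlanNewGroupEnd() {
    assert(in_group_);
    in_group_ = false;
  }
  void PlanEnd() {
    assert(planning_ && !in_group_);
    planning_ = false;
  }

  // Each inner vector corresponds to one batch that would be issued together.
  const std::vector<std::vector<PlannedOp>>& groups() const { return groups_; }

 private:
  bool planning_ = false;
  bool in_group_ = false;
  std::vector<std::vector<PlannedOp>> groups_;
};
```

Replaying the planning calls from the example above against this recorder yields one group with two operations: a send to rank 1 followed by a receive from rank 1.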
