Implementation:InternLM Lmdeploy DeviceComm

From Leeroopedia


Knowledge Sources
Domains Communication, GPU_Computing
Last Updated 2026-02-07 15:00 GMT

Overview

Defines the abstract interface and smart-pointer wrapper for GPU device-level collective communication operations (AllReduce, AllGather, Broadcast, etc.).

Description

DeviceCommImpl is an abstract base class that defines the contract for device-level (GPU) collective communication. It declares virtual methods for AllReduceSum, AllGather, ReduceScatter, and Broadcast, along with specialized fused operations such as AllreduceResidualBiasRMSnorm and AllGather2D. It also declares memory management methods (Allocate, Free, Register, Deregister) for communication buffers, group handling methods (n_ranks, rank, Split), and a Query method for checking backend capabilities such as kHasAllGather2D. The DeviceComm wrapper class holds a std::unique_ptr<DeviceCommImpl> and exposes it through operator-> and an implicit conversion to DeviceCommImpl*, so callers use it like a pointer. The CreateDeviceCommunicator factory function instantiates the backend implementation selected by its backend argument.
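The wrapper pattern described above can be illustrated with a minimal, self-contained sketch. All names here (CommImplSketch, LocalImplSketch, CommSketch) are hypothetical stand-ins, not the TurboMind classes, and only a single rank-count query is modelled.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for DeviceCommImpl: only one query is modelled.
class CommImplSketch {
public:
    virtual ~CommImplSketch() = default;
    virtual int n_ranks(int group) const = 0;
};

// Hypothetical single-process backend, used only for illustration.
class LocalImplSketch: public CommImplSketch {
public:
    int n_ranks(int /*group*/) const override { return 1; }
};

// Wrapper in the style of DeviceComm: it owns the implementation and
// forwards access through operator-> and an implicit pointer conversion.
class CommSketch {
public:
    CommSketch() = default;
    CommSketch(std::unique_ptr<CommImplSketch> impl): impl_{std::move(impl)} {}

    CommImplSketch* operator->() const noexcept { return impl_.get(); }
    operator CommImplSketch*() const noexcept { return impl_.get(); }

private:
    std::unique_ptr<CommImplSketch> impl_;  // null when default-constructed
};
```

With this shape, a default-constructed wrapper converts to a null pointer, so a caller can check it for validity before dereferencing it with `->`.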

Usage

Used by TurboMind model layers to perform GPU-to-GPU collective operations during tensor-parallel inference. The device communicator is created once during model initialization and passed to layers that require inter-GPU communication.
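The create-once, pass-to-layers arrangement might look roughly like the following sketch. The ownership model shown (the model owns the communicator, layers hold non-owning pointers) is an assumption, and CommStub, LayerSketch, and ModelSketch are hypothetical names rather than TurboMind types.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for the communicator implementation.
struct CommStub {
    int calls = 0;
    void AllReduceSum() { ++calls; }  // placeholder for the real collective
};

// Hypothetical layer: stores a non-owning pointer to the shared communicator.
class LayerSketch {
public:
    explicit LayerSketch(CommStub* comm): comm_{comm} {}
    void Forward()
    {
        if (comm_) {  // layers without a communicator skip the collective
            comm_->AllReduceSum();
        }
    }

private:
    CommStub* comm_;
};

// Hypothetical model: creates the communicator once and hands it to each layer.
class ModelSketch {
public:
    explicit ModelSketch(int n_layers): comm_{std::make_unique<CommStub>()}
    {
        for (int i = 0; i < n_layers; ++i) {
            layers_.emplace_back(comm_.get());
        }
    }
    void Forward()
    {
        for (auto& layer : layers_) {
            layer.Forward();
        }
    }
    int comm_calls() const { return comm_->calls; }

private:
    std::unique_ptr<CommStub> comm_;   // declared first, so destroyed after layers_
    std::vector<LayerSketch>  layers_;
};
```

Because every layer shares the same communicator, one forward pass over three layers issues three collectives on it.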

Code Reference

Source Location

Signature

namespace turbomind::comm {

enum QueryAttr { kHasAllGather2D };

class DeviceCommImpl {
public:
    virtual ~DeviceCommImpl();
    virtual int n_ranks(int group) const = 0;
    virtual int rank(int group) const = 0;
    virtual void* Allocate(size_t size) = 0;
    virtual void Free(void* ptr) = 0;
    virtual void Register(void* ptr, size_t size) = 0;
    virtual void Deregister(void* ptr) = 0;
    virtual int Split(int color, int key, int group);
    virtual int Query(QueryAttr attr) const noexcept = 0;

    virtual void AllReduceSum(const void* sendbuff, void* recvbuff, size_t count,
                              DataType type, int group, cudaStream_t stream) = 0;
    virtual void AllGather(const void* sendbuff, void* recvbuff, size_t sendcount,
                           DataType type, int group, cudaStream_t stream) = 0;
    virtual void ReduceScatter(const void* sendbuff, void* recvbuff, size_t recvcount,
                               DataType type, int group, cudaStream_t stream);
    virtual void Broadcast(const void* sendbuff, void* recvbuff, size_t count,
                           DataType type, int root, int group, cudaStream_t stream);
};

class DeviceComm {
public:
    DeviceComm() = default;
    DeviceComm(std::unique_ptr<DeviceCommImpl> impl);
    DeviceCommImpl* operator->() const noexcept;
    operator DeviceCommImpl*() const noexcept;
};

DeviceComm CreateDeviceCommunicator(const std::string& backend,
                                    int n_ranks, int rank, HostComm h_comm);

}  // namespace turbomind::comm
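
A capability check through Query could be used to pick between code paths, as in the sketch below. It assumes that a non-zero return value means the capability is available, which the signature alone does not confirm. BackendSketch, With2D, Without2D, and ChooseGatherPath are hypothetical names.

```cpp
#include <cassert>
#include <string>

enum QueryAttr { kHasAllGather2D };

// Hypothetical backend interface; only Query is modelled.
struct BackendSketch {
    virtual ~BackendSketch() = default;
    virtual int Query(QueryAttr attr) const noexcept = 0;
};

// Hypothetical backend that reports AllGather2D support.
struct With2D: BackendSketch {
    int Query(QueryAttr attr) const noexcept override { return attr == kHasAllGather2D ? 1 : 0; }
};

// Hypothetical backend that reports no optional capabilities.
struct Without2D: BackendSketch {
    int Query(QueryAttr /*attr*/) const noexcept override { return 0; }
};

// Assumption: a non-zero Query result means the capability is available.
inline std::string ChooseGatherPath(const BackendSketch& comm)
{
    return comm.Query(kHasAllGather2D) != 0 ? "AllGather2D" : "AllGather";
}
```

A caller would run this check once at setup and fall back to the plain AllGather path on backends that do not report the capability.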

Import

#include "src/turbomind/comm/device_comm.h"

I/O Contract

Inputs

Name Type Required Description
backend std::string Yes Backend identifier string for the communication library
n_ranks int Yes Total number of ranks in the group
rank int Yes This device's rank index
h_comm HostComm Yes Host communicator used during device communicator setup
sendbuff const void* Yes Source data buffer on GPU
recvbuff void* Yes Destination data buffer on GPU
count size_t Yes Number of elements to communicate (named sendcount in AllGather and recvcount in ReduceScatter)
type DataType Yes Element data type
group int Yes Communication group identifier
stream cudaStream_t Yes CUDA stream for asynchronous execution
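
To make the count parameters concrete, the following host-side reference simulates the collectives on plain vectors, one buffer per rank. It assumes the NCCL-style conventions that the parameter names suggest (sendcount is per rank for AllGather, recvcount is per rank for ReduceScatter). It is a sketch of the semantics only, not of any GPU implementation, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Buffers = std::vector<std::vector<float>>;  // one host buffer per simulated rank

// AllReduceSum: every rank ends with the element-wise sum of all `count`-element inputs.
inline Buffers AllReduceSumRef(const Buffers& send, std::size_t count)
{
    std::vector<float> sum(count, 0.f);
    for (const auto& buf : send) {
        for (std::size_t i = 0; i < count; ++i) {
            sum[i] += buf[i];
        }
    }
    return Buffers(send.size(), sum);
}

// AllGather: every rank ends with n_ranks * sendcount elements, ordered by rank.
inline Buffers AllGatherRef(const Buffers& send, std::size_t sendcount)
{
    std::vector<float> all;
    for (const auto& buf : send) {
        for (std::size_t i = 0; i < sendcount; ++i) {
            all.push_back(buf[i]);
        }
    }
    return Buffers(send.size(), all);
}

// ReduceScatter: inputs hold n_ranks * recvcount elements; rank r keeps slice r of the sum.
inline Buffers ReduceScatterRef(const Buffers& send, std::size_t recvcount)
{
    const std::size_t n    = send.size();
    const Buffers     sums = AllReduceSumRef(send, n * recvcount);
    Buffers           recv(n);
    for (std::size_t r = 0; r < n; ++r) {
        for (std::size_t i = 0; i < recvcount; ++i) {
            recv[r].push_back(sums[r][r * recvcount + i]);
        }
    }
    return recv;
}
```

Under these assumptions, the receive buffer for AllGather must hold n_ranks * sendcount elements, while the send buffer for ReduceScatter must hold n_ranks * recvcount elements.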

Outputs

Name Type Description
DeviceComm DeviceComm Smart-pointer wrapper around the device communicator implementation

Usage Examples

#include "src/turbomind/comm/device_comm.h"

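// n_ranks, rank, h_comm, the device buffers, group, and stream are assumed
// to have been set up by the caller before this point.
auto d_comm = turbomind::comm::CreateDeviceCommunicator("nccl", n_ranks, rank, h_comm);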

// Perform an all-reduce sum on GPU
d_comm->AllReduceSum(send_buf, recv_buf, count, turbomind::kFloat16, group, stream);

// Perform an all-gather on GPU
d_comm->AllGather(send_buf, recv_buf, send_count, turbomind::kFloat16, group, stream);
