Implementation:InternLM Lmdeploy DeviceComm
| Knowledge Sources | |
|---|---|
| Domains | Communication, GPU_Computing |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Defines the abstract interface and smart-pointer wrapper for GPU device-level collective communication operations (AllReduce, AllGather, Broadcast, etc.).
Description
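DeviceCommImpl is an abstract base class that defines the contract for device-level (GPU) collective communication. It declares virtual methods for AllReduceSum, AllGather, ReduceScatter, and Broadcast, as well as specialized fused operations such as AllreduceResidualBiasRMSnorm and AllGather2D (the fused operations are not reproduced in the abridged signature on this page). Most methods are pure virtual; in the signature listed here, Split, ReduceScatter, and Broadcast are declared without `= 0`, so a backend is not forced to override them. The class also provides memory-management methods (Allocate, Free, Register, Deregister) for communication buffers and a Query method for checking backend capabilities such as kHasAllGather2D. The DeviceComm wrapper class holds a std::unique_ptr<DeviceCommImpl> and exposes pointer-like access through operator-> and an implicit conversion to DeviceCommImpl*. The CreateDeviceCommunicator factory function instantiates the backend implementation selected by its backend argument.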
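The ownership pattern described above can be illustrated with a minimal sketch. The names MiniCommImpl, LoopbackImpl, MiniComm, and MakeLoopback are hypothetical stand-ins invented for this example; they only mirror the public surface shown in the signature and are not the lmdeploy implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>

// Hypothetical, trimmed-down stand-in for DeviceCommImpl (illustration only).
struct MiniCommImpl {
    virtual ~MiniCommImpl() = default;
    virtual int n_ranks(int group) const = 0;
    virtual int rank(int group) const = 0;
};

// Hypothetical single-process backend: one rank, index 0, for every group.
struct LoopbackImpl final : MiniCommImpl {
    int n_ranks(int /*group*/) const override { return 1; }
    int rank(int /*group*/) const override { return 0; }
};

// Wrapper modelled on DeviceComm's public members: it owns the implementation
// through std::unique_ptr and forwards pointer-like access to it.
class MiniComm {
public:
    MiniComm() = default;
    MiniComm(std::unique_ptr<MiniCommImpl> impl) : impl_(std::move(impl)) {}

    MiniCommImpl* operator->() const noexcept { return impl_.get(); }
    operator MiniCommImpl*() const noexcept { return impl_.get(); }

private:
    std::unique_ptr<MiniCommImpl> impl_;
};

// Hypothetical factory, analogous in spirit to CreateDeviceCommunicator.
inline MiniComm MakeLoopback() {
    return MiniComm{std::make_unique<LoopbackImpl>()};
}
```

A default-constructed wrapper converts to a null pointer, so callers can test it before use. Because the wrapper owns a std::unique_ptr, it is movable but not copyable.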
Usage
Used by TurboMind model layers to perform GPU-to-GPU collective operations during tensor-parallel inference. The device communicator is created once during model initialization and passed to layers that require inter-GPU communication.
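To show why a tensor-parallel layer needs an all-reduce, the following host-only sketch simulates a row-parallel matrix-vector product: each rank multiplies its shard of the input by its shard of the weight rows, and a sum all-reduce gives every rank the full result. This is a generic illustration of the technique, not TurboMind code; SimulatedAllReduceSum and RowParallelMatVec are hypothetical names, and the sketch assumes the input length is divisible by the number of ranks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side stand-in for AllReduceSum: every rank's buffer is replaced by the
// element-wise sum over all ranks.
inline void SimulatedAllReduceSum(std::vector<std::vector<float>>& per_rank) {
    if (per_rank.empty()) return;
    std::vector<float> total(per_rank[0].size(), 0.0f);
    for (const auto& buf : per_rank)
        for (std::size_t i = 0; i < total.size(); ++i) total[i] += buf[i];
    for (auto& buf : per_rank) buf = total;
}

// Row-parallel y = x * W, with W stored row-major as k rows of n columns.
// Rank r owns rows [r * shard, (r + 1) * shard) of W and the matching slice of x.
// Assumes x.size() is divisible by n_ranks.
inline std::vector<std::vector<float>> RowParallelMatVec(
    const std::vector<float>& x, const std::vector<float>& w,
    std::size_t n, int n_ranks) {
    const std::size_t ranks = static_cast<std::size_t>(n_ranks);
    const std::size_t shard = x.size() / ranks;
    std::vector<std::vector<float>> out(ranks, std::vector<float>(n, 0.0f));
    for (std::size_t r = 0; r < ranks; ++r)
        for (std::size_t i = r * shard; i < (r + 1) * shard; ++i)
            for (std::size_t j = 0; j < n; ++j)
                out[r][j] += x[i] * w[i * n + j];  // partial sum local to rank r
    SimulatedAllReduceSum(out);  // after this, every rank holds the full y
    return out;
}
```

On real hardware the partial sums live in GPU memory on different devices, and the reduction step is what the device communicator's AllReduceSum performs.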
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/comm/device_comm.h
- Lines: 1-147
Signature
namespace turbomind::comm {

enum QueryAttr { kHasAllGather2D };

class DeviceCommImpl {
public:
    virtual ~DeviceCommImpl();
    virtual int n_ranks(int group) const = 0;
    virtual int rank(int group) const = 0;
    virtual void* Allocate(size_t size) = 0;
    virtual void Free(void* ptr) = 0;
    virtual void Register(void* ptr, size_t size) = 0;
    virtual void Deregister(void* ptr) = 0;
    virtual int Split(int color, int key, int group);
    virtual int Query(QueryAttr attr) const noexcept = 0;
    virtual void AllReduceSum(const void* sendbuff, void* recvbuff, size_t count,
                              DataType type, int group, cudaStream_t stream) = 0;
    virtual void AllGather(const void* sendbuff, void* recvbuff, size_t sendcount,
                           DataType type, int group, cudaStream_t stream) = 0;
    virtual void ReduceScatter(const void* sendbuff, void* recvbuff, size_t recvcount,
                               DataType type, int group, cudaStream_t stream);
    virtual void Broadcast(const void* sendbuff, void* recvbuff, size_t count,
                           DataType type, int root, int group, cudaStream_t stream);
};

class DeviceComm {
public:
    DeviceComm() = default;
    DeviceComm(std::unique_ptr<DeviceCommImpl> impl);
    DeviceCommImpl* operator->() const noexcept;
    operator DeviceCommImpl*() const noexcept;
};

DeviceComm CreateDeviceCommunicator(const std::string& backend,
                                    int n_ranks, int rank, HostComm h_comm);

} // namespace turbomind::comm
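Allocate and Free are declared as a pair, so callers may find it convenient to tie them to a scope. The sketch below shows one way to do that with an RAII guard. MemApi, HostMock, and CommBuffer are hypothetical names for this example; the mock uses host memory, whereas a real backend would presumably return device memory, and the sketch makes no claim about when Register or Deregister are required.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in exposing only the Allocate/Free pair from the signature.
struct MemApi {
    virtual ~MemApi() = default;
    virtual void* Allocate(std::size_t size) = 0;
    virtual void Free(void* ptr) = 0;
};

// Host-memory mock that counts outstanding allocations (illustration only).
struct HostMock final : MemApi {
    int live = 0;
    void* Allocate(std::size_t size) override { ++live; return std::malloc(size); }
    void Free(void* ptr) override { --live; std::free(ptr); }
};

// RAII guard: allocates on construction and frees on destruction, so a buffer
// cannot outlive its scope or be freed twice.
class CommBuffer {
public:
    CommBuffer(MemApi& comm, std::size_t size)
        : comm_(comm), ptr_(comm.Allocate(size)), size_(size) {}
    ~CommBuffer() { if (ptr_) comm_.Free(ptr_); }

    CommBuffer(const CommBuffer&) = delete;
    CommBuffer& operator=(const CommBuffer&) = delete;

    void* data() const noexcept { return ptr_; }
    std::size_t size() const noexcept { return size_; }

private:
    MemApi& comm_;
    void* ptr_;
    std::size_t size_;
};
```

The guard holds a reference to the communicator, so it must be destroyed before the communicator itself.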
Import
#include "src/turbomind/comm/device_comm.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| backend | std::string | Yes | Backend identifier string for the communication library (factory argument) |
| n_ranks | int | Yes | Total number of ranks in the group (factory argument) |
| rank | int | Yes | This device's rank index (factory argument) |
| h_comm | HostComm | Yes | Host communicator used during device communicator setup (factory argument) |
| sendbuff | const void* | Yes | Source data buffer on GPU |
| recvbuff | void* | Yes | Destination data buffer on GPU |
| count | size_t | Yes | Number of elements to communicate; named sendcount in AllGather and recvcount in ReduceScatter |
| type | DataType | Yes | Element data type |
| root | int | Broadcast only | Rank index passed as the broadcast root |
| group | int | Yes | Communication group identifier |
| stream | cudaStream_t | Yes | CUDA stream for asynchronous execution |
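The signature names the element count differently per collective (count, sendcount, recvcount). Assuming the interface follows the NCCL convention for these names, which the source does not state explicitly, the per-rank buffer sizes relate as in the sketch below. BufferCounts and the three helper functions are hypothetical names for this example.

```cpp
#include <cassert>
#include <cstddef>

// Element counts (not bytes) that each rank's send and receive buffers must hold.
struct BufferCounts {
    std::size_t send;
    std::size_t recv;
};

// AllReduceSum: both buffers hold `count` elements.
constexpr BufferCounts AllReduceCounts(std::size_t count) {
    return {count, count};
}

// AllGather (assumed NCCL convention): each rank contributes `sendcount`
// elements and receives the contributions of all ranks.
constexpr BufferCounts AllGatherCounts(std::size_t sendcount, int n_ranks) {
    return {sendcount, sendcount * static_cast<std::size_t>(n_ranks)};
}

// ReduceScatter (assumed NCCL convention): each rank supplies the full vector
// and keeps a reduced slice of `recvcount` elements.
constexpr BufferCounts ReduceScatterCounts(std::size_t recvcount, int n_ranks) {
    return {recvcount * static_cast<std::size_t>(n_ranks), recvcount};
}
```

Multiplying these counts by the byte size of the element type gives the allocation size for each buffer.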
Outputs
| Name | Type | Description |
|---|---|---|
| DeviceComm | DeviceComm | Smart-pointer wrapper around the device communicator implementation |
Usage Examples
#include "src/turbomind/comm/device_comm.h"

// n_ranks, rank, h_comm, the buffers, group, and stream are assumed to be set up by the caller.
auto d_comm = turbomind::comm::CreateDeviceCommunicator("nccl", n_ranks, rank, h_comm);

// Perform an all-reduce sum on GPU; the work is enqueued on `stream`
d_comm->AllReduceSum(send_buf, recv_buf, count, turbomind::kFloat16, group, stream);

// Perform an all-gather on GPU; send_count is the per-rank element count
d_comm->AllGather(send_buf, recv_buf, send_count, turbomind::kFloat16, group, stream);