Implementation:InternLM Lmdeploy DeviceComm

Knowledge Sources	InternLM_Lmdeploy
Domains	Communication, GPU_Computing
Last Updated	2026-02-07 15:00 GMT

Overview

Defines the abstract interface and smart-pointer wrapper for GPU device-level collective communication operations (AllReduce, AllGather, Broadcast, etc.).

Description

DeviceCommImpl is an abstract base class that defines the contract for device-level (GPU) collective communication. It declares virtual methods for AllReduceSum, AllGather, ReduceScatter, and Broadcast, plus specialized fused operations such as AllreduceResidualBiasRMSnorm and AllGather2D. Most methods are pure virtual; in the signature below, Split, ReduceScatter, and Broadcast are declared without "= 0", so a backend may leave them un-overridden. The signature listing is abridged and omits the fused operations.

The class also declares memory management methods (Allocate, Free, Register, Deregister) for communication buffers and a Query method for checking backend capabilities such as kHasAllGather2D. The DeviceComm wrapper class holds a std::unique_ptr<DeviceCommImpl> and provides transparent pointer-like access through operator-> and an implicit conversion to DeviceCommImpl*. The CreateDeviceCommunicator factory function instantiates the appropriate backend implementation.
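The interface, wrapper, and factory arrangement described above can be illustrated with a small stand-alone sketch. Everything in the sketch namespace is hypothetical (CommImpl, SingleRankImpl, Comm, CreateComm, and the "single" backend name are invented for illustration); it only mirrors the shape of DeviceCommImpl, DeviceComm, and CreateDeviceCommunicator and is not the lmdeploy implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

namespace sketch {

// Stand-in for the abstract interface. The real DeviceCommImpl declares many more methods.
class CommImpl {
public:
    virtual ~CommImpl() = default;
    virtual int n_ranks(int group) const = 0;
    virtual int rank(int group) const = 0;
};

// Hypothetical backend for a single process; exists only to make the sketch usable.
class SingleRankImpl: public CommImpl {
public:
    int n_ranks(int /*group*/) const override { return 1; }
    int rank(int /*group*/) const override { return 0; }
};

// Wrapper that owns the implementation and exposes pointer-like access,
// similar in shape to DeviceComm.
class Comm {
public:
    Comm() = default;
    Comm(std::unique_ptr<CommImpl> impl): impl_{std::move(impl)} {}

    CommImpl* operator->() const noexcept { return impl_.get(); }
    operator CommImpl*() const noexcept { return impl_.get(); }

private:
    std::unique_ptr<CommImpl> impl_;
};

// Hypothetical factory that selects an implementation by backend name.
// An unknown name yields an empty wrapper in this sketch.
inline Comm CreateComm(const std::string& backend)
{
    if (backend == "single") {
        return Comm{std::make_unique<SingleRankImpl>()};
    }
    return {};
}

}  // namespace sketch
```

Because the wrapper converts implicitly to a raw pointer, callers can test it for emptiness and pass it to functions that take the implementation pointer without giving up ownership.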

Usage

Used by TurboMind model layers to perform GPU-to-GPU collective operations during tensor-parallel inference. The device communicator is created once during model initialization and passed to layers that require inter-GPU communication.
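To show why a tensor-parallel layer needs an all-reduce, the following CPU-only sketch splits a dot product across simulated ranks and sums the partial results. The names PartialDot and TensorParallelDot are hypothetical, and the contiguous slicing scheme is an assumption chosen for illustration; it does not describe how TurboMind partitions its weights.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace tp_sketch {

// Partial dot product over the slice [begin, end) that one rank would own.
inline float PartialDot(const std::vector<float>& x,
                        const std::vector<float>& w,
                        std::size_t               begin,
                        std::size_t               end)
{
    float acc = 0.f;
    for (std::size_t i = begin; i < end; ++i) {
        acc += x[i] * w[i];
    }
    return acc;
}

// Simulates n_ranks devices, each holding a contiguous slice of the reduction
// dimension. Summing the per-rank partials plays the role of AllReduceSum.
inline float TensorParallelDot(const std::vector<float>& x,
                               const std::vector<float>& w,
                               std::size_t               n_ranks)
{
    assert(x.size() == w.size());
    const std::size_t n       = x.size();
    float             reduced = 0.f;
    for (std::size_t r = 0; r < n_ranks; ++r) {
        const std::size_t begin = n * r / n_ranks;
        const std::size_t end   = n * (r + 1) / n_ranks;
        reduced += PartialDot(x, w, begin, end);
    }
    return reduced;
}

}  // namespace tp_sketch
```

On real hardware each rank computes only its own partial result, and the sum is produced by the collective on the GPU stream rather than by a host loop.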

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/comm/device_comm.h
Lines: 1-147

Signature

namespace turbomind::comm {

enum QueryAttr { kHasAllGather2D };

class DeviceCommImpl {
public:
    virtual ~DeviceCommImpl();
    virtual int n_ranks(int group) const = 0;
    virtual int rank(int group) const = 0;
    virtual void* Allocate(size_t size) = 0;
    virtual void Free(void* ptr) = 0;
    virtual void Register(void* ptr, size_t size) = 0;
    virtual void Deregister(void* ptr) = 0;
    virtual int Split(int color, int key, int group);
    virtual int Query(QueryAttr attr) const noexcept = 0;

    virtual void AllReduceSum(const void* sendbuff, void* recvbuff, size_t count,
                              DataType type, int group, cudaStream_t stream) = 0;
    virtual void AllGather(const void* sendbuff, void* recvbuff, size_t sendcount,
                           DataType type, int group, cudaStream_t stream) = 0;
    virtual void ReduceScatter(const void* sendbuff, void* recvbuff, size_t recvcount,
                               DataType type, int group, cudaStream_t stream);
    virtual void Broadcast(const void* sendbuff, void* recvbuff, size_t count,
                           DataType type, int root, int group, cudaStream_t stream);
};

class DeviceComm {
public:
    DeviceComm() = default;
    DeviceComm(std::unique_ptr<DeviceCommImpl> impl);
    DeviceCommImpl* operator->() const noexcept;
    operator DeviceCommImpl*() const noexcept;
};

DeviceComm CreateDeviceCommunicator(const std::string& backend,
                                    int n_ranks, int rank, HostComm h_comm);

}  // namespace turbomind::comm

Import

#include "src/turbomind/comm/device_comm.h"

I/O Contract

Inputs

Name	Type	Required	Description
backend	std::string	Yes	Backend identifier string for the communication library
n_ranks	int	Yes	Total number of ranks in the group
rank	int	Yes	This device's rank index
h_comm	HostComm	Yes	Host communicator used during device communicator setup
sendbuff	const void*	Yes	Source data buffer on GPU
recvbuff	void*	Yes	Destination data buffer on GPU
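count	size_t	Yes	Number of elements to communicate (named sendcount in AllGather and recvcount in ReduceScatter)
type	DataType	Yes	Element data type
root	int	Broadcast only	Rank that supplies the data in Broadcast
group	int	Yes	Communication group identifier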
stream	cudaStream_t	Yes	CUDA stream for asynchronous execution
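The count parameters determine how large each buffer must be. The sketch below assumes NCCL-style conventions, which the parameter names sendcount and recvcount suggest but which this page does not state explicitly: AllGather receives sendcount elements from every rank, and ReduceScatter consumes recvcount elements per rank from the send buffer. The sizing namespace and its helpers are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

namespace sizing {

// Element counts required for the send and receive buffers of one rank.
struct Counts {
    std::size_t send;
    std::size_t recv;
};

// AllReduceSum: both buffers hold `count` elements.
constexpr Counts AllReduceCounts(std::size_t count)
{
    return {count, count};
}

// AllGather (assumed convention): receive buffer holds one chunk per rank.
constexpr Counts AllGatherCounts(std::size_t sendcount, int n_ranks)
{
    return {sendcount, sendcount * static_cast<std::size_t>(n_ranks)};
}

// ReduceScatter (assumed convention): send buffer holds one chunk per rank.
constexpr Counts ReduceScatterCounts(std::size_t recvcount, int n_ranks)
{
    return {recvcount * static_cast<std::size_t>(n_ranks), recvcount};
}

}  // namespace sizing
```

These are element counts; multiplying by the element size of the chosen DataType gives the byte size to request from Allocate.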

Outputs

Name	Type	Description
DeviceComm	DeviceComm	Smart-pointer wrapper around the device communicator implementation

Usage Examples

#include "src/turbomind/comm/device_comm.h"

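// n_ranks, rank, h_comm, the device buffers, group, and stream are assumed to be set up by the caller.
auto d_comm = turbomind::comm::CreateDeviceCommunicator("nccl", n_ranks, rank, h_comm);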

// Perform an all-reduce sum on GPU
d_comm->AllReduceSum(send_buf, recv_buf, count, turbomind::kFloat16, group, stream);

// Perform an all-gather on GPU
d_comm->AllGather(send_buf, recv_buf, send_count, turbomind::kFloat16, group, stream);
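For readers unfamiliar with the two collectives used above, this CPU-only reference model shows the result each rank is expected to see. It follows the standard definitions of all-reduce and all-gather; the ref namespace, the Buffer alias, and the host-side loop are assumptions made for illustration and do not touch the GPU or the real DeviceComm API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace ref {

using Buffer = std::vector<float>;

// All-reduce sum: every rank ends up with the element-wise sum of all send buffers.
// `send[r]` is the send buffer of rank r; the result holds one receive buffer per rank.
inline std::vector<Buffer> AllReduceSum(const std::vector<Buffer>& send)
{
    const std::size_t count = send.empty() ? 0 : send[0].size();
    Buffer            sum(count, 0.f);
    for (const auto& buf : send) {
        assert(buf.size() == count);
        for (std::size_t i = 0; i < count; ++i) {
            sum[i] += buf[i];
        }
    }
    return std::vector<Buffer>(send.size(), sum);
}

// All-gather: every rank ends up with all send buffers concatenated in rank order,
// so each receive buffer holds sendcount * n_ranks elements.
inline std::vector<Buffer> AllGather(const std::vector<Buffer>& send)
{
    Buffer all;
    for (const auto& buf : send) {
        all.insert(all.end(), buf.begin(), buf.end());
    }
    return std::vector<Buffer>(send.size(), all);
}

}  // namespace ref
```

A model like this can serve as a host-side oracle when checking a backend's output in a test, after copying the device buffers back to the host.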

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime
