Implementation: SGLang Quick AllReduce
| Knowledge Sources | |
|---|---|
| Domains | GPU Communication, Distributed Computing |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
High-level allreduce API and kernel dispatch for AMD ROCm GPUs using the QuickReduce two-shot algorithm with IPC buffer management.
Description
quick_all_reduce.h defines the DeviceComms class in the quickreduce namespace for managing multi-GPU communication state on AMD ROCm GPUs. It tracks IPC (Inter-Process Communication) buffer lists, flag offsets, and device memory allocations for up to kMaxWorldSize = 8 GPUs.
The header implements the allreduce_prototype_twoshot kernel launcher that dispatches to codec-specific allreduce implementations based on world size (2, 4, or 8 GPUs) and quantization level (F16, INT8, INT6, INT4). The TWOSHOT_DISPATCH macro instantiates templates for different world sizes and line codecs (CodecFP, CodecQ8, CodecQ6, CodecQ4).
The DeviceComms class provides:
- init(): Allocates HIP device memory for flags and data buffers, creates IPC handles
- open_ipc_handles(): Opens cross-GPU memory access via hipIpcOpenMemHandle
- allreduce(): Template method dispatching to the appropriate kernel based on quantization level
- destroy(): Cleans up IPC handles and device memory
The maximum problem size is 2 GB (configurable via the optional max_problem_size argument to init()), and the buffer allocation reserves both flags and data regions to support two-stage (two-shot) communication.
Usage
Use this allreduce implementation for low-latency multi-GPU tensor reduction on AMD ROCm hardware. It provides deterministic allreduce with lower latency than NCCL for small-to-medium message sizes during distributed inference.
Code Reference
Source Location
- Repository: Sgl_project_Sglang
- File: sgl-kernel/csrc/allreduce/quick_all_reduce.h
- Lines: 1-237
Signature
namespace quickreduce {

template <typename AllReduceKernel, typename T>
__global__ static void allreduce_prototype_twoshot(
    T const* A, T* B, uint32_t N, uint32_t num_blocks,
    int rank, uint8_t** dbuffer_list, uint32_t data_offset,
    uint32_t flag_color, int64_t data_size_per_phase);

enum QuickReduceQuantLevel { F16 = 0, INT8 = 1, INT6 = 2, INT4 = 3 };

struct DeviceComms {
  void init(int world_size, int rank, std::optional<int64_t> max_problem_size = std::nullopt);
  void destroy();
  void open_ipc_handles(std::vector<hipIpcMemHandle_t> const& ipc_handles);

  template <typename T, bool cast_bf2half>
  void allreduce(T const* A, T* B, uint32_t N, int quant_level, hipStream_t stream);

  int get_world_size();
  int get_rank();
  bool status();
  hipIpcMemHandle_t const get_handle();
};

}  // namespace quickreduce
Import
#include <hip/hip_runtime.h>
#include <optional>
#include <vector>
#include "quick_all_reduce.cuh"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| A | T const* | Yes | Input tensor data pointer on device |
| B | T* | Yes | Output tensor data pointer on device (can alias A) |
| N | uint32_t | Yes | Number of elements in the tensor |
| quant_level | int | Yes | Quantization level (F16=0, INT8=1, INT6=2, INT4=3) |
| stream | hipStream_t | Yes | HIP stream for asynchronous execution |
| world_size | int | Yes (init) | Number of GPUs participating (2, 4, or 8) |
| rank | int | Yes (init) | Rank of the current GPU |
Outputs
| Name | Type | Description |
|---|---|---|
| B | T* | Reduced tensor with sum across all ranks, written to the output pointer |
| buffer_ipc_handle | hipIpcMemHandle_t | IPC handle for cross-process buffer sharing |
Usage Examples
#include <hip/hip_runtime.h>
#include "quick_all_reduce.h"

quickreduce::DeviceComms comms;
comms.init(/*world_size=*/4, /*rank=*/0);

// Exchange IPC handles between processes out of band, then:
comms.open_ipc_handles(all_handles);

// Perform allreduce on half-precision data without quantization
comms.allreduce<half, /*cast_bf2half=*/false>(input_ptr, output_ptr, num_elements,
                                              quickreduce::QuickReduceQuantLevel::F16, stream);

// Release IPC handles and device memory when finished
comms.destroy();