Implementation: SGLang Quick AllReduce
| Knowledge Sources | |
|---|---|
| Domains | GPU Communication, Distributed Computing |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
High-level allreduce API and kernel dispatch for AMD ROCm GPUs using the QuickReduce two-shot algorithm with IPC buffer management.
Description
quick_all_reduce.h defines the DeviceComms class in the quickreduce namespace for managing multi-GPU communication state on AMD ROCm GPUs. It tracks IPC (Inter-Process Communication) buffer lists, flag offsets, and device memory allocations for up to kMaxWorldSize = 8 GPUs.
The header implements the allreduce_prototype_twoshot kernel launcher that dispatches to codec-specific allreduce implementations based on world size (2, 4, or 8 GPUs) and quantization level (F16, INT8, INT6, INT4). The TWOSHOT_DISPATCH macro instantiates templates for different world sizes and line codecs (CodecFP, CodecQ8, CodecQ6, CodecQ4).
The DeviceComms class provides:
- init(): Allocates HIP device memory for flags and data buffers, creates IPC handles
- open_ipc_handles(): Opens cross-GPU memory access via hipIpcOpenMemHandle
- allreduce(): Template method dispatching to the appropriate kernel based on quantization level
- destroy(): Cleans up IPC handles and device memory
The maximum problem size is 2 GB (configurable via the optional max_problem_size argument to init()), and the buffer allocation reserves both flags and data regions to support two-stage (two-shot) communication.
Usage
Use this allreduce implementation for low-latency multi-GPU tensor reduction on AMD ROCm hardware. It provides deterministic allreduce with lower latency than NCCL for small-to-medium message sizes during distributed inference.
Code Reference
Source Location
- Repository: Sgl_project_Sglang
- File: sgl-kernel/csrc/allreduce/quick_all_reduce.h
- Lines: 1-237
Signature
namespace quickreduce {

template <typename AllReduceKernel, typename T>
__global__ static void allreduce_prototype_twoshot(
    T const* A, T* B, uint32_t N, uint32_t num_blocks,
    int rank, uint8_t** dbuffer_list, uint32_t data_offset,
    uint32_t flag_color, int64_t data_size_per_phase);

enum QuickReduceQuantLevel { F16 = 0, INT8 = 1, INT6 = 2, INT4 = 3 };

struct DeviceComms {
  void init(int world_size, int rank, std::optional<int64_t> max_problem_size = std::nullopt);
  void destroy();
  void open_ipc_handles(std::vector<hipIpcMemHandle_t> const& ipc_handles);

  template <typename T, bool cast_bf2half>
  void allreduce(T const* A, T* B, uint32_t N, int quant_level, hipStream_t stream);

  int get_world_size();
  int get_rank();
  bool status();
  hipIpcMemHandle_t const get_handle();
};

}  // namespace quickreduce
Import
#include <hip/hip_runtime.h>
#include <optional>
#include <vector>
#include "quick_all_reduce.cuh"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| A | T const* | Yes | Input tensor data pointer on device |
| B | T* | Yes | Output tensor data pointer on device (can alias A) |
| N | uint32_t | Yes | Number of elements in the tensor |
| quant_level | int | Yes | Quantization level (F16=0, INT8=1, INT6=2, INT4=3) |
| stream | hipStream_t | Yes | HIP stream for asynchronous execution |
| world_size | int | Yes (init) | Number of GPUs participating (2, 4, or 8) |
| rank | int | Yes (init) | Rank of the current GPU |
Outputs
| Name | Type | Description |
|---|---|---|
| B | T* | Reduced tensor with sum across all ranks, written to the output pointer |
| buffer_ipc_handle | hipIpcMemHandle_t | IPC handle for cross-process buffer sharing |
Usage Examples
#include <hip/hip_runtime.h>
#include "quick_all_reduce.h"

quickreduce::DeviceComms comms;
comms.init(/*world_size=*/4, /*rank=*/0);

// Exchange IPC handles between processes out of band, then:
comms.open_ipc_handles(all_handles);

// Perform allreduce on half-precision data without quantization
comms.allreduce<half, /*cast_bf2half=*/false>(input_ptr, output_ptr, num_elements,
                                              quickreduce::QuickReduceQuantLevel::F16, stream);

// Release IPC handles and device memory when finished
comms.destroy();