
Implementation:Sgl project Sglang Quick AllReduce

From Leeroopedia


Knowledge Sources
Domains GPU Communication, Distributed Computing
Last Updated 2026-02-10 00:00 GMT

Overview

High-level allreduce API and kernel dispatch for AMD ROCm GPUs using the QuickReduce two-shot algorithm with IPC buffer management.

Description

quick_all_reduce.h defines the DeviceComms class within the quickreduce namespace for managing multi-GPU communication state on AMD ROCm GPUs. The class manages IPC (Inter-Process Communication) buffer lists, flag offsets, and device memory allocation for up to kMaxWorldSize = 8 GPUs.

The header implements the allreduce_prototype_twoshot kernel launcher that dispatches to codec-specific allreduce implementations based on world size (2, 4, or 8 GPUs) and quantization level (F16, INT8, INT6, INT4). The TWOSHOT_DISPATCH macro instantiates templates for different world sizes and line codecs (CodecFP, CodecQ8, CodecQ6, CodecQ4).

The DeviceComms class provides:

  • init(): Allocates HIP device memory for flags and data buffers, creates IPC handles
  • open_ipc_handles(): Opens cross-GPU memory access via hipIpcOpenMemHandle
  • allreduce(): Template method dispatching to the appropriate kernel based on quantization level
  • destroy(): Cleans up IPC handles and device memory

The maximum problem size is 2GB, and the buffer allocation supports two-stage (two-shot) communication with both flags and data regions.

Usage

Use this implementation for low-latency multi-GPU tensor reduction on AMD ROCm hardware. It provides a deterministic allreduce and, for the small-to-medium message sizes typical of distributed inference, lower latency than NCCL.

Code Reference

Source Location

Signature

namespace quickreduce {

template <typename AllReduceKernel, typename T>
__global__ static void allreduce_prototype_twoshot(
    T const* A, T* B, uint32_t N, uint32_t num_blocks,
    int rank, uint8_t** dbuffer_list, uint32_t data_offset,
    uint32_t flag_color, int64_t data_size_per_phase);

enum QuickReduceQuantLevel { F16 = 0, INT8 = 1, INT6 = 2, INT4 = 3 };

struct DeviceComms {
    void init(int world_size, int rank, std::optional<int64_t> max_problem_size = std::nullopt);
    void destroy();
    void open_ipc_handles(std::vector<hipIpcMemHandle_t> const& ipc_handles);
    template <typename T, bool cast_bf2half>
    void allreduce(T const* A, T* B, uint32_t N, int quant_level, hipStream_t stream);
    int get_world_size();
    int get_rank();
    bool status();
    hipIpcMemHandle_t const get_handle();
};

}  // namespace quickreduce

Import

#include <hip/hip_runtime.h>
#include <vector>
#include "quick_all_reduce.h"

I/O Contract

Inputs

Name Type Required Description
A T const* Yes Input tensor data pointer on device
B T* Yes Output tensor data pointer on device (can alias A)
N uint32_t Yes Number of elements in the tensor
quant_level int Yes Quantization level (F16=0, INT8=1, INT6=2, INT4=3)
stream hipStream_t Yes HIP stream for asynchronous execution
world_size int Yes (init) Number of GPUs participating (2, 4, or 8)
rank int Yes (init) Rank of the current GPU

Outputs

Name Type Description
B T* Reduced tensor with sum across all ranks, written to the output pointer
buffer_ipc_handle hipIpcMemHandle_t IPC handle for cross-process buffer sharing

Usage Examples

#include <hip/hip_runtime.h>
#include "quick_all_reduce.h"

quickreduce::DeviceComms comms;
comms.init(/*world_size=*/4, /*rank=*/0);

// Publish this rank's IPC handle, exchange handles with all peers
// (e.g., over MPI or torch.distributed), then open the peers' handles:
hipIpcMemHandle_t my_handle = comms.get_handle();
// ... all_handles = the handles gathered from every rank ...
comms.open_ipc_handles(all_handles);

// Perform allreduce on half-precision data on a HIP stream
comms.allreduce<half, false>(input_ptr, output_ptr, num_elements,
                             quickreduce::QuickReduceQuantLevel::F16, stream);

comms.destroy();
