Implementation: Sgl project Sglang Quick AllReduce Base
| Knowledge Sources | |
|---|---|
| Domains | GPU Communication, AMD ROCm, HIP |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
A C++ header providing low-level primitives, constants, and architecture-specific intrinsics for the QuickReduce allreduce algorithm on AMD CDNA GPUs (MI300 series).
Description
quick_all_reduce_base.h is the foundation header for high-performance inter-GPU allreduce communication on AMD hardware. It provides:
Architecture-Specific Memory Ordering: MUBUF acquire/release semantics for vector memory reads, selected per architecture:
- gfx942 (CDNA3/MI300): Uses the scope bits sc0/sc1, with MUBUF_ACQUIRE = 16 and MUBUF_RELEASE = 16
- gfx908/gfx90a (CDNA1/CDNA2): Uses the GLC bit, with MUBUF_ACQUIRE = 1 and MUBUF_RELEASE = 0
Core Constants:
- kBlockSize = 256: Threads per workgroup
- kAtoms = 8: 4 x f16x2 (int32x4_t) atoms per thread
- kTileSize: 256 threads x 8 atoms x 16 bytes = 32 KB tile per workgroup
- kMaxNumBlocks = 1216: 304 CUs on MI300 x 4 concurrent blocks
- kWavefront = 64: Standard CDNA wavefront size
- kThreadGroupSize = 8: For FP16 quantization (32 elements per block)
Buffer Resource Descriptor: The BufferResource union encodes AMD GPU buffer resource descriptors with 48-bit address, stride, byte range, and format configuration. This maps directly to the hardware SRD (Shader Resource Descriptor) format.
Low-Level Intrinsics:
- buffer_load_dwordx4 / buffer_store_dwordx4: Mapped to the LLVM llvm.amdgcn.raw.buffer.load/store.v4i32 intrinsics for 128-bit vector loads/stores
- set_fp16_ovfl: Configures FP16 overflow behavior via s_setreg_imm32_b32 (gfx942 only)
Packed Arithmetic Templates: Template-specialized operations for both half and nv_bfloat16 types:
- packed_assign_add: In-place addition using v_pk_add_f16 or __hadd2
- packed_max / packed_min: Element-wise max/min via v_pk_max_f16 / v_pk_min_f16
- packed_abs_max: Absolute-value maximum for quantization scale computation
- packed_add / packed_sub / packed_mul: Arithmetic on packed f16x2 or bf16x2 values
- packed_rcp: Packed reciprocal via h2rcp
Synchronization:
- set_sync_flag: Atomic store with release semantics
- wait_sync_flag: Spin-wait with relaxed load semantics
- group_abs_max: Wavefront-level reduction using __shfl_down and __shfl for computing per-group quantization scales
Usage
This header is included by the QuickReduce kernel implementations (one-shot and two-shot variants). It provides all the building blocks needed for implementing direct GPU-to-GPU allreduce without host-side synchronization.
Code Reference
Source Location
- Repository: Sgl_project_Sglang
- File: sgl-kernel/csrc/allreduce/quick_all_reduce_base.h
- Lines: 1-319
Signature
namespace quickreduce {
// Architecture-specific memory ordering
#define MUBUF_ACQUIRE 16 // gfx942
#define MUBUF_RELEASE 16 // gfx942
// Core constants
static constexpr int kAtoms = 8;
static constexpr int kBlockSize = 256;
static constexpr int kTileSize = kBlockSize * kAtoms * sizeof(int32x4_t);
static constexpr int kMaxNumBlocks = 304 * 4;
static constexpr int kWavefront = 64;
static constexpr int kThreadGroupSize = 8;
// Buffer resource descriptor
union BufferResource {
int32x4_t descriptor;
struct { void* address; uint32_t range; uint32_t config; };
};
// Intrinsics
static int32x4_t buffer_load_dwordx4(int32x4_t srsrc, int32_t voffset, int32_t soffset, int32_t aux);
static void buffer_store_dwordx4(int32x4_t data, int32x4_t srsrc, int32_t voffset, int32_t soffset, int32_t aux);
// Packed arithmetic
template <typename T> void packed_assign_add(int32x4_t* A, int32x4_t* B);
template <typename T> int packed_max(int a, int b);
template <typename T> int packed_min(int a, int b);
template <typename T> int packed_abs_max(int a, int b);
template <typename T> int packed_add(int a, int b);
template <typename T> int packed_sub(int a, int b);
template <typename T> int packed_mul(int a, int b);
template <typename T> int packed_rcp(int a);
// Group reduction
template <typename T> int group_abs_max(int32x4_t atom);
// Synchronization
void set_sync_flag(uint32_t* flag_ptr, uint32_t flag);
void wait_sync_flag(uint32_t* flag_ptr, uint32_t flag);
} // namespace quickreduce
Import
#include "quick_all_reduce_base.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| int32x4_t* A | int32x4_t pointer | Yes | Source/destination packed vector for accumulation |
| int32x4_t* B | int32x4_t pointer | Yes | Source packed vector for accumulation |
| int a, b | int (packed f16x2 or bf16x2) | Yes | Packed half-precision pairs for arithmetic |
| uint32_t* flag_ptr | uint32_t pointer | Yes | Pointer to synchronization flag in GPU memory |
| uint32_t flag | uint32_t | Yes | Expected flag value for synchronization |
Outputs
| Name | Type | Description |
|---|---|---|
| Packed arithmetic results | int | Packed f16x2 or bf16x2 result of the arithmetic operation |
| group_abs_max result | int | Per-group maximum absolute value for quantization scaling |
| Synchronization | void | Side effect: atomic flag update or spin-wait completion |
Usage Examples
Packed BF16 Addition
int32x4_t accumulator, data;
// ... load data ...
quickreduce::packed_assign_add<nv_bfloat16>(&accumulator, &data);
Synchronization Pattern
// Producer sets flag after writing data
quickreduce::set_sync_flag(flag_ptr, iteration);
// Consumer waits for flag before reading data
quickreduce::wait_sync_flag(flag_ptr, iteration);
Group Quantization Scale
int32x4_t atom = buffer_load_dwordx4(rsrc, voffset, soffset, MUBUF_ACQUIRE);
int scale = quickreduce::group_abs_max<half>(atom);