Implementation: Sgl project Sglang Quick AllReduce Base
| Knowledge Sources | |
|---|---|
| Domains | GPU Communication, AMD ROCm, HIP |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
A C++ header providing low-level primitives, constants, and architecture-specific intrinsics for the QuickReduce allreduce algorithm on AMD CDNA GPUs (MI300 series).
Description
quick_all_reduce_base.h is the foundation header for high-performance inter-GPU allreduce communication on AMD hardware. It provides:
Architecture-Specific Memory Ordering: MUBUF acquire/release semantics for vector memory reads, selected per architecture:
- gfx942 (CDNA3/MI300): Uses the scope bits sc0/sc1, with MUBUF_ACQUIRE = 16 and MUBUF_RELEASE = 16
- gfx908/gfx90a (CDNA1/CDNA2): Uses the GLC bit, with MUBUF_ACQUIRE = 1 and MUBUF_RELEASE = 0
Core Constants:
- kBlockSize = 256: Threads per workgroup
- kAtoms = 8: 4 x f16x2 (int32x4_t) atoms per thread
- kTileSize: 256 threads x 8 atoms x 16 bytes = 32 KB tile per workgroup
- kMaxNumBlocks = 1216: 304 CUs on MI300 x 4 concurrent blocks
- kWavefront = 64: Standard CDNA wavefront size
- kThreadGroupSize = 8: For FP16 quantization (32 elements per block)
Buffer Resource Descriptor: The BufferResource union encodes AMD GPU buffer resource descriptors with 48-bit address, stride, byte range, and format configuration. This maps directly to the hardware SRD (Shader Resource Descriptor) format.
Low-Level Intrinsics:
- buffer_load_dwordx4 / buffer_store_dwordx4: Mapped to the LLVM llvm.amdgcn.raw.buffer.load/store.v4i32 intrinsics for 128-bit vector loads/stores
- set_fp16_ovfl: Configures FP16 overflow behavior via s_setreg_imm32_b32 (gfx942 only)
Packed Arithmetic Templates: Template-specialized operations for both half and nv_bfloat16 types:
- packed_assign_add: In-place addition using v_pk_add_f16 or __hadd2
- packed_max / packed_min: Element-wise max/min via v_pk_max_f16 / v_pk_min_f16
- packed_abs_max: Absolute-value maximum for quantization scale computation
- packed_add / packed_sub / packed_mul: Arithmetic on packed f16x2 or bf16x2 values
- packed_rcp: Packed reciprocal via h2rcp
Synchronization:
- set_sync_flag: Atomic store with release semantics
- wait_sync_flag: Spin-wait with relaxed load semantics
- group_abs_max: Wavefront-level reduction using __shfl_down and __shfl for computing per-group quantization scales
Usage
This header is included by the QuickReduce kernel implementations (one-shot and two-shot variants). It provides all the building blocks needed for implementing direct GPU-to-GPU allreduce without host-side synchronization.
Code Reference
Source Location
- Repository: Sgl_project_Sglang
- File: sgl-kernel/csrc/allreduce/quick_all_reduce_base.h
- Lines: 1-319
Signature
namespace quickreduce {
// Architecture-specific memory ordering
#define MUBUF_ACQUIRE 16 // gfx942
#define MUBUF_RELEASE 16 // gfx942
// Core constants
static constexpr int kAtoms = 8;
static constexpr int kBlockSize = 256;
static constexpr int kTileSize = kBlockSize * kAtoms * sizeof(int32x4_t);
static constexpr int kMaxNumBlocks = 304 * 4;
static constexpr int kWavefront = 64;
static constexpr int kThreadGroupSize = 8;
// Buffer resource descriptor
union BufferResource {
int32x4_t descriptor;
struct { void* address; uint32_t range; uint32_t config; };
};
// Intrinsics
static int32x4_t buffer_load_dwordx4(int32x4_t srsrc, int32_t voffset, int32_t soffset, int32_t aux);
static void buffer_store_dwordx4(int32x4_t data, int32x4_t srsrc, int32_t voffset, int32_t soffset, int32_t aux);
// Packed arithmetic
template <typename T> void packed_assign_add(int32x4_t* A, int32x4_t* B);
template <typename T> int packed_max(int a, int b);
template <typename T> int packed_min(int a, int b);
template <typename T> int packed_abs_max(int a, int b);
template <typename T> int packed_add(int a, int b);
template <typename T> int packed_sub(int a, int b);
template <typename T> int packed_mul(int a, int b);
template <typename T> int packed_rcp(int a);
// Group reduction
template <typename T> int group_abs_max(int32x4_t atom);
// Synchronization
void set_sync_flag(uint32_t* flag_ptr, uint32_t flag);
void wait_sync_flag(uint32_t* flag_ptr, uint32_t flag);
} // namespace quickreduce
Import
#include "quick_all_reduce_base.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| int32x4_t* A | int32x4_t pointer | Yes | Source/destination packed vector for accumulation |
| int32x4_t* B | int32x4_t pointer | Yes | Source packed vector for accumulation |
| int a, b | int (packed f16x2 or bf16x2) | Yes | Packed half-precision pairs for arithmetic |
| uint32_t* flag_ptr | uint32_t pointer | Yes | Pointer to synchronization flag in GPU memory |
| uint32_t flag | uint32_t | Yes | Expected flag value for synchronization |
Outputs
| Name | Type | Description |
|---|---|---|
| Packed arithmetic results | int | Packed f16x2 or bf16x2 result of the arithmetic operation |
| group_abs_max result | int | Per-group maximum absolute value for quantization scaling |
| Synchronization | void | Side effect: atomic flag update or spin-wait completion |
Usage Examples
Packed BF16 Addition
int32x4_t accumulator, data;
// ... load data ...
quickreduce::packed_assign_add<nv_bfloat16>(&accumulator, &data);
Synchronization Pattern
// Producer sets flag after writing data
quickreduce::set_sync_flag(flag_ptr, iteration);
// Consumer waits for flag before reading data
quickreduce::wait_sync_flag(flag_ptr, iteration);
Group Quantization Scale
int32x4_t atom = buffer_load_dwordx4(rsrc, voffset, soffset, MUBUF_ACQUIRE);
int scale = quickreduce::group_abs_max<half>(atom);