Implementation:Vllm project Vllm QuickReduce Base
| Knowledge Sources | |
|---|---|
| Domains | AllReduce, ROCm, Collective_Communication |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Defines ROCm/HIP-specific constants, buffer resource descriptors, vectorized arithmetic primitives, and memory coherence semantics for efficient all-reduce operations on AMD CDNA GPUs.
Description
This header provides the foundational types and device functions for the QuickReduce collective communication library targeting AMD GPUs (MI200, MI300 series). It defines architecture-specific memory coherence controls (MUBUF_ACQUIRE/MUBUF_RELEASE for CDNA1/2/3), buffer resource descriptors (BufferResource union for MUBUF instructions), and tiling constants (kBlockSize = 256 threads, kAtoms = 8 per thread, kTileSize = 32KB per workgroup). The file also implements packed vectorized arithmetic templates (packed_assign_add, packed_max, packed_min, packed_abs_max, packed_add, packed_sub) specialized for both half and nv_bfloat16 types using AMD GCN inline assembly (v_pk_add_f16, v_pk_max_f16, etc.).
Usage
This header is included by QuickReduce kernel implementations for AMD ROCm builds. It is compiled when building vLLM with ROCm support and is used during multi-GPU inference for tensor-parallel all-reduce operations on AMD CDNA architectures.
Code Reference
Source Location
- Repository: vllm
- File: csrc/quickreduce/base.h
- Lines: 1-338
Signature
namespace quickreduce {
// Architecture-specific memory coherence
#if defined(__gfx942__)
#define MUBUF_ACQUIRE 16
#define MUBUF_RELEASE 16
#elif (defined(__gfx908__) || defined(__gfx90a__))
#define MUBUF_ACQUIRE 1
#define MUBUF_RELEASE 0
#endif
// Tiling constants
static constexpr int kAtoms = 8;
static constexpr int kBlockSize = 256;
static constexpr int kTileSize = kBlockSize * kAtoms * sizeof(int32x4_t);
static constexpr int kMaxNumBlocks = 304 * 4;
static constexpr int kWavefront = 64;
// Buffer resource descriptor for MUBUF instructions
union BufferResource {
int32x4_t descriptor;
struct {
void* address;
uint32_t range;
uint32_t config;
};
};
// Vectorized buffer operations
static int32x4_t buffer_load_dwordx4(
int32x4_t srsrc, int32_t voffset,
int32_t soffset, int32_t aux);
static void buffer_store_dwordx4(
int32x4_t data, int32x4_t srsrc,
int32_t voffset, int32_t soffset, int32_t aux);
// Packed arithmetic templates
template <typename T>
__device__ __forceinline__ void packed_assign_add(
int32x4_t* A, int32x4_t* B);
template <typename T>
__device__ __forceinline__ int packed_max(int a, int b);
template <typename T>
__device__ __forceinline__ int packed_min(int a, int b);
template <typename T>
__device__ __forceinline__ int packed_abs_max(int a, int b);
template <typename T>
__device__ __forceinline__ int packed_add(int a, int b);
template <typename T>
__device__ __forceinline__ int packed_sub(int a, int b);
} // namespace quickreduce
Import
#include "base.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| A | int32x4_t* | Yes | Destination vector register (4 x int32, representing 8 packed fp16 or bf16 values) |
| B | int32x4_t* | Yes | Source vector register to add/combine with A |
| a, b | int | Yes | Packed fp16x2 or bf16x2 values for element-wise operations (max, min, add, sub) |
| srsrc | int32x4_t | Yes | Buffer resource descriptor for MUBUF load/store operations |
| voffset | int32_t | Yes | Vector offset for buffer addressing |
| soffset | int32_t | Yes | Scalar offset for buffer addressing |
Outputs
| Name | Type | Description |
|---|---|---|
| A (modified) | int32x4_t* | Result of packed_assign_add written back to the destination register |
| result | int | Packed fp16x2/bf16x2 result of element-wise operations |
| loaded data | int32x4_t | 128-bit data loaded from buffer via buffer_load_dwordx4 |
Usage Examples
// Inside a QuickReduce kernel on AMD GPU
using namespace quickreduce;
// Create buffer resource for remote GPU memory
BufferResource rsrc(remote_ptr, buffer_size);
// Load 128 bits from remote buffer
int32x4_t data = buffer_load_dwordx4(
rsrc.descriptor, thread_offset, 0, MUBUF_ACQUIRE);
// Packed FP16 addition of two vectors
int32x4_t local_data = ...;
packed_assign_add<half>(&local_data, &data);
// Store result back
buffer_store_dwordx4(
local_data, rsrc.descriptor, thread_offset, 0, MUBUF_RELEASE);