Implementation:Vllm project Vllm QuickReduce Base

Knowledge Sources	vllm
Domains	AllReduce, ROCm, Collective_Communication
Last Updated	2026-02-08 00:00 GMT

Overview

Defines ROCm/HIP-specific constants, buffer resource descriptors, vectorized arithmetic primitives, and memory coherence semantics for efficient all-reduce operations on AMD CDNA GPUs.

Description

This header provides the foundational types and device functions for the QuickReduce collective communication library targeting AMD GPUs (MI200, MI300 series). It defines architecture-specific memory coherence controls (MUBUF_ACQUIRE/MUBUF_RELEASE for CDNA1/2/3), buffer resource descriptors (BufferResource union for MUBUF instructions), and tiling constants (kBlockSize = 256 threads, kAtoms = 8 per thread, kTileSize = 32KB per workgroup). The file also implements packed vectorized arithmetic templates (packed_assign_add, packed_max, packed_min, packed_abs_max, packed_add, packed_sub) specialized for both half and nv_bfloat16 types using AMD GCN inline assembly (v_pk_add_f16, v_pk_max_f16, etc.).

Usage

This header is included by QuickReduce kernel implementations for AMD ROCm builds. It is compiled when building vLLM with ROCm support and is used during multi-GPU inference for tensor-parallel all-reduce operations on AMD CDNA architectures.

Code Reference

Source Location

Repository: vllm
File: csrc/quickreduce/base.h
Lines: 1-338

Signature

namespace quickreduce {

// Architecture-specific memory coherence
#if defined(__gfx942__)
  #define MUBUF_ACQUIRE 16
  #define MUBUF_RELEASE 16
#elif (defined(__gfx908__) || defined(__gfx90a__))
  #define MUBUF_ACQUIRE 1
  #define MUBUF_RELEASE 0
#endif

// Tiling constants
static constexpr int kAtoms = 8;
static constexpr int kBlockSize = 256;
static constexpr int kTileSize = kBlockSize * kAtoms * sizeof(int32x4_t);
static constexpr int kMaxNumBlocks = 304 * 4;
static constexpr int kWavefront = 64;

// Buffer resource descriptor for MUBUF instructions
union BufferResource {
    int32x4_t descriptor;
    struct {
        void* address;
        uint32_t range;
        uint32_t config;
    };
};

// Vectorized buffer operations
static int32x4_t buffer_load_dwordx4(
    int32x4_t srsrc, int32_t voffset,
    int32_t soffset, int32_t aux);
static void buffer_store_dwordx4(
    int32x4_t data, int32x4_t srsrc,
    int32_t voffset, int32_t soffset, int32_t aux);

// Packed arithmetic templates
template <typename T>
__device__ __forceinline__ void packed_assign_add(
    int32x4_t* A, int32x4_t* B);
template <typename T>
__device__ __forceinline__ int packed_max(int a, int b);
template <typename T>
__device__ __forceinline__ int packed_min(int a, int b);
template <typename T>
__device__ __forceinline__ int packed_abs_max(int a, int b);
template <typename T>
__device__ __forceinline__ int packed_add(int a, int b);
template <typename T>
__device__ __forceinline__ int packed_sub(int a, int b);

} // namespace quickreduce

Import

#include "base.h"

I/O Contract

Inputs

Name	Type	Required	Description
A	int32x4_t*	Yes	Destination vector register (4 x int32, representing 8 packed fp16 or bf16 values)
B	int32x4_t*	Yes	Source vector register to add/combine with A
a, b	int	Yes	Packed fp16x2 or bf16x2 values for element-wise operations (max, min, add, sub)
srsrc	int32x4_t	Yes	Buffer resource descriptor for MUBUF load/store operations
voffset	int32_t	Yes	Vector offset for buffer addressing
soffset	int32_t	Yes	Scalar offset for buffer addressing

Outputs

Name	Type	Description
A (modified)	int32x4_t*	Result of packed_assign_add written back to the destination register
result	int	Packed fp16x2/bf16x2 result of element-wise operations
loaded data	int32x4_t	128-bit data loaded from buffer via buffer_load_dwordx4

Usage Examples

// Inside a QuickReduce kernel on AMD GPU
using namespace quickreduce;

// Create buffer resource for remote GPU memory
BufferResource rsrc(remote_ptr, buffer_size);

// Load 128 bits from remote buffer
int32x4_t data = buffer_load_dwordx4(
    rsrc.descriptor, thread_offset, 0, MUBUF_ACQUIRE);

// Packed FP16 addition of two vectors
int32x4_t local_data = ...;
packed_assign_add<half>(&local_data, &data);

// Store result back
buffer_store_dwordx4(
    local_data, rsrc.descriptor, thread_offset, 0, MUBUF_RELEASE);

Related Pages

Environment:Vllm_project_Vllm_CUDA_GPU_Runtime

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment