Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Vllm project Vllm QuickReduce Base

From Leeroopedia


Knowledge Sources
Domains AllReduce, ROCm, Collective_Communication
Last Updated 2026-02-08 00:00 GMT

Overview

Defines ROCm/HIP-specific constants, buffer resource descriptors, vectorized arithmetic primitives, and memory coherence semantics for efficient all-reduce operations on AMD CDNA GPUs.

Description

This header provides the foundational types and device functions for the QuickReduce collective communication library targeting AMD GPUs (MI200, MI300 series). It defines architecture-specific memory coherence controls (MUBUF_ACQUIRE/MUBUF_RELEASE for CDNA1/2/3), buffer resource descriptors (BufferResource union for MUBUF instructions), and tiling constants (kBlockSize = 256 threads, kAtoms = 8 per thread, kTileSize = 32KB per workgroup). The file also implements packed vectorized arithmetic templates (packed_assign_add, packed_max, packed_min, packed_abs_max, packed_add, packed_sub) specialized for both half and nv_bfloat16 types using AMD GCN inline assembly (v_pk_add_f16, v_pk_max_f16, etc.).

Usage

This header is included by QuickReduce kernel implementations for AMD ROCm builds. It is compiled when building vLLM with ROCm support and is used during multi-GPU inference for tensor-parallel all-reduce operations on AMD CDNA architectures.

Code Reference

Source Location

Signature

namespace quickreduce {

// Architecture-specific memory coherence
#if defined(__gfx942__)
  #define MUBUF_ACQUIRE 16
  #define MUBUF_RELEASE 16
#elif (defined(__gfx908__) || defined(__gfx90a__))
  #define MUBUF_ACQUIRE 1
  #define MUBUF_RELEASE 0
#endif

// Tiling constants
static constexpr int kAtoms = 8;
static constexpr int kBlockSize = 256;
static constexpr int kTileSize = kBlockSize * kAtoms * sizeof(int32x4_t);
static constexpr int kMaxNumBlocks = 304 * 4;
static constexpr int kWavefront = 64;

// Buffer resource descriptor for MUBUF instructions
union BufferResource {
    int32x4_t descriptor;
    struct {
        void* address;
        uint32_t range;
        uint32_t config;
    };
};

// Vectorized buffer operations
static int32x4_t buffer_load_dwordx4(
    int32x4_t srsrc, int32_t voffset,
    int32_t soffset, int32_t aux);
static void buffer_store_dwordx4(
    int32x4_t data, int32x4_t srsrc,
    int32_t voffset, int32_t soffset, int32_t aux);

// Packed arithmetic templates
template <typename T>
__device__ __forceinline__ void packed_assign_add(
    int32x4_t* A, int32x4_t* B);
template <typename T>
__device__ __forceinline__ int packed_max(int a, int b);
template <typename T>
__device__ __forceinline__ int packed_min(int a, int b);
template <typename T>
__device__ __forceinline__ int packed_abs_max(int a, int b);
template <typename T>
__device__ __forceinline__ int packed_add(int a, int b);
template <typename T>
__device__ __forceinline__ int packed_sub(int a, int b);

} // namespace quickreduce

Import

#include "base.h"

I/O Contract

Inputs

Name Type Required Description
A int32x4_t* Yes Destination vector register (4 x int32, representing 8 packed fp16 or bf16 values)
B int32x4_t* Yes Source vector register to add/combine with A
a, b int Yes Packed fp16x2 or bf16x2 values for element-wise operations (max, min, add, sub)
srsrc int32x4_t Yes Buffer resource descriptor for MUBUF load/store operations
voffset int32_t Yes Vector offset for buffer addressing
soffset int32_t Yes Scalar offset for buffer addressing

Outputs

Name Type Description
A (modified) int32x4_t* Result of packed_assign_add written back to the destination register
result int Packed fp16x2/bf16x2 result of element-wise operations
loaded data int32x4_t 128-bit data loaded from buffer via buffer_load_dwordx4

Usage Examples

// Inside a QuickReduce kernel on AMD GPU
using namespace quickreduce;

// Create buffer resource for remote GPU memory
BufferResource rsrc(remote_ptr, buffer_size);

// Load 128 bits from remote buffer
int32x4_t data = buffer_load_dwordx4(
    rsrc.descriptor, thread_offset, 0, MUBUF_ACQUIRE);

// Packed FP16 addition of two vectors
int32x4_t local_data = ...;
packed_assign_add<half>(&local_data, &data);

// Store result back
buffer_store_dwordx4(
    local_data, rsrc.descriptor, thread_offset, 0, MUBUF_RELEASE);

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment