Implementation: vLLM QuickReduce
| Knowledge Sources | Details |
|---|---|
| Domains | AllReduce, Tensor_Parallel, AMD_GPU |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Implements a two-shot quantized all-reduce collective communication kernel for tensor parallelism on AMD CDNA GPUs.
Description
This header defines the DeviceComms struct that manages IPC memory handles and buffer allocation for multi-GPU all-reduce operations using HIP. It provides a templated allreduce method that dispatches to quantized two-shot kernels supporting FP16, INT8, INT6, and INT4 compression levels. The two-shot protocol splits the reduction into scatter-reduce and all-gather phases, with configurable world sizes of 2, 4, or 8 GPUs. Communication buffers are allocated using hipExtMallocWithFlags with uncached memory for low-latency inter-GPU transfers.
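The two-shot dataflow described above can be illustrated on the host. The following is a minimal sketch of the protocol's arithmetic in plain C++ (no HIP, no quantization); the real kernels run on-GPU over IPC-mapped buffers with quantized wire data, and the function name `two_shot_allreduce` is purely illustrative.

```cpp
#include <cstddef>
#include <vector>

// Host-side sketch of the two-shot all-reduce dataflow (illustrative only).
// Phase 1 (scatter-reduce): rank r sums segment r of every rank's input.
// Phase 2 (all-gather): every rank collects the reduced segments, so all
// ranks end up with the same fully reduced vector.
std::vector<float> two_shot_allreduce(
    const std::vector<std::vector<float>>& inputs, size_t world_size) {
  size_t n = inputs[0].size();
  size_t seg = n / world_size;  // assume n is divisible by world_size
  std::vector<float> result(n, 0.0f);
  for (size_t r = 0; r < world_size; ++r) {         // segment owned by rank r
    for (size_t i = r * seg; i < (r + 1) * seg; ++i) {
      float acc = 0.0f;
      for (size_t peer = 0; peer < world_size; ++peer)  // scatter-reduce
        acc += inputs[peer][i];
      result[i] = acc;  // all-gather: reduced segment visible to all ranks
    }
  }
  return result;
}
```

Splitting the reduction this way means each rank only reduces 1/world_size of the data before the gather phase, which is what makes the protocol bandwidth-efficient.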
Usage
Use this when performing tensor-parallel inference on AMD GPUs where a bandwidth-efficient all-reduce is needed. The quantized compression levels (INT4/INT6/INT8) reduce communication volume at the cost of minor precision loss, making them suitable for large hidden-dimension reductions in transformer models.
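To make the precision tradeoff concrete, here is a minimal host-side sketch of symmetric INT8 block quantization, the general kind of compression the quantized levels apply. The assumed scheme (one fp32 scale per block, scale = max|x| / 127) is illustrative; the actual wire format in quick_reduce_impl.cuh packs scales and INT6/INT4 payloads differently.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative symmetric INT8 block quantization (not the kernel's wire
// format): one scale per block, values rounded to [-127, 127].
struct QuantBlock {
  float scale;
  std::vector<int8_t> q;
};

QuantBlock quantize_int8(const std::vector<float>& x) {
  float amax = 0.0f;
  for (float v : x) amax = std::max(amax, std::fabs(v));
  QuantBlock b{amax > 0.0f ? amax / 127.0f : 1.0f, {}};
  b.q.reserve(x.size());
  for (float v : x)
    b.q.push_back(static_cast<int8_t>(std::lround(v / b.scale)));
  return b;
}

std::vector<float> dequantize(const QuantBlock& b) {
  std::vector<float> out;
  out.reserve(b.q.size());
  for (int8_t q : b.q) out.push_back(q * b.scale);
  return out;
}
```

The per-element roundtrip error is bounded by the block scale, which is why the precision loss stays small when values within a block have similar magnitude.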
Code Reference
Source Location
- Repository: vllm
- File: csrc/quickreduce/quick_reduce.h
- Lines: 1-197
Signature
namespace quickreduce {

enum QuickReduceQuantLevel {
  F16 = 0,
  INT8 = 1,
  INT6 = 2,
  INT4 = 3,
};

struct DeviceComms {
  void init(int world_size, int rank,
            std::optional<int64_t> max_problem_size = std::nullopt);
  void destroy();
  void open_ipc_handles(std::vector<hipIpcMemHandle_t> const& ipc_handles);

  template <typename T, bool cast_bf2half>
  void allreduce(T const* A, T* B, uint32_t N, int quant_level,
                 hipStream_t stream);

  int get_world_size();
  int get_rank();
  bool status();
  hipIpcMemHandle_t const get_handle();
};

}  // namespace quickreduce
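Since init accepts only world sizes of 2, 4, or 8 (see the I/O contract below), a caller-side validation helper can fail fast with a clear message. This helper is illustrative and not part of quick_reduce.h:

```cpp
#include <stdexcept>
#include <string>

// Illustrative caller-side check mirroring the documented constraints:
// world_size must be 2, 4, or 8, and rank must lie in [0, world_size).
// Not part of quick_reduce.h.
void validate_comms_args(int world_size, int rank) {
  if (world_size != 2 && world_size != 4 && world_size != 8)
    throw std::invalid_argument("world_size must be 2, 4, or 8, got " +
                                std::to_string(world_size));
  if (rank < 0 || rank >= world_size)
    throw std::invalid_argument("rank out of range: " + std::to_string(rank));
}
```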
Import
#include <hip/hip_runtime.h>
#include "quick_reduce_impl.cuh"
#include "quick_reduce.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| world_size | int | Yes | Number of GPUs participating (must be 2, 4, or 8) |
| rank | int | Yes | Rank of the current GPU in the communication group |
| max_problem_size | std::optional<int64_t> | No | Maximum data size in bytes (default 2GB) |
| A | T const* | Yes | Input tensor data pointer for all-reduce |
| B | T* | Yes | Output tensor data pointer for all-reduce result |
| N | uint32_t | Yes | Number of elements to reduce |
| quant_level | int | Yes | Compression level: 0=F16, 1=INT8, 2=INT6, 3=INT4 |
| stream | hipStream_t | Yes | HIP stream for asynchronous execution |
Outputs
| Name | Type | Description |
|---|---|---|
| B | T* | Output buffer containing the all-reduced result across all ranks |
Usage Examples
// Initialize communication for 8-GPU tensor parallelism
quickreduce::DeviceComms comms;
comms.init(/*world_size=*/8, /*rank=*/my_rank);

// Exchange IPC handles (gathered from all ranks)
comms.open_ipc_handles(all_ipc_handles);

// Perform INT8-quantized all-reduce on hidden states
comms.allreduce<half, /*cast_bf2half=*/false>(
    input_ptr,                                 // source data
    output_ptr,                                // destination data
    num_elements,                              // number of elements
    quickreduce::QuickReduceQuantLevel::INT8,  // quantization level
    hip_stream);                               // HIP stream