Implementation: vLLM QuickReduce
| Knowledge Sources | Details |
|---|---|
| Domains | AllReduce, Tensor_Parallel, AMD_GPU |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Implements a two-shot quantized all-reduce collective communication kernel for tensor parallelism on AMD CDNA GPUs.
Description
This header defines the DeviceComms struct that manages IPC memory handles and buffer allocation for multi-GPU all-reduce operations using HIP. It provides a templated allreduce method that dispatches to quantized two-shot kernels supporting FP16, INT8, INT6, and INT4 compression levels. The two-shot protocol splits the reduction into scatter-reduce and all-gather phases, with configurable world sizes of 2, 4, or 8 GPUs. Communication buffers are allocated using hipExtMallocWithFlags with uncached memory for low-latency inter-GPU transfers.
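The two-shot dataflow described above can be illustrated on the host. The following is a minimal sketch of the protocol's arithmetic in plain C++ (no HIP, no quantization); the real kernels run on-GPU over IPC-mapped buffers with quantized wire data, and the function name `two_shot_allreduce` is purely illustrative.

```cpp
#include <cstddef>
#include <vector>

// Host-side sketch of the two-shot all-reduce dataflow (illustrative only).
// Phase 1 (scatter-reduce): rank r sums segment r of every rank's input.
// Phase 2 (all-gather): every rank collects the reduced segments, so all
// ranks end up with the same fully reduced vector.
std::vector<float> two_shot_allreduce(
    const std::vector<std::vector<float>>& inputs, size_t world_size) {
  size_t n = inputs[0].size();
  size_t seg = n / world_size;  // assume n is divisible by world_size
  std::vector<float> result(n, 0.0f);
  for (size_t r = 0; r < world_size; ++r) {         // segment owned by rank r
    for (size_t i = r * seg; i < (r + 1) * seg; ++i) {
      float acc = 0.0f;
      for (size_t peer = 0; peer < world_size; ++peer)  // scatter-reduce
        acc += inputs[peer][i];
      result[i] = acc;  // all-gather: reduced segment visible to all ranks
    }
  }
  return result;
}
```

Splitting the reduction this way means each rank only reduces 1/world_size of the data before the gather phase, which is what makes the protocol bandwidth-efficient.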
Usage
Use this when performing tensor-parallel inference on AMD GPUs where a bandwidth-efficient all-reduce is needed. The quantized compression levels (INT4/INT6/INT8) reduce communication volume at the cost of minor precision loss, making them suitable for large hidden-dimension reductions in transformer models.
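To make the precision tradeoff concrete, here is a minimal host-side sketch of symmetric INT8 block quantization, the general kind of compression the quantized levels apply. The assumed scheme (one fp32 scale per block, scale = max|x| / 127) is illustrative; the actual wire format in quick_reduce_impl.cuh packs scales and INT6/INT4 payloads differently.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative symmetric INT8 block quantization (not the kernel's wire
// format): one scale per block, values rounded to [-127, 127].
struct QuantBlock {
  float scale;
  std::vector<int8_t> q;
};

QuantBlock quantize_int8(const std::vector<float>& x) {
  float amax = 0.0f;
  for (float v : x) amax = std::max(amax, std::fabs(v));
  QuantBlock b{amax > 0.0f ? amax / 127.0f : 1.0f, {}};
  b.q.reserve(x.size());
  for (float v : x)
    b.q.push_back(static_cast<int8_t>(std::lround(v / b.scale)));
  return b;
}

std::vector<float> dequantize(const QuantBlock& b) {
  std::vector<float> out;
  out.reserve(b.q.size());
  for (int8_t q : b.q) out.push_back(q * b.scale);
  return out;
}
```

The per-element roundtrip error is bounded by the block scale, which is why the precision loss stays small when values within a block have similar magnitude.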
Code Reference
Source Location
- Repository: vllm
- File: csrc/quickreduce/quick_reduce.h
- Lines: 1-197
Signature
namespace quickreduce {

enum QuickReduceQuantLevel {
  F16 = 0,
  INT8 = 1,
  INT6 = 2,
  INT4 = 3,
};

struct DeviceComms {
  void init(int world_size, int rank,
            std::optional<int64_t> max_problem_size = std::nullopt);
  void destroy();
  void open_ipc_handles(std::vector<hipIpcMemHandle_t> const& ipc_handles);

  template <typename T, bool cast_bf2half>
  void allreduce(T const* A, T* B, uint32_t N, int quant_level,
                 hipStream_t stream);

  int get_world_size();
  int get_rank();
  bool status();
  hipIpcMemHandle_t const get_handle();
};

}  // namespace quickreduce
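Since init accepts only world sizes of 2, 4, or 8 (see the I/O contract below), a caller-side validation helper can fail fast with a clear message. This helper is illustrative and not part of quick_reduce.h:

```cpp
#include <stdexcept>
#include <string>

// Illustrative caller-side check mirroring the documented constraints:
// world_size must be 2, 4, or 8, and rank must lie in [0, world_size).
// Not part of quick_reduce.h.
void validate_comms_args(int world_size, int rank) {
  if (world_size != 2 && world_size != 4 && world_size != 8)
    throw std::invalid_argument("world_size must be 2, 4, or 8, got " +
                                std::to_string(world_size));
  if (rank < 0 || rank >= world_size)
    throw std::invalid_argument("rank out of range: " + std::to_string(rank));
}
```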
Import
#include <hip/hip_runtime.h>
#include "quick_reduce_impl.cuh"
#include "quick_reduce.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| world_size | int | Yes | Number of GPUs participating (must be 2, 4, or 8) |
| rank | int | Yes | Rank of the current GPU in the communication group |
| max_problem_size | std::optional<int64_t> | No | Maximum data size in bytes (default 2GB) |
| A | T const* | Yes | Input tensor data pointer for all-reduce |
| B | T* | Yes | Output tensor data pointer for all-reduce result |
| N | uint32_t | Yes | Number of elements to reduce |
| quant_level | int | Yes | Compression level: 0=F16, 1=INT8, 2=INT6, 3=INT4 |
| stream | hipStream_t | Yes | HIP stream for asynchronous execution |
Outputs
| Name | Type | Description |
|---|---|---|
| B | T* | Output buffer containing the all-reduced result across all ranks |
Usage Examples
// Initialize communication for 8-GPU tensor parallelism
quickreduce::DeviceComms comms;
comms.init(/*world_size=*/8, /*rank=*/my_rank);

// Exchange IPC handles (gathered from all ranks)
comms.open_ipc_handles(all_ipc_handles);

// Perform INT8-quantized all-reduce on hidden states
comms.allreduce<half, /*cast_bf2half=*/false>(
    input_ptr,                                 // source data
    output_ptr,                                // destination data
    num_elements,                              // number of elements
    quickreduce::QuickReduceQuantLevel::INT8,  // quantization level
    hip_stream);                               // HIP stream