
Implementation:Vllm project Vllm Quick Reduce

From Leeroopedia


Knowledge Sources
Domains: AllReduce, Tensor_Parallel, AMD_GPU
Last Updated: 2026-02-08 00:00 GMT

Overview

Implements a two-shot quantized all-reduce collective communication kernel for tensor parallelism on AMD CDNA GPUs.

Description

This header defines the DeviceComms struct that manages IPC memory handles and buffer allocation for multi-GPU all-reduce operations using HIP. It provides a templated allreduce method that dispatches to quantized two-shot kernels supporting FP16, INT8, INT6, and INT4 compression levels. The two-shot protocol splits the reduction into scatter-reduce and all-gather phases, with configurable world sizes of 2, 4, or 8 GPUs. Communication buffers are allocated using hipExtMallocWithFlags with uncached memory for low-latency inter-GPU transfers.

Usage

Use this when performing tensor-parallel inference on AMD GPUs where bandwidth-efficient all-reduce is needed. The quantized compression levels (INT4/INT6/INT8) reduce communication volume at the cost of minor precision loss, making it suitable for large hidden dimension reductions in transformer models.

Code Reference

Source Location

Signature

namespace quickreduce {

enum QuickReduceQuantLevel {
  F16 = 0,
  INT8 = 1,
  INT6 = 2,
  INT4 = 3,
};

struct DeviceComms {
  void init(int world_size, int rank,
            std::optional<int64_t> max_problem_size = std::nullopt);
  void destroy();
  void open_ipc_handles(std::vector<hipIpcMemHandle_t> const& ipc_handles);

  template <typename T, bool cast_bf2half>
  void allreduce(T const* A, T* B, uint32_t N, int quant_level,
                 hipStream_t stream);

  int get_world_size();
  int get_rank();
  bool status();
  hipIpcMemHandle_t const get_handle();
};

}  // namespace quickreduce

Import

#include <hip/hip_runtime.h>
#include "quick_reduce_impl.cuh"
#include "quick_reduce.h"

I/O Contract

Inputs

Name Type Required Description
world_size int Yes Number of GPUs participating (must be 2, 4, or 8)
rank int Yes Rank of the current GPU in the communication group
max_problem_size std::optional<int64_t> No Maximum data size in bytes (default 2GB)
A T const* Yes Input tensor data pointer for all-reduce
B T* Yes Output tensor data pointer for all-reduce result
N uint32_t Yes Number of elements to reduce
quant_level int Yes Compression level: 0=F16, 1=INT8, 2=INT6, 3=INT4
stream hipStream_t Yes HIP stream for asynchronous execution

Outputs

Name Type Description
B T* Output buffer containing the all-reduced result across all ranks

Usage Examples

// Initialize communication for 8-GPU tensor parallelism
quickreduce::DeviceComms comms;
comms.init(/*world_size=*/8, /*rank=*/my_rank);

// Exchange IPC handles (gathered from all ranks)
comms.open_ipc_handles(all_ipc_handles);

// Perform INT8-quantized all-reduce on hidden states
comms.allreduce<half, /*cast_bf2half=*/false>(
    input_ptr,      // source data
    output_ptr,     // destination data
    num_elements,   // number of elements
    quickreduce::QuickReduceQuantLevel::INT8,  // quantization level
    hip_stream      // HIP stream
);
