Implementation:Sgl project Sglang Allreduce Interface
| Knowledge Sources | |
|---|---|
| Domains | Kernel, Distributed Computing, Multi-GPU |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Python interface for custom allreduce operations used in multi-GPU tensor parallelism, providing platform-specific implementations for CUDA and ROCm.
Description
The allreduce.py module selects a platform-specific implementation by checking torch.version.hip, which distinguishes ROCm builds from CUDA builds. All functions delegate to torch.ops.sgl_kernel.* C++ ops.
- ROCm: custom allreduce (init_custom_ar, all_reduce_reg, all_reduce_unreg, and deterministic variants), quick allreduce (init_custom_qr, qr_all_reduce, qr_destroy), and IPC buffer management (allocate_meta_buffer, get_meta_buffer_ipc_handle).
- CUDA: custom allreduce (init_custom_ar, all_reduce, dispose) and MSCCLPP-based allreduce (mscclpp_generate_unique_id, mscclpp_init_context, mscclpp_allreduce).
- Common utilities: meta_size, register_buffer, get_graph_buffer_ipc_meta, and register_graph_buffers, which manage shared memory buffers across GPUs.
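The platform split described above can be illustrated with a small, GPU-free sketch. It is not the module's actual code: the helper `exposed_functions` and the three tuples are hypothetical, and the `hip_version` argument stands in for `torch.version.hip`, which is `None` on non-ROCm builds of PyTorch. The function names are copied from the description; the deterministic ROCm variants are omitted because the description does not name them.

```python
from typing import Optional, Tuple

# Names taken from the description above; the grouping into tuples is
# only for illustration and is not how allreduce.py is organized.
ROCM_FUNCTIONS: Tuple[str, ...] = (
    "init_custom_ar", "all_reduce_reg", "all_reduce_unreg",
    "init_custom_qr", "qr_all_reduce", "qr_destroy",
    "allocate_meta_buffer", "get_meta_buffer_ipc_handle",
)
CUDA_FUNCTIONS: Tuple[str, ...] = (
    "init_custom_ar", "all_reduce", "dispose",
    "mscclpp_generate_unique_id", "mscclpp_init_context",
    "mscclpp_allreduce",
)
COMMON_FUNCTIONS: Tuple[str, ...] = (
    "meta_size", "register_buffer",
    "get_graph_buffer_ipc_meta", "register_graph_buffers",
)


def exposed_functions(hip_version: Optional[str]) -> Tuple[str, ...]:
    """Return the function names available for a given platform.

    hip_version is a stand-in for torch.version.hip: a version string on
    ROCm builds, None otherwise (treated here as the CUDA path).
    """
    platform = ROCM_FUNCTIONS if hip_version is not None else CUDA_FUNCTIONS
    return platform + COMMON_FUNCTIONS
```

Under these assumptions, calling code can check whether a name such as `all_reduce` is present before relying on it, instead of assuming both platforms expose the same surface.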
Usage
Use these functions for multi-GPU inference when custom allreduce implementations are needed to bypass NCCL overhead for small tensor communications within a node.
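As a rough illustration of the "small tensors within a node" guidance, a caller might gate the custom path on message size and topology. The helper below is hypothetical and is not part of sgl_kernel; the 8 MiB default threshold is an assumption chosen for the example, and real deployments would tune it or apply further checks.

```python
def prefer_custom_allreduce(
    nbytes: int,
    same_node: bool,
    max_custom_bytes: int = 8 * 1024 * 1024,  # assumed threshold, not from sgl_kernel
) -> bool:
    """Return True when the custom allreduce path looks appropriate.

    nbytes: size of the tensor to reduce, in bytes.
    same_node: whether all participating GPUs are on one node.
    """
    return same_node and 0 < nbytes <= max_custom_bytes
```

When the function returns False, the caller would fall back to its regular collective backend, such as NCCL.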
Code Reference
Source Location
- Repository: Sgl_project_Sglang
- File: sgl-kernel/python/sgl_kernel/allreduce.py
- Lines: 1-186
Signature
# CUDA path
def init_custom_ar(
    ipc_tensors: List[int], rank_data: torch.Tensor,
    rank: int, full_nvlink: bool
) -> int: ...
def all_reduce(
    fa: int, inp: torch.Tensor, out: torch.Tensor,
    reg_buffer: int, reg_buffer_sz_bytes: int
) -> None: ...
def dispose(fa: int) -> None: ...
def get_graph_buffer_ipc_meta(fa) -> Tuple[List[int], List[int]]: ...
def register_buffer(fa: int, fake_ipc_ptrs: List[int]) -> None: ...
def register_graph_buffers(
    fa: int, handles: List[List[int]], offsets: List[List[int]]
) -> None: ...
def meta_size() -> int: ...
def mscclpp_generate_unique_id() -> torch.Tensor: ...
def mscclpp_init_context(
    unique_id: torch.Tensor, rank: int, world_size: int,
    scratch: torch.Tensor, put_buffer: torch.Tensor,
    nranks_per_node: int, rank_to_node: List[int],
    rank_to_ib: List[int], context_selection: int
) -> int: ...
def mscclpp_allreduce(
    context: int, inp: torch.Tensor, out: torch.Tensor,
    nthreads: int, nblocks: int
) -> None: ...
Import
from sgl_kernel.allreduce import *
# or
from sgl_kernel import init_custom_ar, all_reduce, dispose
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| fa | int | Yes | Allreduce context handle returned by init_custom_ar |
| inp | torch.Tensor | Yes | Input tensor to reduce |
| out | torch.Tensor | Yes | Output tensor for reduced result |
| rank | int | Yes | Current GPU rank in the group |
| full_nvlink | bool | Yes | Whether full NVLink connectivity is available |
| ipc_tensors | List[int] | Yes | IPC buffer pointers, passed as integers to init_custom_ar |
| rank_data | torch.Tensor | Yes | Rank data tensor passed to init_custom_ar |
| reg_buffer | int | Yes | Registered buffer address, passed as an integer to all_reduce |
| reg_buffer_sz_bytes | int | Yes | Size of the registered buffer in bytes |
Outputs
| Name | Type | Description |
|---|---|---|
| fa | int | Allreduce context handle (from init_custom_ar) |
| out | torch.Tensor | Reduced tensor result (written into the caller-provided out tensor) |
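To make the contract concrete, the following pure-Python reference mimics what each rank observes after an allreduce. It assumes a sum reduction, the usual choice for tensor parallelism, which this page does not state explicitly. The function `reference_all_reduce` is a hypothetical teaching aid that runs on plain lists and does not call sgl_kernel.

```python
from typing import List


def reference_all_reduce(per_rank_inputs: List[List[float]]) -> List[List[float]]:
    """Simulate a sum allreduce across ranks.

    per_rank_inputs[r] plays the role of `inp` on rank r. The return value
    holds one `out` buffer per rank; every rank receives the same
    elementwise sum, and the inputs are left unmodified.
    """
    if not per_rank_inputs:
        return []
    length = len(per_rank_inputs[0])
    if any(len(buf) != length for buf in per_rank_inputs):
        raise ValueError("all ranks must contribute buffers of equal length")
    # Elementwise sum over the rank dimension.
    total = [sum(values) for values in zip(*per_rank_inputs)]
    # Each rank gets its own copy of the reduced result.
    return [list(total) for _ in per_rank_inputs]
```

A reference like this can be handy when checking a GPU result in tests: gather each rank's input on the host, compute the expected sum, and compare it against `out` within a floating-point tolerance.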
Usage Examples
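from sgl_kernel.allreduce import (
    init_custom_ar, all_reduce, dispose, meta_size
)
# ipc_tensors, rank_data, inp_tensor, out_tensor, reg_buffer, and
# reg_buffer_sz_bytes are placeholders that the caller must prepare beforehand.
# Initialize custom allreduce; returns the context handle
fa = init_custom_ar(ipc_tensors, rank_data, rank=0, full_nvlink=True)
# Perform allreduce; the result is written to out_tensor
all_reduce(fa, inp_tensor, out_tensor, reg_buffer, reg_buffer_sz_bytes)
# Cleanup: release the context
dispose(fa)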
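Because the handle returned by init_custom_ar must eventually be passed to dispose, one option is to wrap the init/dispose pair in a context manager so that cleanup also runs when an exception occurs. The sketch below is a hypothetical helper, not part of sgl_kernel. It takes the init and dispose callables as arguments, so it can be exercised with stand-in functions on a machine without GPUs.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def custom_ar_handle(
    init_fn: Callable[..., int],
    dispose_fn: Callable[[int], None],
    *init_args: Any,
    **init_kwargs: Any,
) -> Iterator[int]:
    """Yield an allreduce handle and always dispose it on exit.

    init_fn and dispose_fn are expected to behave like init_custom_ar and
    dispose from the signatures above; they are injected so that the helper
    does not import sgl_kernel itself.
    """
    fa = init_fn(*init_args, **init_kwargs)
    try:
        yield fa
    finally:
        # Runs on normal exit and when the body raises.
        dispose_fn(fa)
```

Assuming the real functions are passed in, usage would look like `with custom_ar_handle(init_custom_ar, dispose, ipc_tensors, rank_data, rank=0, full_nvlink=True) as fa:` followed by the all_reduce calls inside the block.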