Implementation:Sgl project Sglang Allreduce Interface

Knowledge Sources	Sgl_project_Sglang
Domains	Kernel, Distributed Computing, Multi-GPU
Last Updated	2026-02-10 00:00 GMT

Overview

Python interface for custom allreduce operations used in multi-GPU tensor parallelism, providing platform-specific implementations for CUDA and ROCm.

Description

The allreduce.py module selects a platform-specific implementation by checking torch.version.hip, which is set on ROCm builds of PyTorch and is None otherwise. On ROCm, it exposes custom allreduce functions (init_custom_ar, all_reduce_reg, all_reduce_unreg, and deterministic variants), quick allreduce functions (init_custom_qr, qr_all_reduce, qr_destroy), and IPC buffer management (allocate_meta_buffer, get_meta_buffer_ipc_handle). On CUDA, it wraps custom allreduce (init_custom_ar, all_reduce, dispose) and MSCCLPP-based allreduce (mscclpp_generate_unique_id, mscclpp_init_context, mscclpp_allreduce). Common utilities for managing shared memory buffers across GPUs include meta_size, register_buffer, get_graph_buffer_ipc_meta, and register_graph_buffers. Every function delegates to a C++ op under torch.ops.sgl_kernel.*.
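The platform split described above can be sketched without a GPU. The snippet below is an illustration only: the helper listed_functions and the grouping tables are hypothetical, and the function names are copied from the description rather than read from the module. It assumes the check is equivalent to torch.version.hip is not None; the module's exact test may differ.

```python
from typing import Dict, List, Optional

# Names as listed in the description above; this is not an authoritative export list.
LISTED: Dict[str, List[str]] = {
    "rocm": [
        "init_custom_ar", "all_reduce_reg", "all_reduce_unreg",
        "init_custom_qr", "qr_all_reduce", "qr_destroy",
        "allocate_meta_buffer", "get_meta_buffer_ipc_handle",
    ],
    "cuda": [
        "init_custom_ar", "all_reduce", "dispose",
        "mscclpp_generate_unique_id", "mscclpp_init_context",
        "mscclpp_allreduce",
    ],
}
COMMON: List[str] = [
    "meta_size", "register_buffer",
    "get_graph_buffer_ipc_meta", "register_graph_buffers",
]


def listed_functions(hip_version: Optional[str]) -> List[str]:
    """Return the documented function names for a platform.

    hip_version stands in for torch.version.hip: a version string on
    ROCm builds of PyTorch, None on CUDA builds.
    """
    platform = "rocm" if hip_version is not None else "cuda"
    return sorted(LISTED[platform] + COMMON)
```

Passing torch.version.hip as the argument on a real installation would show which group of names this page documents for that build.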

Usage

Use these functions in multi-GPU inference when small, latency-sensitive allreduce calls between GPUs within a single node should bypass NCCL overhead.
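One way to act on this guidance is a size-based dispatch that routes only small, intra-node reductions to the custom path. The sketch below is hypothetical: pick_allreduce_path, its return labels, and the 8 MiB default threshold are illustrative assumptions and are not taken from sgl_kernel.

```python
def pick_allreduce_path(
    nbytes: int,
    world_size: int,
    max_custom_bytes: int = 8 * 1024 * 1024,  # assumed threshold, tune per system
    single_node: bool = True,
) -> str:
    """Choose an allreduce path for a tensor of nbytes bytes.

    Returns "none" when there is nothing to reduce across, "custom" for
    small tensors on a single node, and "nccl" for everything else.
    """
    if world_size < 2:
        return "none"
    if single_node and nbytes <= max_custom_bytes:
        return "custom"
    return "nccl"
```

A caller would compute nbytes as inp.numel() * inp.element_size() and invoke the custom all_reduce only when the result is "custom".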

Code Reference

Source Location

Repository: Sgl_project_Sglang
File: sgl-kernel/python/sgl_kernel/allreduce.py
Lines: 1-186

Signature

# CUDA path
def init_custom_ar(
    ipc_tensors: List[int], rank_data: torch.Tensor,
    rank: int, full_nvlink: bool
) -> int: ...

def all_reduce(
    fa: int, inp: torch.Tensor, out: torch.Tensor,
    reg_buffer: int, reg_buffer_sz_bytes: int
) -> None: ...

def dispose(fa: int) -> None: ...

def get_graph_buffer_ipc_meta(fa: int) -> Tuple[List[int], List[int]]: ...

def register_buffer(fa: int, fake_ipc_ptrs: List[int]) -> None: ...

def register_graph_buffers(
    fa: int, handles: List[List[int]], offsets: List[List[int]]
) -> None: ...

def meta_size() -> int: ...

def mscclpp_generate_unique_id() -> torch.Tensor: ...

def mscclpp_init_context(
    unique_id: torch.Tensor, rank: int, world_size: int,
    scratch: torch.Tensor, put_buffer: torch.Tensor,
    nranks_per_node: int, rank_to_node: List[int],
    rank_to_ib: List[int], context_selection: int
) -> int: ...

def mscclpp_allreduce(
    context: int, inp: torch.Tensor, out: torch.Tensor,
    nthreads: int, nblocks: int
) -> None: ...

Import

from sgl_kernel.allreduce import *
# or
from sgl_kernel import init_custom_ar, all_reduce, dispose

I/O Contract

Inputs

Name	Type	Required	Description
fa	int	Yes	Allreduce context handle returned by init_custom_ar
inp	torch.Tensor	Yes	Input tensor to reduce
out	torch.Tensor	Yes	Output tensor for reduced result
rank	int	Yes	Current GPU rank in the group
full_nvlink	bool	Yes	Whether full NVLink connectivity is available

Outputs

Name	Type	Description
fa	int	Allreduce context handle (from init_custom_ar)
out	torch.Tensor	Reduced result, written into the caller-provided out tensor (all_reduce itself returns None)
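The contract above can be checked against a CPU reference. The sketch below assumes the reduction is an elementwise sum, as is usual for tensor parallelism; this page does not state the reduction operator. It uses plain Python lists in place of tensors, and reference_all_reduce is a hypothetical helper, not part of sgl_kernel.

```python
from typing import List


def reference_all_reduce(per_rank_inputs: List[List[float]]) -> List[List[float]]:
    """Simulate a sum allreduce: every rank receives the elementwise sum.

    per_rank_inputs[r] plays the role of inp on rank r; the returned
    list holds what out would contain on each rank afterwards.
    """
    if not per_rank_inputs:
        raise ValueError("need at least one rank")
    length = len(per_rank_inputs[0])
    if any(len(values) != length for values in per_rank_inputs):
        raise ValueError("all ranks must contribute the same number of elements")
    # Sum position by position across ranks.
    total = [sum(column) for column in zip(*per_rank_inputs)]
    # Each rank gets its own copy of the reduced result.
    return [list(total) for _ in per_rank_inputs]
```

Comparing the out tensor of each rank against such a reference, within a floating-point tolerance, is a simple way to validate a custom allreduce path.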

Usage Examples

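from sgl_kernel.allreduce import init_custom_ar, all_reduce, dispose

# ipc_tensors, rank_data, inp_tensor, out_tensor, reg_buffer, and
# reg_buffer_sz_bytes are placeholders that the caller must set up beforehand.

# Initialize custom allreduce; rank is this process's rank (0 shown here)
fa = init_custom_ar(ipc_tensors, rank_data, rank=0, full_nvlink=True)

# Perform allreduce; the result is written into out_tensor
all_reduce(fa, inp_tensor, out_tensor, reg_buffer, reg_buffer_sz_bytes)

# Cleanup: release the context handle
dispose(fa)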
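Because the handle must be released with dispose even when a reduction fails, the init/dispose pair can be wrapped in a context manager. This is a minimal sketch: allreduce_context is a hypothetical helper, not part of sgl_kernel, and it takes the init and dispose callables as arguments so it can be exercised without a GPU.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def allreduce_context(
    init: Callable[[], int],
    dispose: Callable[[int], None],
) -> Iterator[int]:
    """Yield a context handle and always dispose of it on exit.

    init is expected to return the handle (for example a lambda that
    calls init_custom_ar with prepared arguments); dispose receives
    that same handle.
    """
    fa = init()
    try:
        yield fa
    finally:
        # Runs on normal exit and when the body raises.
        dispose(fa)
```

With the real functions, usage would look like `with allreduce_context(lambda: init_custom_ar(ipc_tensors, rank_data, rank, full_nvlink), dispose) as fa:` followed by the all_reduce calls in the body.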

Related Pages

Environment:Sgl_project_Sglang_CUDA_GPU_Runtime
