Implementation:Sgl project Sglang Kernel Utils Header

Knowledge Sources	Sgl_project_Sglang
Domains	GPU Kernels, C++ Headers, CUDA Utilities
Last Updated	2026-02-10 00:00 GMT

Overview

Shared utility header providing dtype dispatch macros, CUDA/HIP error handling, tensor validation macros, and common device helper functions used by all sgl_kernel CUDA/HIP kernel implementations.

Description

utils.h is the foundational utility header included by nearly all kernel source files in the SGLang kernel library. It provides the dtype dispatch infrastructure that enables kernels to support multiple precision formats through a single templated implementation.

The header is organized into several major sections:

Dtype Dispatch Macros: The core dispatch system maps PyTorch scalar types to native C++ CUDA types:

DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16 -- dispatches Half and BFloat16 to nv_half / nv_bfloat16 (or __half / __hip_bfloat16 on ROCm)
DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP8 -- dispatches FP8 E4M3 and E5M2 types to __nv_fp8_e4m3 / __nv_fp8_e5m2
DISPATCH_PYTORCH_DTYPE_TO_CTYPE -- combined dispatch for all supported floating point types
DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16 -- dispatch for float32 plus FP16 types
DISPATCH_BOOL -- constexpr boolean dispatch for compile-time optimization
_DISPATCH_SWITCH and _DISPATCH_SWITCH_U16x2 -- generic switch dispatchers with packed uint16 support

Tensor Validation Macros: Convenience macros for kernel input validation:

CHECK_CUDA, CHECK_CONTIGUOUS, CHECK_INPUT -- basic tensor checks
CHECK_LAST_DIM_CONTIGUOUS -- ensures last dimension is contiguous (stride=1)
CHECK_DIM, CHECK_SHAPE, CHECK_EQ, CHECK_GE -- dimensional and shape assertions
CHECK_GQA_HEAD_DIVISIBLE -- validates GQA head count divisibility

CUDA Error Handling: The cuda_error exception class wraps CUDA errors into C++ exceptions, and the CHECK_CUDA_SUCCESS macro provides checked CUDA API calls with file/line error reporting.

Device Query Functions:

getSMVersion() -- returns the SM version (major*10 + minor) for architecture-specific code paths
isDeviceType() -- checks if the current device matches a specific GPU name
getBoolEnv() -- reads boolean environment variables
getEnvEnablePDL() -- thread-safe check for Programmatic Dependent Launch support (SM >= 90)

GPU Device Functions:

atomicMaxFloat -- atomic float maximum with platform-specific implementations for CUDA and ROCm
warpReduceMax -- warp-level maximum reduction using shuffle instructions
blockReduceMax -- block-level maximum reduction using shared memory
castToFloat / castFromFloat -- type casting helpers for device code

Utility Functions and Constants:

pad_tensor -- pads a tensor to alignment boundaries (default 4 rows) with optional column-major layout
next_pow2 -- computes the next power of 2 for a given value
pack_u16 -- packs two uint16 values into a uint32 for dispatch key packing
is_float8_tensor -- checks if a tensor is FP8 (E4M3 or E5M2)
CEILDIV, WARP_SIZE, FULL_MASK, SGLANG_LDG -- common constants and macros

The header uses conditional compilation extensively for ROCm compatibility, with #ifdef USE_ROCM guards for HIP-specific types, warp sizes (64 for GFX9 vs 32), and intrinsic function differences.

Usage

This header should be included by any CUDA/HIP kernel implementation file that needs dtype dispatching, tensor validation, or common device functions. It is the standard utility header for all kernel development in the sgl_kernel library.

Code Reference

Source Location

Repository: Sgl_project_Sglang
File: sgl-kernel/include/utils.h
Lines: 1-469

Signature

// Tensor validation macros
#define CHECK_CUDA(x) TORCH_CHECK(x.is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)
#define CHECK_DIM(d, x) TORCH_CHECK(x.dim() == d, #x " must be a " #d "D tensor")
#define CHECK_SHAPE(a, b) check_shape(a, b, #a, #b)

// Dtype dispatch macros
#define DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16(pytorch_dtype, c_type, ...)
#define DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP8(pytorch_dtype, c_type, ...)
#define DISPATCH_PYTORCH_DTYPE_TO_CTYPE(pytorch_dtype, c_type, ...)
#define DISPATCH_BOOL(expr, const_expr, ...)

// Helper functions
inline void check_shape(const at::Tensor& a, const at::Tensor& b, const char* a_name, const char* b_name);
inline constexpr uint32_t pack_u16(uint16_t a, uint16_t b);
inline int getSMVersion();
inline bool isDeviceType(const std::string& device_type);
inline bool getBoolEnv(char const* name);
inline bool getEnvEnablePDL();
inline bool is_float8_tensor(const at::Tensor& tensor);
inline torch::Tensor pad_tensor(const torch::Tensor& tensor, int64_t alignment = 4, bool is_column_major = false);
inline uint32_t next_pow2(uint32_t x) noexcept;

// Device functions
__device__ __forceinline__ float atomicMaxFloat(float* addr, float value);
__device__ __forceinline__ float warpReduceMax(float value);
__device__ __forceinline__ float blockReduceMax(float value);

// Error handling
struct cuda_error : public std::runtime_error { ... };
#define CHECK_CUDA_SUCCESS(cmd)

Import

#include "utils.h"

I/O Contract

Inputs

Name	Type	Required	Description
pytorch_dtype	at::ScalarType	Yes	PyTorch dtype for dispatch macros to map to C++ types
tensor	at::Tensor	Yes	Tensor to validate (for CHECK_* macros)
device_type	std::string	No	GPU device name string for isDeviceType()
name	char const*	No	Environment variable name for getBoolEnv()

Outputs

Name	Type	Description
SM version	int	GPU SM version as major*10+minor (e.g., 90 for Hopper)
PDL enabled	bool	Whether Programmatic Dependent Launch is enabled
Padded tensor	torch::Tensor	Tensor padded to alignment boundary
Next power of 2	uint32_t	Next power of 2 for given input

Usage Examples

// Using dtype dispatch in a kernel implementation
DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16(input.scalar_type(), c_type, [&] {
    // c_type is now nv_half or nv_bfloat16
    my_kernel<c_type><<<grid, block>>>(input.data_ptr<c_type>(), ...);
    return true;
});

// Validating tensor inputs
CHECK_INPUT(input);     // checks CUDA + contiguous
CHECK_DIM(2, weight);   // checks 2D tensor
CHECK_SHAPE(a, b);      // checks matching shapes

// Querying device capabilities
int sm = getSMVersion();  // e.g., 90 for SM9.0
if (sm >= 90 && getEnvEnablePDL()) {
    // Use programmatic dependent launch
}

// Block-level reduction in device code
__device__ float maxVal = blockReduceMax(threadValue);

Related Pages

Environment:Sgl_project_Sglang_CUDA_GPU_Runtime

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment