Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Sgl project Sglang Kernel Utils Header

From Leeroopedia


Knowledge Sources
Domains GPU Kernels, C++ Headers, CUDA Utilities
Last Updated 2026-02-10 00:00 GMT

Overview

Shared utility header providing dtype dispatch macros, CUDA/HIP error handling, tensor validation macros, and common device helper functions used by all sgl_kernel CUDA/HIP kernel implementations.

Description

utils.h is the foundational utility header included by nearly all kernel source files in the SGLang kernel library. It provides the dtype dispatch infrastructure that enables kernels to support multiple precision formats through a single templated implementation.

The header is organized into several major sections:

Dtype Dispatch Macros: The core dispatch system maps PyTorch scalar types to native C++ CUDA types:

  • DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16 -- dispatches Half and BFloat16 to nv_half / nv_bfloat16 (or __half / __hip_bfloat16 on ROCm)
  • DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP8 -- dispatches FP8 E4M3 and E5M2 types to __nv_fp8_e4m3 / __nv_fp8_e5m2
  • DISPATCH_PYTORCH_DTYPE_TO_CTYPE -- combined dispatch for all supported floating point types
  • DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16 -- dispatch for float32 plus FP16 types
  • DISPATCH_BOOL -- constexpr boolean dispatch for compile-time optimization
  • _DISPATCH_SWITCH and _DISPATCH_SWITCH_U16x2 -- generic switch dispatchers with packed uint16 support

Tensor Validation Macros: Convenience macros for kernel input validation:

  • CHECK_CUDA, CHECK_CONTIGUOUS, CHECK_INPUT -- basic tensor checks
  • CHECK_LAST_DIM_CONTIGUOUS -- ensures last dimension is contiguous (stride=1)
  • CHECK_DIM, CHECK_SHAPE, CHECK_EQ, CHECK_GE -- dimensional and shape assertions
  • CHECK_GQA_HEAD_DIVISIBLE -- validates GQA head count divisibility

CUDA Error Handling: The cuda_error exception class wraps CUDA errors into C++ exceptions, and the CHECK_CUDA_SUCCESS macro provides checked CUDA API calls with file/line error reporting.

Device Query Functions:

  • getSMVersion() -- returns the SM version (major*10 + minor) for architecture-specific code paths
  • isDeviceType() -- checks if the current device matches a specific GPU name
  • getBoolEnv() -- reads boolean environment variables
  • getEnvEnablePDL() -- thread-safe check for Programmatic Dependent Launch support (SM >= 90)

GPU Device Functions:

  • atomicMaxFloat -- atomic float maximum with platform-specific implementations for CUDA and ROCm
  • warpReduceMax -- warp-level maximum reduction using shuffle instructions
  • blockReduceMax -- block-level maximum reduction using shared memory
  • castToFloat / castFromFloat -- type casting helpers for device code

Utility Functions and Constants:

  • pad_tensor -- pads a tensor to alignment boundaries (default 4 rows) with optional column-major layout
  • next_pow2 -- computes the next power of 2 for a given value
  • pack_u16 -- packs two uint16 values into a uint32 for dispatch key packing
  • is_float8_tensor -- checks if a tensor is FP8 (E4M3 or E5M2)
  • CEILDIV, WARP_SIZE, FULL_MASK, SGLANG_LDG -- common constants and macros

The header uses conditional compilation extensively for ROCm compatibility, with #ifdef USE_ROCM guards for HIP-specific types, warp sizes (64 for GFX9 vs 32), and intrinsic function differences.

Usage

This header should be included by any CUDA/HIP kernel implementation file that needs dtype dispatching, tensor validation, or common device functions. It is the standard utility header for all kernel development in the sgl_kernel library.

Code Reference

Source Location

Signature

// Tensor validation macros
#define CHECK_CUDA(x) TORCH_CHECK(x.is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)
#define CHECK_DIM(d, x) TORCH_CHECK(x.dim() == d, #x " must be a " #d "D tensor")
#define CHECK_SHAPE(a, b) check_shape(a, b, #a, #b)

// Dtype dispatch macros
#define DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16(pytorch_dtype, c_type, ...)
#define DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP8(pytorch_dtype, c_type, ...)
#define DISPATCH_PYTORCH_DTYPE_TO_CTYPE(pytorch_dtype, c_type, ...)
#define DISPATCH_BOOL(expr, const_expr, ...)

// Helper functions
inline void check_shape(const at::Tensor& a, const at::Tensor& b, const char* a_name, const char* b_name);
inline constexpr uint32_t pack_u16(uint16_t a, uint16_t b);
inline int getSMVersion();
inline bool isDeviceType(const std::string& device_type);
inline bool getBoolEnv(char const* name);
inline bool getEnvEnablePDL();
inline bool is_float8_tensor(const at::Tensor& tensor);
inline torch::Tensor pad_tensor(const torch::Tensor& tensor, int64_t alignment = 4, bool is_column_major = false);
inline uint32_t next_pow2(uint32_t x) noexcept;

// Device functions
__device__ __forceinline__ float atomicMaxFloat(float* addr, float value);
__device__ __forceinline__ float warpReduceMax(float value);
__device__ __forceinline__ float blockReduceMax(float value);

// Error handling
struct cuda_error : public std::runtime_error { ... };
#define CHECK_CUDA_SUCCESS(cmd)

Import

#include "utils.h"

I/O Contract

Inputs

Name Type Required Description
pytorch_dtype at::ScalarType Yes PyTorch dtype for dispatch macros to map to C++ types
tensor at::Tensor Yes Tensor to validate (for CHECK_* macros)
device_type std::string No GPU device name string for isDeviceType()
name char const* No Environment variable name for getBoolEnv()

Outputs

Name Type Description
SM version int GPU SM version as major*10+minor (e.g., 90 for Hopper)
PDL enabled bool Whether Programmatic Dependent Launch is enabled
Padded tensor torch::Tensor Tensor padded to alignment boundary
Next power of 2 uint32_t Next power of 2 for given input

Usage Examples

// Using dtype dispatch in a kernel implementation
DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16(input.scalar_type(), c_type, [&] {
    // c_type is now nv_half or nv_bfloat16
    my_kernel<c_type><<<grid, block>>>(input.data_ptr<c_type>(), ...);
    return true;
});

// Validating tensor inputs
CHECK_INPUT(input);     // checks CUDA + contiguous
CHECK_DIM(2, weight);   // checks 2D tensor
CHECK_SHAPE(a, b);      // checks matching shapes

// Querying device capabilities
int sm = getSMVersion();  // e.g., 90 for SM9.0
if (sm >= 90 && getEnvEnablePDL()) {
    // Use programmatic dependent launch
}

// Block-level reduction in device code
__device__ float maxVal = blockReduceMax(threadValue);

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment