Implementation:Sgl project Sglang CPU Common Header

From Leeroopedia


Knowledge Sources
Domains: Kernel Infrastructure, CPU Compute
Last Updated: 2026-02-10 00:00 GMT

Overview

Central header file providing shared macros, type dispatch utilities, validation helpers, and common functions used across all CPU kernel implementations in the sgl-kernel package.

Description

common.h is the foundation header for the entire CPU kernel subsystem. Every CPU kernel file (decode, extend, flash_attn, gemm, etc.) includes this header for shared definitions.

The file defines several key macro categories:

  • Boolean Dispatch Macros: AT_DISPATCH_BOOL and AT_DISPATCH_BOOL2 turn runtime boolean values into compile-time template parameters. Each macro generates both code paths (true and false) with the flag bound as a constexpr value, so within each path the compiler can optimize away the branches that flag rules out.
  • Type Dispatch Macros: CPU_DISPATCH_PACKED_TYPES dispatches across BFloat16, Half, int8_t, and Float8_e4m3fn types using a packed_t typedef. CPU_DISPATCH_REDUCED_FLOATING_TYPES_EXT handles mixed-dtype dispatch where the primary type (scalar_t for input/output/weight) and secondary type (param_t for bias) can differ.
  • Validation Macros: CHECK_CPU, CHECK_CONTIGUOUS, CHECK_INPUT, CHECK_DIM, CHECK_EQ, and CHECK_GE, together with the CHECK_INPUT_SHAPE_DTYPE function template, provide tensor validation with informative error messages.
  • Parallel Routines: parallel_for (with balance211 partitioning), documented and provided alongside at::parallel_for, for parallelizing kernel workloads across threads.
  • Helper Functions: data_index_init and data_index_step for multi-dimensional index decomposition, and div_up for ceiling integer division.
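
The boolean dispatch idea in the list above can be sketched without ATen. This is a minimal illustration, not the header's actual definition: SKETCH_DISPATCH_BOOL, scale_kernel, and run_scale are hypothetical names, and the macro body is an assumption about how a runtime flag can be bound to a constexpr value inside a lambda. It assumes a C++17 compiler.

```cpp
#include <cassert>

// Hypothetical stand-in for a boolean dispatch macro: the runtime value BOOL_V
// selects one of two branches, and each branch exposes BOOL_NAME as a
// constexpr bool to the lambda passed in __VA_ARGS__.
#define SKETCH_DISPATCH_BOOL(BOOL_V, BOOL_NAME, ...) \
  [&] {                                              \
    if (BOOL_V) {                                    \
      constexpr bool BOOL_NAME = true;               \
      return __VA_ARGS__();                          \
    } else {                                         \
      constexpr bool BOOL_NAME = false;              \
      return __VA_ARGS__();                          \
    }                                                \
  }()

// Kernel templated on a compile-time flag. `if constexpr` discards the branch
// that does not apply to the chosen specialization.
template <bool has_bias>
int scale_kernel(int x, int bias) {
  if constexpr (has_bias) {
    return 2 * x + bias;
  } else {
    return 2 * x;
  }
}

// Runtime entry point: the flag is only known at run time, yet the kernel
// receives it as a template argument.
inline int run_scale(int x, int bias, bool use_bias) {
  return SKETCH_DISPATCH_BOOL(use_bias, has_bias, [&] {
    return scale_kernel<has_bias>(x, bias);
  });
}

// Small usage check for both flag values.
inline void run_scale_demo() {
  assert(run_scale(3, 1, true) == 7);
  assert(run_scale(3, 1, false) == 6);
}
```

Both specializations of scale_kernel are compiled, which is the cost of this pattern: code size grows with each dispatched flag, in exchange for branch-free inner loops.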

Usage

Include this header in any new CPU kernel implementation file. Use the dispatch macros to handle multiple data types, the validation macros to check input tensors, and the parallel routines for thread-level parallelism.
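
To make the thread-level parallelism and index helpers more concrete, here is a standalone sketch of how a flat work range might be split across threads and decomposed into multi-dimensional indices. All names ending in _sketch, and visit_slice, are hypothetical. The partitioning rule (equal ceil-sized chunks, clamped to the range) is an assumption and may differ from what balance211 in common.h actually does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>

// Ceiling integer division for a non-negative numerator and positive divisor.
template <typename T>
inline T div_up_sketch(T a, T b) {
  assert(b > 0);
  return (a + b - 1) / b;
}

// Assumed partitioning: thread `ith` of `nth` gets a chunk of ceil(n / nth)
// items; both bounds are clamped so trailing threads may get an empty range.
template <typename T>
inline void partition_sketch(T n, T nth, T ith, T& begin, T& end) {
  T chunk = div_up_sketch(n, nth);
  begin = std::min(ith * chunk, n);
  end = std::min(begin + chunk, n);
}

// Decompose a flat offset into per-dimension indices (last dimension fastest).
template <typename T>
inline T index_init_sketch(T offset) {
  return offset;
}

template <typename T, typename... Args>
inline T index_init_sketch(T offset, T& x, const T& X, Args&&... args) {
  offset = index_init_sketch(offset, std::forward<Args>(args)...);
  x = offset % X;
  return offset / X;
}

// Advance the indices by one element, carrying into slower dimensions.
inline bool index_step_sketch() {
  return true;
}

template <typename T, typename... Args>
inline bool index_step_sketch(T& x, const T& X, Args&&... args) {
  if (index_step_sketch(std::forward<Args>(args)...)) {
    x = (x + 1 == X) ? 0 : x + 1;
    return x == 0;
  }
  return false;
}

// Usage: walk the slice of a [rows, cols] space owned by one thread and return
// the sum of the flat indices rebuilt from (r, c).
inline std::int64_t visit_slice(std::int64_t rows, std::int64_t cols,
                                std::int64_t nth, std::int64_t ith) {
  std::int64_t begin = 0, end = 0;
  partition_sketch(rows * cols, nth, ith, begin, end);
  std::int64_t r = 0, c = 0;
  index_init_sketch(begin, r, rows, c, cols);
  std::int64_t sum = 0;
  for (std::int64_t i = begin; i < end; ++i) {
    assert(r * cols + c == i);  // decomposed indices match the flat offset
    sum += r * cols + c;
    index_step_sketch(r, rows, c, cols);
  }
  return sum;
}
```

Inside a real kernel the loop body would index tensors with (r, c) instead of summing, and the thread id and thread count would come from the parallel runtime rather than being passed in by hand.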

Code Reference

Source Location

Signature

// Boolean dispatch macros
#define AT_DISPATCH_BOOL(BOOL_V, BOOL_NAME, ...)
#define AT_DISPATCH_BOOL2(BOOL_V1, BOOL_NAME1, BOOL_V2, BOOL_NAME2, ...)

// Type dispatch macros
#define CPU_DISPATCH_PACKED_TYPES(TYPE, ...)
#define CPU_DISPATCH_REDUCED_FLOATING_TYPES_EXT(TYPE1, TYPE2, ...)

// Validation macros
#define CHECK_CPU(x)
#define CHECK_CONTIGUOUS(x)
#define CHECK_INPUT(x)
#define CHECK_DIM(d, x)
#define CHECK_EQ(a, b)
#define CHECK_GE(a, b)

template <bool is_only_lastdim_contiguous>
static inline void CHECK_INPUT_SHAPE_DTYPE(
    const at::Tensor& tensor,
    const at::IntArrayRef sizes,
    at::ScalarType st);

Import

#include "common.h"

I/O Contract

Inputs

Name | Type | Required | Description
ATen/ATen.h | Header | Yes | Core ATen tensor library
ATen/Parallel.h | Header | Yes | Parallel execution utilities (at::parallel_for)
ATen/record_function.h | Header | Yes | Performance profiling hooks
omp.h | Header | Conditional | OpenMP support, included when _OPENMP is defined

Outputs

Name | Type | Description
Dispatch macros | Preprocessor | Generate type-specialized code paths from runtime type info
Validation macros | Preprocessor | Provide tensor shape, dtype, device, and contiguity checks
Helper functions | Inline | Common utilities for indexing and parallelism

Usage Examples

Type Dispatch for BFloat16/Half/INT8/FP8

CPU_DISPATCH_PACKED_TYPES(input.scalar_type(), [&] {
  // packed_t is now one of: at::BFloat16, at::Half, int8_t, at::Float8_e4m3fn
  my_kernel<packed_t>(input.data_ptr<packed_t>(), output.data_ptr<packed_t>(), N);
});
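
How a macro can make packed_t visible inside the lambda is easier to see in a standalone sketch. The version below does not use ATen: DTypeSketch is a hypothetical stand-in for at::ScalarType, builtin types stand in for the packed types, and SKETCH_DISPATCH_TYPES, bytes_for, and buffer_bytes are illustrative names. The switch-based structure is an assumption about the general technique, not a copy of CPU_DISPATCH_PACKED_TYPES.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical runtime type tag standing in for at::ScalarType.
enum class DTypeSketch { Int8, Int32, Float32 };

// Each case defines `packed_t` and then invokes the lambda, so the lambda body
// is instantiated once per supported type. Unknown tags throw.
#define SKETCH_DISPATCH_TYPES(TYPE, ...)           \
  [&] {                                            \
    switch (TYPE) {                                \
      case DTypeSketch::Int8: {                    \
        using packed_t = std::int8_t;              \
        return __VA_ARGS__();                      \
      }                                            \
      case DTypeSketch::Int32: {                   \
        using packed_t = std::int32_t;             \
        return __VA_ARGS__();                      \
      }                                            \
      case DTypeSketch::Float32: {                 \
        using packed_t = float;                    \
        return __VA_ARGS__();                      \
      }                                            \
    }                                              \
    throw std::runtime_error("unsupported dtype"); \
  }()

// A type-templated "kernel": here it only reports the buffer size in bytes.
template <typename T>
std::size_t bytes_for(std::size_t n) {
  return n * sizeof(T);
}

// Runtime entry point: picks the instantiation matching the runtime tag.
inline std::size_t buffer_bytes(DTypeSketch dtype, std::size_t n) {
  return SKETCH_DISPATCH_TYPES(dtype, [&] { return bytes_for<packed_t>(n); });
}

// Small usage check.
inline void buffer_bytes_demo() {
  assert(buffer_bytes(DTypeSketch::Int8, 10) == 10);
  assert(buffer_bytes(DTypeSketch::Int32, 10) == 10 * sizeof(std::int32_t));
}
```

Because the lambda text is pasted into every case, every instantiation must return the same type, which is why the sketch returns std::size_t in all branches.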

Input Validation

CHECK_INPUT(tensor);        // Checks CPU device + contiguous
CHECK_DIM(2, tensor);       // Checks tensor is 2D
CHECK_EQ(tensor.size(1), hidden_size);  // Checks dimension size
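
As a rough illustration of what an "informative error message" from a check macro can look like, the following standalone sketch throws an exception that names the failing expressions and both values. SKETCH_CHECK_EQ and check_hidden_size are hypothetical; the use of std::runtime_error and the exact message format are assumptions, not the behavior of the real CHECK_EQ.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical equality check: evaluates each operand once and, on mismatch,
// throws with the stringified expressions and their runtime values.
#define SKETCH_CHECK_EQ(a, b)                                                \
  do {                                                                       \
    const auto lhs_ = (a);                                                   \
    const auto rhs_ = (b);                                                   \
    if (!(lhs_ == rhs_)) {                                                   \
      std::ostringstream msg_;                                               \
      msg_ << "CHECK_EQ(" #a ", " #b ") failed: " << lhs_ << " vs " << rhs_; \
      throw std::runtime_error(msg_.str());                                  \
    }                                                                        \
  } while (0)

// Usage: returns "ok" when the sizes agree, otherwise the error text.
inline std::string check_hidden_size(long actual, long expected) {
  try {
    SKETCH_CHECK_EQ(actual, expected);
    return "ok";
  } catch (const std::runtime_error& e) {
    return e.what();
  }
}

// Small usage check.
inline void check_hidden_size_demo() {
  assert(check_hidden_size(4096, 4096) == "ok");
  assert(check_hidden_size(1024, 4096) != "ok");
}
```

Including both the source expressions and the observed values in the message lets a caller see at a glance which dimension was wrong and by how much.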
