Implementation: Sgl project Sglang CPU Torch Extension
| Knowledge Sources | Details |
|---|---|
| Domains | CPU_Inference, PyTorch_Extensions, Kernel_Registration |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Central PyTorch C++ extension registration file that declares and binds all CPU kernel operations to the sgl_kernel torch library for CPU dispatch.
Description
The torch_extension_cpu.cpp file serves as the main registration point for the entire CPU kernel library in SGLang. It forward-declares all CPU kernel functions across the following families:

- Activations: silu_and_mul, gelu_and_mul, gelu_tanh_and_mul
- Normalization: rmsnorm, layernorm, l2norm, fused_add_rmsnorm, fused_rmsnorm_gated
- TopK routing: topk_sigmoid, topk_softmax, grouped_topk, biased_grouped_topk
- Attention: decode_attention, extend_attention, flash_attn_varlen_func
- Linear attention: chunk_gated_delta_rule
- Quantized GEMM: int8_scaled_mm, fp8_scaled_mm, int4_scaled_mm
- Fused MoE: fused_experts, shared_expert
- Weight absorption: qkv_proj_with_rope
- Shared memory collectives: shm_allreduce, shm_allgather
- Rotary embedding

It then registers each function via TORCH_LIBRARY_FRAGMENT, pairing an m.def() schema declaration with an m.impl() binding that targets the torch::kCPU dispatch key. This lets PyTorch's dispatcher route calls to these CPU-specific implementations whenever the argument tensors reside on CPU.
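Conceptually, each def/impl pair populates a name-keyed function registry that is consulted at call time. A torch-free toy sketch of that idea (illustrative only; PyTorch's real dispatcher also keys on device, dtype, and autograd state, and the names below are hypothetical):

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Toy analogue of TORCH_LIBRARY_FRAGMENT registration: map an op name to
// its CPU implementation, then look it up when the op is called.
using Kernel = std::function<std::vector<float>(const std::vector<float>&)>;

struct Registry {
  std::unordered_map<std::string, Kernel> cpu_impls;

  // Analogue of m.impl("name", torch::kCPU, &fn).
  void impl(const std::string& name, Kernel k) {
    cpu_impls[name] = std::move(k);
  }

  // Analogue of the dispatcher routing a call to the registered kernel.
  std::vector<float> call(const std::string& name,
                          const std::vector<float>& in) const {
    return cpu_impls.at(name)(in);
  }
};

// Stand-in "CPU kernel": doubles every element.
std::vector<float> double_cpu(const std::vector<float>& in) {
  std::vector<float> out(in.size());
  for (size_t i = 0; i < in.size(); ++i) out[i] = 2.0f * in[i];
  return out;
}
```

In the real file the same shape appears as m.def() declaring the schema and m.impl() binding the function pointer under the torch::kCPU key.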
Usage
This file is compiled as part of the sgl_kernel C++ extension build. It is used automatically when Python code calls any sgl_kernel operation with CPU tensors; the PyTorch dispatcher routes to the registered CPU implementations.
Code Reference
Source Location
- Repository: Sgl_project_Sglang
- File: sgl-kernel/csrc/cpu/torch_extension_cpu.cpp
- Lines: 1-546
Signature
```cpp
// Activation kernels
at::Tensor silu_and_mul_cpu(at::Tensor& input);
at::Tensor gelu_tanh_and_mul_cpu(const at::Tensor& input);
at::Tensor gelu_and_mul_cpu(const at::Tensor& input);

// Normalization kernels
at::Tensor rmsnorm_cpu(at::Tensor& input, at::Tensor& weight, double eps);
void layernorm_cpu(at::Tensor& input, at::Tensor& weight, double eps);
at::Tensor l2norm_cpu(at::Tensor& input, double eps);
void fused_add_rmsnorm_cpu(at::Tensor& input, at::Tensor& residual,
                           at::Tensor& weight, double eps);

// TopK routing
std::tuple<at::Tensor, at::Tensor> topk_sigmoid_cpu(
    at::Tensor& hidden_states, at::Tensor& gating_output,
    int64_t topk, bool renormalize);

// Attention
void decode_attention_cpu(at::Tensor& query, at::Tensor& k_cache,
                          at::Tensor& v_cache, at::Tensor& output, at::Tensor& key,
                          at::Tensor& value, at::Tensor& loc, at::Tensor& attn_logits,
                          at::Tensor& req_to_token, at::Tensor& req_pool_indices,
                          at::Tensor& seq_lens, double sm_scale, double logit_cap);

// Quantized GEMM
at::Tensor int8_scaled_mm_cpu(at::Tensor& mat1, at::Tensor& mat2,
                              at::Tensor& scales1, at::Tensor& scales2,
                              const std::optional<at::Tensor>& bias,
                              at::ScalarType out_dtype, bool is_vnni);

// Fused MoE
at::Tensor fused_experts_cpu(at::Tensor& hidden_states, at::Tensor& w1,
                             at::Tensor& w2, at::Tensor& topk_weights, at::Tensor& topk_ids,
                             bool inplace, int64_t moe_comp_method, ...);

// Library registration macro
TORCH_LIBRARY_FRAGMENT(sgl_kernel, m) { ... }
```
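The normalization signatures above take an eps parameter for numerical stability. As a point of reference, the standard RMSNorm formula these ops compute is y = x / sqrt(mean(x^2) + eps) * weight; a scalar sketch (illustrative only, not the vectorized CPU kernel):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Scalar reference of RMSNorm as used by rmsnorm-style ops:
// y[i] = x[i] / rms(x) * w[i], with rms(x) = sqrt(mean(x^2) + eps).
std::vector<float> rmsnorm_ref(const std::vector<float>& x,
                               const std::vector<float>& w, double eps) {
  double sum_sq = 0.0;
  for (float v : x) sum_sq += double(v) * v;  // accumulate in double
  const double inv_rms = 1.0 / std::sqrt(sum_sq / x.size() + eps);
  std::vector<float> y(x.size());
  for (size_t i = 0; i < x.size(); ++i) y[i] = float(x[i] * inv_rms) * w[i];
  return y;
}
```

fused_add_rmsnorm additionally adds the residual into the input before normalizing, which is why it mutates its arguments in place.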
Import
```cpp
#include <ATen/ATen.h>
#include <torch/all.h>
#include <torch/library.h>

#include "sgl_kernel_ops.h"
#include "shm.h"
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input | at::Tensor | Yes | Input tensor for activation, norm, or GEMM operations |
| weight | at::Tensor | Varies | Weight tensor for normalization or GEMM |
| eps | double | Varies | Epsilon value for numerical stability in normalization |
| topk | int64_t | Varies | Number of top-k experts to select for MoE routing |
| sm_scale | double | Varies | Softmax scaling factor for attention |
| is_vnni | bool | Varies | Whether to use VNNI-packed weight layout for GEMM |
Outputs
| Name | Type | Description |
|---|---|---|
| result | at::Tensor | Output tensor from activation, norm, or GEMM operations |
| (topk_weights, topk_ids) | std::tuple<at::Tensor, at::Tensor> | TopK routing weights and expert indices |
| (none) | void | In-place operations that write into their arguments (layernorm, fused_add_rmsnorm, decode_attention) |
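As a concrete example of the Tensor-in/Tensor-out contract, silu_and_mul follows the standard SwiGLU-style fusion: split the input's last dimension in half and return silu(gate) * up. A scalar sketch over a flat vector (illustrative only, not the SGLang kernel):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Reference silu_and_mul on a flat vector of length 2*d:
// out[i] = silu(input[i]) * input[d + i], with silu(x) = x * sigmoid(x).
std::vector<float> silu_and_mul_ref(const std::vector<float>& input) {
  const size_t d = input.size() / 2;
  std::vector<float> out(d);
  for (size_t i = 0; i < d; ++i) {
    const float gate = input[i];
    const float up = input[d + i];
    const float silu = gate / (1.0f + std::exp(-gate));  // x * sigmoid(x)
    out[i] = silu * up;
  }
  return out;
}
```

This is why the activation kernels halve the last dimension: a [.., 2d] input yields a [.., d] output.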
Usage Examples
```cpp
// Registration pattern used throughout the file
TORCH_LIBRARY_FRAGMENT(sgl_kernel, m) {
  // Define the operation schema
  m.def("silu_and_mul_cpu(Tensor input) -> Tensor");
  // Bind the CPU implementation
  m.impl("silu_and_mul_cpu", torch::kCPU, &silu_and_mul_cpu);

  // Quantized GEMM with scales: optional bias (Tensor?) and ScalarType in the schema
  m.def(
      "int8_scaled_mm_cpu(Tensor mat1, Tensor mat2, Tensor scales1, "
      "Tensor scales2, Tensor? bias, ScalarType out_dtype, "
      "bool is_vnni) -> Tensor");
  m.impl("int8_scaled_mm_cpu", torch::kCPU, &int8_scaled_mm_cpu);
}
```
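The int8_scaled_mm_cpu schema follows the common W8A8 recipe: accumulate int8 products in int32, then dequantize with per-row (scales1) and per-column (scales2) factors. A naive reference sketch of that math (illustrative only; the real kernel uses VNNI-packed weights and vectorized inner loops):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Naive reference for an int8 scaled matmul:
// out[m][n] = (sum_k a[m][k] * b[k][n]) * scales1[m] * scales2[n].
std::vector<float> int8_scaled_mm_ref(
    const std::vector<int8_t>& a, const std::vector<int8_t>& b,
    const std::vector<float>& scales1, const std::vector<float>& scales2,
    int M, int K, int N) {
  std::vector<float> out(size_t(M) * N);
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      int32_t acc = 0;  // widen to int32 so the dot product cannot overflow
      for (int k = 0; k < K; ++k)
        acc += int32_t(a[m * K + k]) * int32_t(b[k * N + n]);
      out[m * N + n] = float(acc) * scales1[m] * scales2[n];
    }
  }
  return out;
}
```

The optional bias in the real schema would simply be added per output column after dequantization.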