Implementation: Sgl project Sglang CPU Torch Extension
| Knowledge Sources | Details |
|---|---|
| Domains | CPU_Inference, PyTorch_Extensions, Kernel_Registration |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Central PyTorch C++ extension registration file that declares and binds all CPU kernel operations to the sgl_kernel torch library for CPU dispatch.
Description
The torch_extension_cpu.cpp file serves as the main registration point for the entire CPU kernel library in SGLang. It forward-declares all CPU kernel functions across the following families:

- Activations: silu_and_mul, gelu_and_mul, gelu_tanh_and_mul
- Normalization: rmsnorm, layernorm, l2norm, fused_add_rmsnorm, fused_rmsnorm_gated
- TopK routing: topk_sigmoid, topk_softmax, grouped_topk, biased_grouped_topk
- Attention: decode_attention, extend_attention, flash_attn_varlen_func
- Linear attention: chunk_gated_delta_rule
- Quantized GEMM: int8_scaled_mm, fp8_scaled_mm, int4_scaled_mm
- Fused MoE: fused_experts, shared_expert
- Weight absorption: qkv_proj_with_rope
- Shared memory collectives: shm_allreduce, shm_allgather
- Rotary embedding

It then registers each function via TORCH_LIBRARY_FRAGMENT, pairing an m.def() schema declaration with an m.impl() binding that targets the torch::kCPU dispatch key. This lets PyTorch's dispatcher route calls to these CPU-specific implementations whenever the argument tensors reside on CPU.
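Conceptually, each def/impl pair populates a name-keyed function registry that is consulted at call time. A torch-free toy sketch of that idea (illustrative only; PyTorch's real dispatcher also keys on device, dtype, and autograd state, and the names below are hypothetical):

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Toy analogue of TORCH_LIBRARY_FRAGMENT registration: map an op name to
// its CPU implementation, then look it up when the op is called.
using Kernel = std::function<std::vector<float>(const std::vector<float>&)>;

struct Registry {
  std::unordered_map<std::string, Kernel> cpu_impls;

  // Analogue of m.impl("name", torch::kCPU, &fn).
  void impl(const std::string& name, Kernel k) {
    cpu_impls[name] = std::move(k);
  }

  // Analogue of the dispatcher routing a call to the registered kernel.
  std::vector<float> call(const std::string& name,
                          const std::vector<float>& in) const {
    return cpu_impls.at(name)(in);
  }
};

// Stand-in "CPU kernel": doubles every element.
std::vector<float> double_cpu(const std::vector<float>& in) {
  std::vector<float> out(in.size());
  for (size_t i = 0; i < in.size(); ++i) out[i] = 2.0f * in[i];
  return out;
}
```

In the real file the same shape appears as m.def() declaring the schema and m.impl() binding the function pointer under the torch::kCPU key.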
Usage
This file is compiled as part of the sgl_kernel C++ extension build. It is used automatically when Python code calls any sgl_kernel operation with CPU tensors; the PyTorch dispatcher routes to the registered CPU implementations.
Code Reference
Source Location
- Repository: Sgl_project_Sglang
- File: sgl-kernel/csrc/cpu/torch_extension_cpu.cpp
- Lines: 1-546
Signature
```cpp
// Activation kernels
at::Tensor silu_and_mul_cpu(at::Tensor& input);
at::Tensor gelu_tanh_and_mul_cpu(const at::Tensor& input);
at::Tensor gelu_and_mul_cpu(const at::Tensor& input);

// Normalization kernels
at::Tensor rmsnorm_cpu(at::Tensor& input, at::Tensor& weight, double eps);
void layernorm_cpu(at::Tensor& input, at::Tensor& weight, double eps);
at::Tensor l2norm_cpu(at::Tensor& input, double eps);
void fused_add_rmsnorm_cpu(at::Tensor& input, at::Tensor& residual,
                           at::Tensor& weight, double eps);

// TopK routing
std::tuple<at::Tensor, at::Tensor> topk_sigmoid_cpu(
    at::Tensor& hidden_states, at::Tensor& gating_output,
    int64_t topk, bool renormalize);

// Attention
void decode_attention_cpu(at::Tensor& query, at::Tensor& k_cache,
                          at::Tensor& v_cache, at::Tensor& output, at::Tensor& key,
                          at::Tensor& value, at::Tensor& loc, at::Tensor& attn_logits,
                          at::Tensor& req_to_token, at::Tensor& req_pool_indices,
                          at::Tensor& seq_lens, double sm_scale, double logit_cap);

// Quantized GEMM
at::Tensor int8_scaled_mm_cpu(at::Tensor& mat1, at::Tensor& mat2,
                              at::Tensor& scales1, at::Tensor& scales2,
                              const std::optional<at::Tensor>& bias,
                              at::ScalarType out_dtype, bool is_vnni);

// Fused MoE
at::Tensor fused_experts_cpu(at::Tensor& hidden_states, at::Tensor& w1,
                             at::Tensor& w2, at::Tensor& topk_weights, at::Tensor& topk_ids,
                             bool inplace, int64_t moe_comp_method, ...);

// Library registration macro
TORCH_LIBRARY_FRAGMENT(sgl_kernel, m) { ... }
```
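The normalization signatures above take an eps parameter for numerical stability. As a point of reference, the standard RMSNorm formula these ops compute is y = x / sqrt(mean(x^2) + eps) * weight; a scalar sketch (illustrative only, not the vectorized CPU kernel):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Scalar reference of RMSNorm as used by rmsnorm-style ops:
// y[i] = x[i] / rms(x) * w[i], with rms(x) = sqrt(mean(x^2) + eps).
std::vector<float> rmsnorm_ref(const std::vector<float>& x,
                               const std::vector<float>& w, double eps) {
  double sum_sq = 0.0;
  for (float v : x) sum_sq += double(v) * v;  // accumulate in double
  const double inv_rms = 1.0 / std::sqrt(sum_sq / x.size() + eps);
  std::vector<float> y(x.size());
  for (size_t i = 0; i < x.size(); ++i) y[i] = float(x[i] * inv_rms) * w[i];
  return y;
}
```

fused_add_rmsnorm additionally adds the residual into the input before normalizing, which is why it mutates its arguments in place.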
Import
```cpp
#include <ATen/ATen.h>
#include <torch/all.h>
#include <torch/library.h>

#include "sgl_kernel_ops.h"
#include "shm.h"
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input | at::Tensor | Yes | Input tensor for activation, norm, or GEMM operations |
| weight | at::Tensor | Varies | Weight tensor for normalization or GEMM |
| eps | double | Varies | Epsilon value for numerical stability in normalization |
| topk | int64_t | Varies | Number of top-k experts to select for MoE routing |
| sm_scale | double | Varies | Softmax scaling factor for attention |
| is_vnni | bool | Varies | Whether to use VNNI-packed weight layout for GEMM |
Outputs
| Name | Type | Description |
|---|---|---|
| result | at::Tensor | Output tensor from activation, norm, or GEMM operations |
| (topk_weights, topk_ids) | std::tuple<at::Tensor, at::Tensor> | TopK routing weights and expert indices |
| (none) | void | In-place operations that write into their arguments (layernorm, fused_add_rmsnorm, decode_attention) |
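As a concrete example of the Tensor-in/Tensor-out contract, silu_and_mul follows the standard SwiGLU-style fusion: split the input's last dimension in half and return silu(gate) * up. A scalar sketch over a flat vector (illustrative only, not the SGLang kernel):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Reference silu_and_mul on a flat vector of length 2*d:
// out[i] = silu(input[i]) * input[d + i], with silu(x) = x * sigmoid(x).
std::vector<float> silu_and_mul_ref(const std::vector<float>& input) {
  const size_t d = input.size() / 2;
  std::vector<float> out(d);
  for (size_t i = 0; i < d; ++i) {
    const float gate = input[i];
    const float up = input[d + i];
    const float silu = gate / (1.0f + std::exp(-gate));  // x * sigmoid(x)
    out[i] = silu * up;
  }
  return out;
}
```

This is why the activation kernels halve the last dimension: a [.., 2d] input yields a [.., d] output.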
Usage Examples
```cpp
// Registration pattern used throughout the file
TORCH_LIBRARY_FRAGMENT(sgl_kernel, m) {
  // Define the operation schema
  m.def("silu_and_mul_cpu(Tensor input) -> Tensor");
  // Bind the CPU implementation
  m.impl("silu_and_mul_cpu", torch::kCPU, &silu_and_mul_cpu);

  // Quantized GEMM with scales: optional bias (Tensor?) and ScalarType in the schema
  m.def(
      "int8_scaled_mm_cpu(Tensor mat1, Tensor mat2, Tensor scales1, "
      "Tensor scales2, Tensor? bias, ScalarType out_dtype, "
      "bool is_vnni) -> Tensor");
  m.impl("int8_scaled_mm_cpu", torch::kCPU, &int8_scaled_mm_cpu);
}
```
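The int8_scaled_mm_cpu schema follows the common W8A8 recipe: accumulate int8 products in int32, then dequantize with per-row (scales1) and per-column (scales2) factors. A naive reference sketch of that math (illustrative only; the real kernel uses VNNI-packed weights and vectorized inner loops):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Naive reference for an int8 scaled matmul:
// out[m][n] = (sum_k a[m][k] * b[k][n]) * scales1[m] * scales2[n].
std::vector<float> int8_scaled_mm_ref(
    const std::vector<int8_t>& a, const std::vector<int8_t>& b,
    const std::vector<float>& scales1, const std::vector<float>& scales2,
    int M, int K, int N) {
  std::vector<float> out(size_t(M) * N);
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      int32_t acc = 0;  // widen to int32 so the dot product cannot overflow
      for (int k = 0; k < K; ++k)
        acc += int32_t(a[m * K + k]) * int32_t(b[k * N + n]);
      out[m * N + n] = float(acc) * scales1[m] * scales2[n];
    }
  }
  return out;
}
```

The optional bias in the real schema would simply be added per output column after dequantization.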