Implementation: Sgl_project_Sglang CPU Normalization
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, CPU Kernels |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Implements CPU-optimized normalization kernels including L2 normalization, RMS normalization (with optional fused residual addition), and LayerNorm for LLM inference.
Description
This file provides multiple normalization implementations, each using SIMD vectorization via at::vec::Vectorized with float32 intermediate computation for numerical stability on BFloat16/Half inputs.
The internal kernel functions include:
- l2norm_kernel_impl -- Computes L2 normalization (used by Llama4TextL2Norm) by accumulating the sum of squares, then scaling by 1/sqrt(sum/hidden_size + eps).
- rmsnorm_kernel_impl -- Implements RMS normalization with element-wise weight multiplication, templated with func_t and vec_func_t for custom functional transforms (identity for standard RMSNorm, x + 1 for Gemma-style).
- gemma3_rmsnorm_kernel_4d_impl -- Specialized RMSNorm for 4D tensors used in Gemma3 models.
- fused_add_rmsnorm_kernel_impl -- Fuses residual addition with RMS normalization in a single pass.
- fused_rmsnorm_gated_kernel_impl -- Fuses RMSNorm with gated activation for architectures that gate the normalized output.
- fused_add_layernorm_kernel_impl -- Fuses residual addition with LayerNorm (mean-subtraction + variance normalization).
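The per-row math behind the L2 norm and RMSNorm kernels above can be sketched in scalar form. This is an illustrative reference only (the helper names `l2norm_ref`/`rmsnorm_ref` are invented here; the real kernels use `at::vec` SIMD with float32 accumulation); the `offset` parameter stands in for the templated `func_t` transform (0 for standard RMSNorm, +1 for Gemma-style):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar reference for the l2norm kernel's math:
// out = x * 1/sqrt(sum(x^2)/hidden_size + eps)
std::vector<float> l2norm_ref(const std::vector<float>& x, float eps) {
  float ss = 0.f;
  for (float v : x) ss += v * v;  // sum of squares (float32 accumulator)
  float scale = 1.f / std::sqrt(ss / x.size() + eps);
  std::vector<float> out(x.size());
  for (size_t i = 0; i < x.size(); ++i) out[i] = x[i] * scale;
  return out;
}

// Scalar reference for the rmsnorm kernel's math: same scale as above,
// then element-wise weight. `offset` models the templated transform
// (0 for standard RMSNorm, +1 for Gemma-style weight + 1).
std::vector<float> rmsnorm_ref(const std::vector<float>& x,
                               const std::vector<float>& w,
                               float eps, float offset = 0.f) {
  float ss = 0.f;
  for (float v : x) ss += v * v;
  float scale = 1.f / std::sqrt(ss / x.size() + eps);
  std::vector<float> out(x.size());
  for (size_t i = 0; i < x.size(); ++i)
    out[i] = x[i] * scale * (w[i] + offset);
  return out;
}
```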
All kernels parallelize across the batch dimension with at::parallel_for. The reduction (sum of squares) uses vec_reduce_sum for efficient horizontal vector summation. A note in the code explicitly warns against using at::vec::map<> on bfloat16/half types.
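Why the float32 intermediate matters can be shown with a small simulation. The sketch below (invented helper names, and bfloat16 approximated by mantissa truncation rather than the round-to-nearest used by real hardware) accumulates a sum of squares once in simulated bf16 and once in float32; the bf16 accumulator stalls once the running sum exceeds its 8-bit mantissa range:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Crude bf16 simulation: drop the low 16 mantissa bits of a float.
// (Real bf16 rounds to nearest even; truncation is close enough here.)
static float to_bf16(float x) {
  uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  bits &= 0xFFFF0000u;  // keep sign, exponent, top 7 mantissa bits
  std::memcpy(&x, &bits, sizeof(x));
  return x;
}

// Sum of squares with a bf16-precision accumulator: once the running
// sum is large, adding 1.0 no longer changes it (absorption).
float sum_sq_bf16(const std::vector<float>& v) {
  float acc = 0.f;
  for (float x : v) acc = to_bf16(acc + to_bf16(x) * to_bf16(x));
  return acc;
}

// Sum of squares with a float32 accumulator, as the kernels use.
float sum_sq_fp32(const std::vector<float>& v) {
  float acc = 0.f;
  for (float x : v) acc += to_bf16(x) * to_bf16(x);
  return acc;
}
```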
The public API functions exposed are: l2norm_cpu, rmsnorm_cpu, layernorm_cpu, gemma_rmsnorm_cpu, gemma3_rmsnorm_cpu, fused_rmsnorm_gated_cpu, fused_add_rmsnorm_cpu, gemma_fused_add_rmsnorm_cpu, and fused_add_layernorm_cpu.
Usage
Use these kernels for all normalization operations during CPU LLM inference. RMSNorm is used by LLaMA, Mistral, and Qwen models. LayerNorm is used by GPT-style architectures. L2Norm is used by Llama4. Gemma-style RMSNorm (with +1 offset) is used by Gemma models. The fused variants reduce memory bandwidth by combining residual addition with normalization in a single pass.
Code Reference
Source Location
- Repository: Sgl_project_Sglang
- File: sgl-kernel/csrc/cpu/norm.cpp
- Lines: 1-811
Signature
// Public API functions
at::Tensor l2norm_cpu(at::Tensor& input, double eps);
at::Tensor rmsnorm_cpu(at::Tensor& input, at::Tensor& weight, double eps);
void layernorm_cpu(at::Tensor& input, at::Tensor& weight, double eps);
at::Tensor gemma_rmsnorm_cpu(at::Tensor& input, at::Tensor& weight, double eps);
at::Tensor gemma3_rmsnorm_cpu(at::Tensor& input, at::Tensor& weight, double eps);
at::Tensor fused_rmsnorm_gated_cpu(
at::Tensor& input, at::Tensor& weight, at::Tensor& gate, double eps);
void fused_add_rmsnorm_cpu(
at::Tensor& input, at::Tensor& residual, at::Tensor& weight, double eps);
void gemma_fused_add_rmsnorm_cpu(
at::Tensor& input, at::Tensor& residual, at::Tensor& weight, double eps);
void fused_add_layernorm_cpu(
at::Tensor& input, at::Tensor& residual, at::Tensor& weight, double eps);
Import
#include "common.h"
#include "vec.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input | at::Tensor [batch_size, hidden_size] | Yes | Input tensor to normalize (BFloat16 or Half) |
| weight | at::Tensor [hidden_size] | Depends | Normalization weight vector (not needed for l2norm) |
| residual | at::Tensor [batch_size, hidden_size] | Depends | Residual tensor for fused add variants (modified in-place) |
| gate | at::Tensor [batch_size, hidden_size] | Depends | Gating tensor for fused_rmsnorm_gated |
| eps | double | Yes | Epsilon for numerical stability (typically 1e-5 or 1e-6) |
Outputs
| Name | Type | Description |
|---|---|---|
| output | at::Tensor [batch_size, hidden_size] | Normalized output tensor (some functions return a new tensor, others modify input in-place) |
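The in-place contract of the fused add variants can be pinned down with a scalar sketch (illustrative `fused_add_rmsnorm_ref` name, single row, no SIMD): `residual` is overwritten with `input + residual`, and `input` is overwritten with the RMS-normalized, weight-scaled result of that sum:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar sketch of the fused add + RMSNorm contract:
//   residual <- input + residual
//   input    <- rmsnorm(residual) * weight
void fused_add_rmsnorm_ref(std::vector<float>& input,
                           std::vector<float>& residual,
                           const std::vector<float>& weight,
                           float eps) {
  float ss = 0.f;
  for (size_t i = 0; i < input.size(); ++i) {
    residual[i] += input[i];              // residual updated in place
    ss += residual[i] * residual[i];      // reduce over the summed row
  }
  float scale = 1.f / std::sqrt(ss / input.size() + eps);
  for (size_t i = 0; i < input.size(); ++i)
    input[i] = residual[i] * scale * weight[i];  // input overwritten
}
```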
Usage Examples
// L2 normalization (Llama4)
at::Tensor l2_out = l2norm_cpu(input, /*eps=*/1e-5);
// Standard RMSNorm (LLaMA, Mistral)
at::Tensor rms_out = rmsnorm_cpu(input, weight, /*eps=*/1e-5);
// Fused residual add + RMSNorm (single memory pass)
fused_add_rmsnorm_cpu(input, residual, weight, /*eps=*/1e-5);
// After the call: input holds normalized(input + residual) * weight,
// and residual has been updated in place to input + residual
// LayerNorm (GPT-style, modifies input in-place)
layernorm_cpu(input, weight, /*eps=*/1e-5);
// Gemma-style RMSNorm (scales by weight + 1)
at::Tensor gemma_out = gemma_rmsnorm_cpu(input, weight, /*eps=*/1e-5);
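For completeness, the mean-subtraction + variance math used by the LayerNorm path can be sketched in scalar form as well (illustrative `layernorm_ref` name, single row; the signature above takes only a weight, so no bias term is shown):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar reference for LayerNorm over one row:
//   out = (x - mean(x)) / sqrt(var(x) + eps) * weight
std::vector<float> layernorm_ref(const std::vector<float>& x,
                                 const std::vector<float>& w, float eps) {
  float mean = 0.f;
  for (float v : x) mean += v;
  mean /= x.size();
  float var = 0.f;
  for (float v : x) var += (v - mean) * (v - mean);
  var /= x.size();  // population variance, as is standard for LayerNorm
  float inv = 1.f / std::sqrt(var + eps);
  std::vector<float> out(x.size());
  for (size_t i = 0; i < x.size(); ++i)
    out[i] = (x[i] - mean) * inv * w[i];
  return out;
}
```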