Implementation: Sgl_project_Sglang CPU Normalization
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, CPU Kernels |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Implements CPU-optimized normalization kernels including L2 normalization, RMS normalization (with optional fused residual addition), and LayerNorm for LLM inference.
Description
This file provides multiple normalization implementations, each using SIMD vectorization via at::vec::Vectorized with float32 intermediate computation for numerical stability on BFloat16/Half inputs.
The internal kernel functions include:
- l2norm_kernel_impl -- Computes L2 normalization (used by Llama4TextL2Norm) by accumulating the sum of squares, then scaling by 1/sqrt(sum/hidden_size + eps).
- rmsnorm_kernel_impl -- Implements RMS normalization with element-wise weight multiplication, templated with func_t and vec_func_t for custom functional transforms (identity for standard RMSNorm, x + 1 for Gemma-style).
- gemma3_rmsnorm_kernel_4d_impl -- Specialized RMSNorm for 4D tensors used in Gemma3 models.
- fused_add_rmsnorm_kernel_impl -- Fuses residual addition with RMS normalization in a single pass.
- fused_rmsnorm_gated_kernel_impl -- Fuses RMSNorm with gated activation for architectures that gate the normalized output.
- fused_add_layernorm_kernel_impl -- Fuses residual addition with LayerNorm (mean-subtraction + variance normalization).
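The per-row math behind the L2 norm and RMSNorm kernels above can be sketched in scalar form. This is an illustrative reference only (the helper names `l2norm_ref`/`rmsnorm_ref` are invented here; the real kernels use `at::vec` SIMD with float32 accumulation); the `offset` parameter stands in for the templated `func_t` transform (0 for standard RMSNorm, +1 for Gemma-style):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar reference for the l2norm kernel's math:
// out = x * 1/sqrt(sum(x^2)/hidden_size + eps)
std::vector<float> l2norm_ref(const std::vector<float>& x, float eps) {
  float ss = 0.f;
  for (float v : x) ss += v * v;  // sum of squares (float32 accumulator)
  float scale = 1.f / std::sqrt(ss / x.size() + eps);
  std::vector<float> out(x.size());
  for (size_t i = 0; i < x.size(); ++i) out[i] = x[i] * scale;
  return out;
}

// Scalar reference for the rmsnorm kernel's math: same scale as above,
// then element-wise weight. `offset` models the templated transform
// (0 for standard RMSNorm, +1 for Gemma-style weight + 1).
std::vector<float> rmsnorm_ref(const std::vector<float>& x,
                               const std::vector<float>& w,
                               float eps, float offset = 0.f) {
  float ss = 0.f;
  for (float v : x) ss += v * v;
  float scale = 1.f / std::sqrt(ss / x.size() + eps);
  std::vector<float> out(x.size());
  for (size_t i = 0; i < x.size(); ++i)
    out[i] = x[i] * scale * (w[i] + offset);
  return out;
}
```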
All kernels parallelize across the batch dimension with at::parallel_for. The reduction (sum of squares) uses vec_reduce_sum for efficient horizontal vector summation. A note in the code explicitly warns against using at::vec::map<> on bfloat16/half types.
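Why the float32 intermediate matters can be shown with a small simulation. The sketch below (invented helper names, and bfloat16 approximated by mantissa truncation rather than the round-to-nearest used by real hardware) accumulates a sum of squares once in simulated bf16 and once in float32; the bf16 accumulator stalls once the running sum exceeds its 8-bit mantissa range:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Crude bf16 simulation: drop the low 16 mantissa bits of a float.
// (Real bf16 rounds to nearest even; truncation is close enough here.)
static float to_bf16(float x) {
  uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  bits &= 0xFFFF0000u;  // keep sign, exponent, top 7 mantissa bits
  std::memcpy(&x, &bits, sizeof(x));
  return x;
}

// Sum of squares with a bf16-precision accumulator: once the running
// sum is large, adding 1.0 no longer changes it (absorption).
float sum_sq_bf16(const std::vector<float>& v) {
  float acc = 0.f;
  for (float x : v) acc = to_bf16(acc + to_bf16(x) * to_bf16(x));
  return acc;
}

// Sum of squares with a float32 accumulator, as the kernels use.
float sum_sq_fp32(const std::vector<float>& v) {
  float acc = 0.f;
  for (float x : v) acc += to_bf16(x) * to_bf16(x);
  return acc;
}
```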
The public API functions exposed are: l2norm_cpu, rmsnorm_cpu, layernorm_cpu, gemma_rmsnorm_cpu, gemma3_rmsnorm_cpu, fused_rmsnorm_gated_cpu, fused_add_rmsnorm_cpu, gemma_fused_add_rmsnorm_cpu, and fused_add_layernorm_cpu.
Usage
Use these kernels for all normalization operations during CPU LLM inference. RMSNorm is used by LLaMA, Mistral, and Qwen models. LayerNorm is used by GPT-style architectures. L2Norm is used by Llama4. Gemma-style RMSNorm (with +1 offset) is used by Gemma models. The fused variants reduce memory bandwidth by combining residual addition with normalization in a single pass.
Code Reference
Source Location
- Repository: Sgl_project_Sglang
- File: sgl-kernel/csrc/cpu/norm.cpp
- Lines: 1-811
Signature
// Public API functions
at::Tensor l2norm_cpu(at::Tensor& input, double eps);
at::Tensor rmsnorm_cpu(at::Tensor& input, at::Tensor& weight, double eps);
void layernorm_cpu(at::Tensor& input, at::Tensor& weight, double eps);
at::Tensor gemma_rmsnorm_cpu(at::Tensor& input, at::Tensor& weight, double eps);
at::Tensor gemma3_rmsnorm_cpu(at::Tensor& input, at::Tensor& weight, double eps);
at::Tensor fused_rmsnorm_gated_cpu(
at::Tensor& input, at::Tensor& weight, at::Tensor& gate, double eps);
void fused_add_rmsnorm_cpu(
at::Tensor& input, at::Tensor& residual, at::Tensor& weight, double eps);
void gemma_fused_add_rmsnorm_cpu(
at::Tensor& input, at::Tensor& residual, at::Tensor& weight, double eps);
void fused_add_layernorm_cpu(
at::Tensor& input, at::Tensor& residual, at::Tensor& weight, double eps);
Import
#include "common.h"
#include "vec.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input | at::Tensor [batch_size, hidden_size] | Yes | Input tensor to normalize (BFloat16 or Half) |
| weight | at::Tensor [hidden_size] | Depends | Normalization weight vector (not needed for l2norm) |
| residual | at::Tensor [batch_size, hidden_size] | Depends | Residual tensor for fused add variants (modified in-place) |
| gate | at::Tensor [batch_size, hidden_size] | Depends | Gating tensor for fused_rmsnorm_gated |
| eps | double | Yes | Epsilon for numerical stability (typically 1e-5 or 1e-6) |
Outputs
| Name | Type | Description |
|---|---|---|
| output | at::Tensor [batch_size, hidden_size] | Normalized output tensor (some functions return a new tensor, others modify input in-place) |
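The in-place contract of the fused add variants can be pinned down with a scalar sketch (illustrative `fused_add_rmsnorm_ref` name, single row, no SIMD): `residual` is overwritten with `input + residual`, and `input` is overwritten with the RMS-normalized, weight-scaled result of that sum:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar sketch of the fused add + RMSNorm contract:
//   residual <- input + residual
//   input    <- rmsnorm(residual) * weight
void fused_add_rmsnorm_ref(std::vector<float>& input,
                           std::vector<float>& residual,
                           const std::vector<float>& weight,
                           float eps) {
  float ss = 0.f;
  for (size_t i = 0; i < input.size(); ++i) {
    residual[i] += input[i];              // residual updated in place
    ss += residual[i] * residual[i];      // reduce over the summed row
  }
  float scale = 1.f / std::sqrt(ss / input.size() + eps);
  for (size_t i = 0; i < input.size(); ++i)
    input[i] = residual[i] * scale * weight[i];  // input overwritten
}
```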
Usage Examples
// L2 normalization (Llama4)
at::Tensor l2_out = l2norm_cpu(input, /*eps=*/1e-5);
// Standard RMSNorm (LLaMA, Mistral)
at::Tensor rms_out = rmsnorm_cpu(input, weight, /*eps=*/1e-5);
// Fused residual add + RMSNorm (single memory pass)
fused_add_rmsnorm_cpu(input, residual, weight, /*eps=*/1e-5);
// After the call: input holds normalized(input + residual) * weight,
// and residual has been updated in place to input + residual
// LayerNorm (GPT-style, modifies input in-place)
layernorm_cpu(input, weight, /*eps=*/1e-5);
// Gemma-style RMSNorm (scales by weight + 1)
at::Tensor gemma_out = gemma_rmsnorm_cpu(input, weight, /*eps=*/1e-5);
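For completeness, the mean-subtraction + variance math used by the LayerNorm path can be sketched in scalar form as well (illustrative `layernorm_ref` name, single row; the signature above takes only a weight, so no bias term is shown):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar reference for LayerNorm over one row:
//   out = (x - mean(x)) / sqrt(var(x) + eps) * weight
std::vector<float> layernorm_ref(const std::vector<float>& x,
                                 const std::vector<float>& w, float eps) {
  float mean = 0.f;
  for (float v : x) mean += v;
  mean /= x.size();
  float var = 0.f;
  for (float v : x) var += (v - mean) * (v - mean);
  var /= x.size();  // population variance, as is standard for LayerNorm
  float inv = 1.f / std::sqrt(var + eps);
  std::vector<float> out(x.size());
  for (size_t i = 0; i < x.size(); ++i)
    out[i] = (x[i] - mean) * inv * w[i];
  return out;
}
```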