
Implementation:Sgl project Sglang CPU Normalization

From Leeroopedia


Knowledge Sources
Domains Machine Learning, CPU Kernels
Last Updated 2026-02-10 00:00 GMT

Overview

Implements CPU-optimized normalization kernels including L2 normalization, RMS normalization (with optional fused residual addition), and LayerNorm for LLM inference.

Description

This file provides multiple normalization implementations, each using SIMD vectorization via at::vec::Vectorized with float32 intermediate computation for numerical stability on BFloat16/Half inputs.
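One reason for the float32 intermediate is that accumulating a sum of squares directly in BFloat16 discards low-order bits almost immediately. The effect can be demonstrated in self-contained C++ by emulating bf16 rounding (this sketch is illustrative only; nothing in it comes from the actual file):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <vector>

// Emulate BFloat16 by rounding a float to the nearest value representable
// in the top 16 bits of its IEEE-754 encoding (round-half-up in bit space).
float to_bf16(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits = (bits + 0x8000u) & 0xFFFF0000u;  // round, then keep upper 16 bits
    float out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}

// Sum of squares with the accumulator kept in bf16 precision: once the
// running sum grows large enough, adding another small square is a no-op.
float sum_sq_bf16_acc(const std::vector<float>& x) {
    float acc = 0.f;
    for (float v : x) acc = to_bf16(acc + to_bf16(v) * to_bf16(v));
    return acc;
}

// Same reduction with a float32 accumulator (the strategy these kernels use):
// bf16 inputs, but no precision lost in the accumulation itself.
float sum_sq_f32_acc(const std::vector<float>& x) {
    float acc = 0.f;
    for (float v : x) acc += to_bf16(v) * to_bf16(v);
    return acc;
}
```

For a row of 4096 ones, the float32 accumulator returns the exact 4096 while the bf16 accumulator saturates far below it, which is exactly the failure mode the float32 intermediate avoids.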

The internal kernel functions include:

  • l2norm_kernel_impl -- Computes L2 norm normalization (used by Llama4TextL2Norm) by computing sum of squares, then scaling by 1/sqrt(sum/hidden_size + eps).
  • rmsnorm_kernel_impl -- Implements RMS normalization with element-wise weight multiplication, templated on func_t and vec_func_t for a transform applied to the weight (identity for standard RMSNorm, weight + 1 for Gemma-style).
  • gemma3_rmsnorm_kernel_4d_impl -- Specialized RMSNorm for 4D tensors used in Gemma3 models.
  • fused_add_rmsnorm_kernel_impl -- Fuses residual addition with RMS normalization in a single pass.
  • fused_rmsnorm_gated_kernel_impl -- Fuses RMSNorm with gated activation for architectures that gate the normalized output.
  • fused_add_layernorm_kernel_impl -- Fuses residual addition with LayerNorm (mean-subtraction + variance normalization).
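The math of the first two kernels can be stated as scalar reference functions (hedged sketches following the descriptions above; the real kernels are the SIMD implementations and these helper names are invented for illustration):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Scalar reference for l2norm: y = x * 1/sqrt(sum(x^2)/hidden_size + eps)
std::vector<float> l2norm_ref(const std::vector<float>& x, float eps) {
    float ss = 0.f;
    for (float v : x) ss += v * v;                    // sum of squares
    float scale = 1.f / std::sqrt(ss / x.size() + eps);
    std::vector<float> y(x.size());
    for (size_t i = 0; i < x.size(); ++i) y[i] = x[i] * scale;
    return y;
}

// Scalar reference for rmsnorm: y = x / sqrt(mean(x^2) + eps) * f(w),
// where f is identity for standard RMSNorm and (w + 1) for Gemma-style.
std::vector<float> rmsnorm_ref(const std::vector<float>& x,
                               const std::vector<float>& w,
                               float eps, bool gemma_style) {
    float ss = 0.f;
    for (float v : x) ss += v * v;
    float scale = 1.f / std::sqrt(ss / x.size() + eps);
    std::vector<float> y(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        y[i] = x[i] * scale * (gemma_style ? w[i] + 1.f : w[i]);
    return y;
}
```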

All kernels parallelize across the batch dimension with at::parallel_for. The reduction (sum of squares) uses vec_reduce_sum for efficient horizontal vector summation. A note in the code explicitly warns against using at::vec::map<> on bfloat16/half types.
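The parallelization shape, batch rows split across threads with an independent per-row reduction, can be mimicked in portable C++ with std::thread standing in for at::parallel_for (a sketch of the pattern only, not the ATen code; the fixed thread count is an assumption for the example):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <thread>
#include <vector>

// RMS-normalize each row of a [batch, hidden] buffer in place, splitting
// the batch dimension across threads -- the same grain as parallelizing
// over rows with at::parallel_for.
void rmsnorm_rows(std::vector<float>& data, int batch, int hidden, float eps) {
    auto worker = [&](int begin, int end) {
        for (int b = begin; b < end; ++b) {
            float* row = data.data() + static_cast<size_t>(b) * hidden;
            float ss = 0.f;                        // per-row sum of squares
            for (int i = 0; i < hidden; ++i) ss += row[i] * row[i];
            float scale = 1.f / std::sqrt(ss / hidden + eps);
            for (int i = 0; i < hidden; ++i) row[i] *= scale;
        }
    };
    int nthreads = 2;  // fixed for the sketch; ATen chooses this dynamically
    int chunk = (batch + nthreads - 1) / nthreads;
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) {
        int begin = t * chunk, end = std::min(batch, begin + chunk);
        if (begin < end) pool.emplace_back(worker, begin, end);
    }
    for (auto& th : pool) th.join();
}
```

Because rows are independent, no synchronization is needed beyond the final join; the real kernels additionally vectorize the inner loops, with vec_reduce_sum collapsing the vector accumulator horizontally.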

The public API functions exposed are: l2norm_cpu, rmsnorm_cpu, layernorm_cpu, gemma_rmsnorm_cpu, gemma3_rmsnorm_cpu, fused_rmsnorm_gated_cpu, fused_add_rmsnorm_cpu, gemma_fused_add_rmsnorm_cpu, and fused_add_layernorm_cpu.

Usage

Use these kernels for all normalization operations during CPU LLM inference. RMSNorm is used by LLaMA, Mistral, and Qwen models. LayerNorm is used by GPT-style architectures. L2Norm is used by Llama4. Gemma-style RMSNorm (with +1 offset) is used by Gemma models. The fused variants reduce memory bandwidth by combining residual addition with normalization in a single pass.
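The distinction between the two GPT/LLaMA-era norms above is that LayerNorm subtracts the per-row mean before normalizing while RMSNorm does not. A scalar reference for the LayerNorm math, weight-only with no bias to match the layernorm_cpu signature on this page (the helper name is invented for illustration):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Scalar LayerNorm reference: y = (x - mean(x)) / sqrt(var(x) + eps) * w.
// RMSNorm is the same formula with the mean-subtraction step removed.
std::vector<float> layernorm_ref(const std::vector<float>& x,
                                 const std::vector<float>& w, float eps) {
    float mean = 0.f;
    for (float v : x) mean += v;
    mean /= x.size();
    float var = 0.f;
    for (float v : x) var += (v - mean) * (v - mean);
    var /= x.size();
    float inv = 1.f / std::sqrt(var + eps);
    std::vector<float> y(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        y[i] = (x[i] - mean) * inv * w[i];
    return y;
}
```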

Code Reference

Source Location

Signature

// Public API functions
at::Tensor l2norm_cpu(at::Tensor& input, double eps);

at::Tensor rmsnorm_cpu(at::Tensor& input, at::Tensor& weight, double eps);

void layernorm_cpu(at::Tensor& input, at::Tensor& weight, double eps);

at::Tensor gemma_rmsnorm_cpu(at::Tensor& input, at::Tensor& weight, double eps);

at::Tensor gemma3_rmsnorm_cpu(at::Tensor& input, at::Tensor& weight, double eps);

at::Tensor fused_rmsnorm_gated_cpu(
    at::Tensor& input, at::Tensor& weight, at::Tensor& gate, double eps);

void fused_add_rmsnorm_cpu(
    at::Tensor& input, at::Tensor& residual, at::Tensor& weight, double eps);

void gemma_fused_add_rmsnorm_cpu(
    at::Tensor& input, at::Tensor& residual, at::Tensor& weight, double eps);

void fused_add_layernorm_cpu(
    at::Tensor& input, at::Tensor& residual, at::Tensor& weight, double eps);

Import

#include "common.h"
#include "vec.h"

I/O Contract

Inputs

  • input (at::Tensor [batch_size, hidden_size], required) -- Input tensor to normalize (BFloat16 or Half)
  • weight (at::Tensor [hidden_size], depends) -- Normalization weight vector (not needed for l2norm)
  • residual (at::Tensor [batch_size, hidden_size], depends) -- Residual tensor for the fused add variants (modified in-place)
  • gate (at::Tensor [batch_size, hidden_size], depends) -- Gating tensor for fused_rmsnorm_gated
  • eps (double, required) -- Epsilon for numerical stability (typically 1e-5 or 1e-6)

Outputs

  • output (at::Tensor [batch_size, hidden_size]) -- Normalized output tensor (some functions return a new tensor; others modify input in-place)

Usage Examples

// L2 normalization (Llama4)
at::Tensor l2_out = l2norm_cpu(input, /*eps=*/1e-5);

// Standard RMSNorm (LLaMA, Mistral)
at::Tensor rms_out = rmsnorm_cpu(input, weight, /*eps=*/1e-5);

// Fused residual add + RMSNorm (single memory pass)
fused_add_rmsnorm_cpu(input, residual, weight, /*eps=*/1e-5);
// After call: input contains normalized(input + residual) * weight,
//             residual is updated to input + residual

// LayerNorm (GPT-style, modifies input in-place)
layernorm_cpu(input, weight, /*eps=*/1e-5);

// Gemma-style RMSNorm (weight + 1)
at::Tensor gemma_out = gemma_rmsnorm_cpu(input, weight, /*eps=*/1e-5);
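The in-place contract of fused_add_rmsnorm_cpu noted in the comments above can be pinned down as a scalar reference (the helper name is hypothetical; the real kernel is the SIMD implementation, but the observable before/after semantics are what this page describes):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Contract: residual += input; then input = rmsnorm(residual) * weight.
// Both tensors are modified in place, in a single pass over the data.
void fused_add_rmsnorm_ref(std::vector<float>& input,
                           std::vector<float>& residual,
                           const std::vector<float>& weight, float eps) {
    float ss = 0.f;
    for (size_t i = 0; i < input.size(); ++i) {
        residual[i] += input[i];          // residual now holds input + residual
        ss += residual[i] * residual[i];  // reduce over the summed row
    }
    float scale = 1.f / std::sqrt(ss / input.size() + eps);
    for (size_t i = 0; i < input.size(); ++i)
        input[i] = residual[i] * scale * weight[i];  // normalized sum
}
```

Writing the updated residual back is what lets the next decoder layer consume it without a separate addition kernel, which is the memory-bandwidth saving the fused variants exist for.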
