Implementation: vLLM CPU LayerNorm
| Knowledge Sources | |
|---|---|
| Domains | Normalization, CPU_Inference |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Implements vectorized RMS layer normalization and fused add-RMS normalization for CPU-based transformer inference using SIMD and OpenMP parallelization.
Description
This file provides two core normalization operations: rms_norm computes Root Mean Square normalization over the hidden dimension, and fused_add_rms_norm combines a residual addition with RMS normalization in a single pass to reduce memory traffic. Both implementations use FP32Vec8 vectorization for SIMD-accelerated variance computation and normalized output generation, with OpenMP parallelization across tokens.
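As a scalar reference for the math only (the actual kernel uses FP32Vec8 SIMD lanes and OpenMP across tokens; names and layout below are illustrative assumptions, with row-major [num_tokens, hidden_size] FP32 buffers):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative scalar model of RMS normalization:
//   out[t][i] = input[t][i] * rsqrt(mean(input[t][:]^2) + epsilon) * weight[i]
// The variance accumulation is done in double here for clarity; the real
// vectorized kernel accumulates in FP32 SIMD registers.
void rms_norm_ref(std::vector<float>& out, const std::vector<float>& input,
                  const std::vector<float>& weight, double epsilon,
                  std::size_t num_tokens, std::size_t hidden_size) {
  for (std::size_t t = 0; t < num_tokens; ++t) {
    const float* x = input.data() + t * hidden_size;
    float* y = out.data() + t * hidden_size;
    double sum_sq = 0.0;
    for (std::size_t i = 0; i < hidden_size; ++i)
      sum_sq += static_cast<double>(x[i]) * static_cast<double>(x[i]);
    const float inv_rms =
        1.0f / std::sqrt(static_cast<float>(sum_sq / hidden_size + epsilon));
    for (std::size_t i = 0; i < hidden_size; ++i)
      y[i] = x[i] * inv_rms * weight[i];
  }
}
```

Note that, unlike classic LayerNorm, no mean is subtracted: only the root-mean-square of the row scales the values, which is why a single reduction pass over the hidden dimension suffices.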
Usage
These functions are compiled into the vLLM CPU extension and called from the Python layer as CPU backend implementations for RMSNorm layers. They are essential for LLaMA-family and other modern transformer models that use RMS normalization instead of LayerNorm.
Code Reference
Source Location
- Repository: vllm
- File: csrc/cpu/layernorm.cpp
- Lines: 1-117
Signature
void rms_norm(torch::Tensor& out, torch::Tensor& input,
torch::Tensor& weight, double epsilon);
void fused_add_rms_norm(torch::Tensor& input, torch::Tensor& residual,
torch::Tensor& weight, double epsilon);
Import
#include "cpu_types.hpp"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input | torch::Tensor | Yes | Input tensor [..., hidden_size] to be normalized |
| weight | torch::Tensor | Yes | Learnable scale parameters [hidden_size] |
| epsilon | double | Yes | Small constant for numerical stability in variance computation |
| residual | torch::Tensor | Yes (fused only) | Residual tensor [..., hidden_size] for fused add+norm variant |
| out | torch::Tensor | Yes (rms_norm only) | Pre-allocated output tensor [..., hidden_size] |
Outputs
| Name | Type | Description |
|---|---|---|
| out | torch::Tensor | Normalized output written in-place (rms_norm) |
| input | torch::Tensor | Normalized result written in-place (fused_add_rms_norm) |
| residual | torch::Tensor | Updated residual (input + residual) written in-place (fused_add_rms_norm) |
Usage Examples
// Standard RMS normalization
const int64_t num_tokens = 16;    // example dimensions
const int64_t hidden_size = 4096;
torch::Tensor input = torch::randn({num_tokens, hidden_size});
torch::Tensor weight = torch::ones({hidden_size});
torch::Tensor output = torch::empty_like(input);
rms_norm(output, input, weight, /*epsilon=*/1e-6);
// Fused add + RMS normalization (saves memory bandwidth)
torch::Tensor hidden_states = torch::randn({num_tokens, hidden_size});
torch::Tensor residual = torch::randn({num_tokens, hidden_size});
torch::Tensor norm_weight = torch::ones({hidden_size});
fused_add_rms_norm(hidden_states, residual, norm_weight, 1e-6);
// After call: residual = old_hidden_states + old_residual
// hidden_states = RMSNorm(residual) * weight
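The in-place contract shown in the comments above can be modeled in scalar form (an illustrative sketch of the semantics, not the SIMD implementation; names and layout are assumptions):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative scalar model of fused_add_rms_norm's in-place contract:
//   residual[t][i] += input[t][i]              (residual updated first)
//   input[t][i]     = RMSNorm(residual[t]) * weight[i]
// Fusing both steps means each row is read once instead of being written
// out and re-read between the add and the normalization.
void fused_add_rms_norm_ref(std::vector<float>& input,
                            std::vector<float>& residual,
                            const std::vector<float>& weight, double epsilon,
                            std::size_t num_tokens, std::size_t hidden_size) {
  for (std::size_t t = 0; t < num_tokens; ++t) {
    float* x = input.data() + t * hidden_size;
    float* r = residual.data() + t * hidden_size;
    double sum_sq = 0.0;
    for (std::size_t i = 0; i < hidden_size; ++i) {
      r[i] += x[i];  // residual = input + residual
      sum_sq += static_cast<double>(r[i]) * static_cast<double>(r[i]);
    }
    const float inv_rms =
        1.0f / std::sqrt(static_cast<float>(sum_sq / hidden_size + epsilon));
    for (std::size_t i = 0; i < hidden_size; ++i)
      x[i] = r[i] * inv_rms * weight[i];  // input = RMSNorm(residual) * weight
  }
}
```

This mirrors why the fused variant takes no separate output tensor: both of its tensor arguments are overwritten, with the updated residual available for the next decoder layer's skip connection.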