Implementation: vLLM CPU LayerNorm
| Knowledge Sources | |
|---|---|
| Domains | Normalization, CPU_Inference |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Implements vectorized RMS layer normalization and fused add-RMS normalization for CPU-based transformer inference using SIMD and OpenMP parallelization.
Description
This file provides two core normalization operations: rms_norm computes Root Mean Square normalization over the hidden dimension, and fused_add_rms_norm combines a residual addition with RMS normalization in a single pass to reduce memory traffic. Both implementations use FP32Vec8 vectorization for SIMD-accelerated variance computation and normalized output generation, with OpenMP parallelization across tokens.
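As a scalar reference for the math only (the actual kernel uses FP32Vec8 SIMD lanes and OpenMP across tokens; names and layout below are illustrative assumptions, with row-major [num_tokens, hidden_size] FP32 buffers):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative scalar model of RMS normalization:
//   out[t][i] = input[t][i] * rsqrt(mean(input[t][:]^2) + epsilon) * weight[i]
// The variance accumulation is done in double here for clarity; the real
// vectorized kernel accumulates in FP32 SIMD registers.
void rms_norm_ref(std::vector<float>& out, const std::vector<float>& input,
                  const std::vector<float>& weight, double epsilon,
                  std::size_t num_tokens, std::size_t hidden_size) {
  for (std::size_t t = 0; t < num_tokens; ++t) {
    const float* x = input.data() + t * hidden_size;
    float* y = out.data() + t * hidden_size;
    double sum_sq = 0.0;
    for (std::size_t i = 0; i < hidden_size; ++i)
      sum_sq += static_cast<double>(x[i]) * static_cast<double>(x[i]);
    const float inv_rms =
        1.0f / std::sqrt(static_cast<float>(sum_sq / hidden_size + epsilon));
    for (std::size_t i = 0; i < hidden_size; ++i)
      y[i] = x[i] * inv_rms * weight[i];
  }
}
```

Note that, unlike classic LayerNorm, no mean is subtracted: only the root-mean-square of the row scales the values, which is why a single reduction pass over the hidden dimension suffices.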
Usage
These functions are compiled into the vLLM CPU extension and called from the Python layer as CPU backend implementations for RMSNorm layers. They are essential for LLaMA-family and other modern transformer models that use RMS normalization instead of LayerNorm.
Code Reference
Source Location
- Repository: vllm
- File: csrc/cpu/layernorm.cpp
- Lines: 1-117
Signature
void rms_norm(torch::Tensor& out, torch::Tensor& input,
torch::Tensor& weight, double epsilon);
void fused_add_rms_norm(torch::Tensor& input, torch::Tensor& residual,
torch::Tensor& weight, double epsilon);
Import
#include "cpu_types.hpp"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input | torch::Tensor | Yes | Input tensor [..., hidden_size] to be normalized |
| weight | torch::Tensor | Yes | Learnable scale parameters [hidden_size] |
| epsilon | double | Yes | Small constant for numerical stability in variance computation |
| residual | torch::Tensor | Yes (fused only) | Residual tensor [..., hidden_size] for fused add+norm variant |
| out | torch::Tensor | Yes (rms_norm only) | Pre-allocated output tensor [..., hidden_size] |
Outputs
| Name | Type | Description |
|---|---|---|
| out | torch::Tensor | Normalized output written in-place (rms_norm) |
| input | torch::Tensor | Normalized result written in-place (fused_add_rms_norm) |
| residual | torch::Tensor | Updated residual (input + residual) written in-place (fused_add_rms_norm) |
Usage Examples
// Standard RMS normalization
const int64_t num_tokens = 16;    // example dimensions
const int64_t hidden_size = 4096;
torch::Tensor input = torch::randn({num_tokens, hidden_size});
torch::Tensor weight = torch::ones({hidden_size});
torch::Tensor output = torch::empty_like(input);
rms_norm(output, input, weight, /*epsilon=*/1e-6);
// Fused add + RMS normalization (saves memory bandwidth)
torch::Tensor hidden_states = torch::randn({num_tokens, hidden_size});
torch::Tensor residual = torch::randn({num_tokens, hidden_size});
torch::Tensor norm_weight = torch::ones({hidden_size});
fused_add_rms_norm(hidden_states, residual, norm_weight, 1e-6);
// After call: residual = old_hidden_states + old_residual
// hidden_states = RMSNorm(residual) * weight
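The in-place contract shown in the comments above can be modeled in scalar form (an illustrative sketch of the semantics, not the SIMD implementation; names and layout are assumptions):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative scalar model of fused_add_rms_norm's in-place contract:
//   residual[t][i] += input[t][i]              (residual updated first)
//   input[t][i]     = RMSNorm(residual[t]) * weight[i]
// Fusing both steps means each row is read once instead of being written
// out and re-read between the add and the normalization.
void fused_add_rms_norm_ref(std::vector<float>& input,
                            std::vector<float>& residual,
                            const std::vector<float>& weight, double epsilon,
                            std::size_t num_tokens, std::size_t hidden_size) {
  for (std::size_t t = 0; t < num_tokens; ++t) {
    float* x = input.data() + t * hidden_size;
    float* r = residual.data() + t * hidden_size;
    double sum_sq = 0.0;
    for (std::size_t i = 0; i < hidden_size; ++i) {
      r[i] += x[i];  // residual = input + residual
      sum_sq += static_cast<double>(r[i]) * static_cast<double>(r[i]);
    }
    const float inv_rms =
        1.0f / std::sqrt(static_cast<float>(sum_sq / hidden_size + epsilon));
    for (std::size_t i = 0; i < hidden_size; ++i)
      x[i] = r[i] * inv_rms * weight[i];  // input = RMSNorm(residual) * weight
  }
}
```

This mirrors why the fused variant takes no separate output tensor: both of its tensor arguments are overwritten, with the updated residual available for the next decoder layer's skip connection.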