Principle:LLMBook zh LLMBook zh github io RMS Normalization
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Model_Architecture |
| Last Updated | 2026-02-08 04:29 GMT |
Overview
Normalization technique that stabilizes hidden states by dividing by their root mean square, omitting the mean-centering step of standard Layer Normalization.
Description
RMS Normalization (RMSNorm) is a simplification of Layer Normalization that removes the mean-centering operation. Instead of computing both the mean and the variance, RMSNorm computes only the root mean square of the hidden states and rescales by it. Skipping the mean subtraction makes the normalization simpler and cheaper to compute while maintaining comparable model quality. LLaMA and other modern LLM architectures use RMSNorm in place of the standard LayerNorm.
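The difference between the two normalizations can be seen on a small vector. The following is a minimal pure-Python sketch, not a reference implementation: the function names `layer_norm` and `rms_norm`, the `eps` value, and the sample input are illustrative assumptions, and the learnable parameters are left out to isolate the effect of mean-centering.

```python
import math


def layer_norm(x, eps=1e-6):
    """LayerNorm without learnable parameters: center, then scale."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-6):
    """RMSNorm without the learnable scale: scale only, no centering."""
    mean_square = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_square + eps) for v in x]


x = [1.0, 2.0, 3.0, 4.0]  # illustrative input with a non-zero mean
ln = layer_norm(x)
rn = rms_norm(x)

ln_mean = sum(ln) / len(ln)                          # close to 0: the mean is removed
rn_mean = sum(rn) / len(rn)                          # not 0: the mean is kept
rn_rms = math.sqrt(sum(v * v for v in rn) / len(rn)) # close to 1: unit root mean square

print(ln_mean, rn_mean, rn_rms)
```

RMSNorm needs one pass to accumulate the mean of squares, whereas LayerNorm also needs the mean before it can compute the variance; this is where the reduced overhead comes from.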
Usage
Use this principle when building or understanding Transformer decoder architectures that follow the LLaMA design pattern. In each decoder layer, RMSNorm is applied to the input of the self-attention block and to the input of the feed-forward network (Pre-Norm architecture). It is a common normalization choice in modern large language models, including LLaMA, Mistral, and Qwen.
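The Pre-Norm ordering can be sketched as follows. This is a simplified pure-Python illustration under stated assumptions: `pre_norm_decoder_layer`, `toy_attn`, and `toy_ffn` are hypothetical names, the sublayers are stand-in functions rather than real attention or feed-forward networks, and the hidden state is a single plain list instead of a batched tensor.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x to unit root mean square, then apply the per-element weight."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * inv_rms for w, v in zip(weight, x)]


def pre_norm_decoder_layer(x, attn, ffn, attn_norm_weight, ffn_norm_weight):
    """Pre-Norm: normalize the sublayer input, then add the un-normalized residual."""
    h = [a + b for a, b in zip(x, attn(rms_norm(x, attn_norm_weight)))]
    return [a + b for a, b in zip(h, ffn(rms_norm(h, ffn_norm_weight)))]


# Hypothetical stand-ins for the real sublayers, used only to make the sketch runnable.
def toy_attn(v):
    return [0.5 * t for t in v]


def toy_ffn(v):
    return [t * t for t in v]


x = [3.0, 4.0]
ones = [1.0, 1.0]  # scale weights at their initial value
out = pre_norm_decoder_layer(x, toy_attn, toy_ffn, ones, ones)
print(out)
```

Note that the residual path carries `x` and `h` unchanged; only the branch that feeds each sublayer is normalized.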
Theoretical Basis
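The RMSNorm operation is defined as:

$$\mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \odot \gamma$$

Where:
- $x$ is the input hidden state vector of dimension $d$
- $\epsilon$ is a small constant for numerical stability (with an implementation-specific default)
- $\gamma$ is a learnable scale parameter (initialized to ones; `weight` in the pseudo-code)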
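One way to see what omitting the mean-centering step changes is to relate the mean of squares to the ordinary mean and variance. For a hidden state vector $x$ with $d$ entries, and writing $\mu$ and $\sigma^2$ for the mean and variance of its entries (symbols introduced here for illustration only), the following identity holds:

```latex
\frac{1}{d}\sum_{i=1}^{d} x_i^2 = \sigma^2 + \mu^2,
\qquad
\mu = \frac{1}{d}\sum_{i=1}^{d} x_i,
\qquad
\sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2
```

When $\mu = 0$, the root mean square equals the standard deviation, so RMSNorm produces the same output as a Layer Normalization that has no bias term. When $\mu \neq 0$, the root mean square is larger than the standard deviation and the output keeps its non-zero mean.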
Pseudo-code Logic:
# Abstract algorithm description (NOT real implementation)
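mean_square = mean(x ** 2, dim=-1, keepdim=True)  # mean of squares over the hidden dimension; no mean is subtracted
x_normalized = x * rsqrt(mean_square + eps)       # keepdim=True lets the factor broadcast against x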
output = weight * x_normalized
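
The pseudo-code above maps fairly directly onto a PyTorch module. The following is a minimal sketch, assuming PyTorch is installed; the class name `RMSNorm`, the constructor arguments, and the `eps=1e-6` default are illustrative choices rather than values taken from the source.

```python
import torch
from torch import nn


class RMSNorm(nn.Module):
    """Sketch of RMSNorm: rescale by the root mean square, then apply a learned scale."""

    def __init__(self, dim: int, eps: float = 1e-6):  # eps default is an assumption
        super().__init__()
        self.eps = eps
        # Learnable per-feature scale, initialized to ones.
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Mean of squares over the last (hidden) dimension; keepdim allows broadcasting.
        mean_square = x.pow(2).mean(dim=-1, keepdim=True)
        x_normalized = x * torch.rsqrt(mean_square + self.eps)
        return self.weight * x_normalized


norm = RMSNorm(dim=8)
x = torch.randn(2, 3, 8)  # (batch, sequence, hidden)
y = norm(x)               # same shape as x
```

With the weight still at its initial value of ones, each hidden vector in `y` has a root mean square close to 1. Some implementations compute the normalization in float32 and cast back to the input dtype for half-precision inputs; this sketch leaves that out.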