Principle:LLMBook zh LLMBook zh github io RMS Normalization

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, Model_Architecture
Last Updated 2026-02-08 04:29 GMT

Overview

Normalization technique that stabilizes hidden states by dividing by their root mean square, omitting the mean-centering step of standard Layer Normalization.

Description

RMS Normalization (RMSNorm) is a simplification of Layer Normalization that removes the mean-centering operation. Instead of computing both mean and variance, RMSNorm computes only the root mean square of the hidden states and rescales by it. Skipping the mean subtraction makes the normalization simpler and cheaper to compute while maintaining comparable model quality. RMSNorm is the normalization method used in LLaMA and other modern LLM architectures in place of standard LayerNorm.
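The difference from LayerNorm can be illustrated with a minimal sketch in plain Python. The function names `layer_norm` and `rms_norm` are hypothetical helpers written for this example, and both omit the learnable parameters so that only the normalization step is compared. On zero-mean input the two agree, because the variance then equals the mean square; once a constant offset is added, only LayerNorm removes it.

```python
import math

def layer_norm(x, eps=1e-6):
    """LayerNorm without learnable scale/shift: center, then divide by the std."""
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-6):
    """RMSNorm without learnable scale: divide by the root mean square, no centering."""
    d = len(x)
    mean_sq = sum(v * v for v in x) / d
    return [v / math.sqrt(mean_sq + eps) for v in x]

def all_close(a, b, tol=1e-6):
    return all(math.isclose(p, q, abs_tol=tol) for p, q in zip(a, b))

zero_mean = [-1.5, -0.5, 0.5, 1.5]          # illustrative input with mean 0
shifted = [v + 10.0 for v in zero_mean]     # same values plus a constant offset

# With zero mean, variance == mean square, so both normalizations coincide.
same_on_zero_mean = all_close(layer_norm(zero_mean), rms_norm(zero_mean))

# LayerNorm is invariant to the offset; RMSNorm is not.
ln_shift_invariant = all_close(layer_norm(shifted), layer_norm(zero_mean))
rms_shift_invariant = all_close(rms_norm(shifted), rms_norm(zero_mean))

print(same_on_zero_mean, ln_shift_invariant, rms_shift_invariant)
```

The saving comes from the work `rms_norm` does not do: it never computes the mean and never subtracts it from each element.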

Usage

Use this principle when building or understanding Transformer decoder architectures that follow the LLaMA design pattern. RMSNorm is applied before self-attention and before the feed-forward network in each decoder layer (Pre-Norm architecture). It is the standard normalization choice for modern large language models including LLaMA, Mistral, and Qwen.
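The Pre-Norm placement can be sketched as follows. This is an illustration of where the normalization sits, not any model's actual code: `pre_norm_decoder_layer`, `toy_attn`, and `toy_ffn` are hypothetical names, and the two toy sublayers are simple stand-ins for real self-attention and feed-forward modules.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm over one hidden-state vector with a learnable scale gamma."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * inv_rms for g, v in zip(gamma, x)]

def pre_norm_decoder_layer(x, attn, ffn, gamma_attn, gamma_ffn):
    """Pre-Norm: normalize the sublayer input, then add the un-normalized residual."""
    h = [a + b for a, b in zip(x, attn(rms_norm(x, gamma_attn)))]
    return [a + b for a, b in zip(h, ffn(rms_norm(h, gamma_ffn)))]

# Hypothetical stand-ins; a real layer would use self-attention and an MLP here.
def toy_attn(v):
    return [0.5 * t for t in v]

def toy_ffn(v):
    return [max(0.0, t) for t in v]

x = [3.0, -4.0]
ones = [1.0, 1.0]  # gamma initialized to ones
out = pre_norm_decoder_layer(x, toy_attn, toy_ffn, ones, ones)
print(out)
```

Each sublayer has its own scale vector, and the residual path carries the un-normalized hidden state, so the normalization only affects what the sublayer sees.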

Theoretical Basis

The RMSNorm operation is defined as:

$$\mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \odot \gamma$$

Where:

  • x is the input hidden state vector of dimension d
  • ϵ is a small constant for numerical stability (default $10^{-6}$)
  • γ is a learnable scale parameter (initialized to ones)
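As a worked example of the definition above, take an illustrative input $x = (3, 4)$ with γ left at its initial value of ones. The numbers are chosen only for easy arithmetic; ϵ is small enough not to change the rounded results.

```latex
\begin{aligned}
x &= (3,\ 4), \quad d = 2, \quad \gamma = (1,\ 1) \\
\frac{1}{d}\sum_{i=1}^{d} x_i^2 &= \frac{9 + 16}{2} = 12.5 \\
\sqrt{12.5 + \epsilon} &\approx 3.5355 \\
\mathrm{RMSNorm}(x) &\approx \left(\frac{3}{3.5355},\ \frac{4}{3.5355}\right) \approx (0.8485,\ 1.1314)
\end{aligned}
```

The output has mean square $(0.72 + 1.28)/2 = 1$, so its root mean square is approximately 1, while the direction of $x$ is unchanged.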

Pseudo-code Logic:

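# Abstract algorithm description (NOT real implementation)
# "variance" is the mean of squares (no mean subtraction), i.e. the squared RMS
variance = mean(x ** 2, dim=-1, keepdim=True)  # keep the last dim so it broadcasts against x
x_normalized = x * rsqrt(variance + eps)       # divide by sqrt(mean square + eps)
output = weight * x_normalized                 # weight is the learnable scale γ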
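A runnable counterpart to the pseudo-code is sketched below in plain Python, assuming a batch represented as a list of hidden-state vectors and normalizing over the last dimension. It is not taken from any library; a real implementation would use tensor operations, and the input values are illustrative.

```python
import math

def rms_norm(rows, weight, eps=1e-6):
    """Apply RMSNorm over the last dimension of a list of hidden-state vectors."""
    out = []
    for x in rows:
        variance = sum(v * v for v in x) / len(x)   # mean of squares, as in the pseudo-code
        inv_rms = 1.0 / math.sqrt(variance + eps)   # rsqrt(variance + eps)
        out.append([w * v * inv_rms for w, v in zip(weight, x)])
    return out

hidden = [[3.0, 4.0], [30.0, 40.0], [-1.0, 1.0]]  # three token vectors, d = 2
weight = [1.0, 1.0]                                # gamma initialized to ones
normed = rms_norm(hidden, weight)

for row in normed:
    print([round(v, 4) for v in row])
```

The first two rows differ by a factor of 10 but normalize to nearly the same vector, since rescaling the input changes the RMS by the same factor. The small remaining difference comes from `eps`.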
