Principle:LaurentMazare Tch rs Layer Normalization

From Leeroopedia


Knowledge Sources
Domains Deep Learning, Normalization, Natural Language Processing
Last Updated 2026-02-08 00:00 GMT

Overview

Layer normalization computes normalization statistics across all features within a single sample, providing batch-independent normalization that has become the standard in transformer architectures.

Description

Layer normalization (LN) normalizes activations across the entire feature dimension of each individual sample, computing mean and variance over all hidden units within a layer. This contrasts with batch normalization, which computes statistics across the batch dimension for each feature.

The fundamental advantage of layer normalization is its complete independence from batch statistics. Each sample is normalized using only its own activations, which means:

  • The normalization behaves identically during training and inference
  • There are no running statistics to maintain
  • It works correctly with batch size of 1
  • It suits variable-length sequences, since one sample's statistics never depend on the length or content of another sample

Layer normalization has become the de facto standard in transformer architectures, where it is applied before or after each sub-layer (self-attention and feed-forward). In the "pre-norm" variant, normalization is applied before the attention/feed-forward computation, while in the "post-norm" variant, it is applied after the residual addition.

After normalization, learnable scale (γ) and shift (β) parameters, one per element of the normalized shape, let the network rescale and recenter the normalized activations. The normalized shape parameter specifies which trailing dimensions of the input are normalized over.

Usage

Apply layer normalization when:

  • Building transformer or attention-based architectures
  • Working with recurrent networks processing variable-length sequences
  • Batch statistics are unavailable or unreliable (batch size 1, online learning)
  • Consistent behavior between training and inference is required

Theoretical Basis

Normalization Computation

For an input x of shape (N,D) where N is batch size and D is the feature dimension, layer normalization computes for each sample n:

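$$\mu_n = \frac{1}{D} \sum_{d=1}^{D} x_{n,d}$$

$$\sigma_n^2 = \frac{1}{D} \sum_{d=1}^{D} (x_{n,d} - \mu_n)^2$$

$$\hat{x}_{n,d} = \gamma_d \, \frac{x_{n,d} - \mu_n}{\sqrt{\sigma_n^2 + \epsilon}} + \beta_d$$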

where $\gamma_d$ and $\beta_d$ are learnable per-feature parameters and $\epsilon$ is a small constant for numerical stability.
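The computation above can be sketched in plain Rust using only the standard library. This is a minimal illustration, not the tch-rs implementation: the function names `layer_norm_sample` and `layer_norm` are hypothetical, and the batch is assumed to be a slice of equal-length rows.

```rust
/// Layer-normalize one sample (a slice of D features).
/// `gamma` and `beta` are the per-feature scale and shift; `eps` guards the division.
/// Hypothetical helper for illustration only.
fn layer_norm_sample(x: &[f64], gamma: &[f64], beta: &[f64], eps: f64) -> Vec<f64> {
    assert!(!x.is_empty());
    assert_eq!(x.len(), gamma.len());
    assert_eq!(x.len(), beta.len());
    let d = x.len() as f64;
    // Mean over the D features of this sample.
    let mean = x.iter().sum::<f64>() / d;
    // Biased variance: divide by D, not D - 1.
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / d;
    let inv_std = 1.0 / (var + eps).sqrt();
    // Normalize, then apply the per-feature scale and shift.
    x.iter()
        .zip(gamma.iter().zip(beta.iter()))
        .map(|(v, (g, b))| g * (v - mean) * inv_std + b)
        .collect()
}

/// Apply layer normalization to each row of an (N, D) batch independently.
fn layer_norm(batch: &[Vec<f64>], gamma: &[f64], beta: &[f64], eps: f64) -> Vec<Vec<f64>> {
    batch
        .iter()
        .map(|row| layer_norm_sample(row, gamma, beta, eps))
        .collect()
}
```

With γ set to ones and β to zeros, each output row has mean close to zero and variance close to one; the variance falls slightly short of one because ϵ is added under the square root.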

Multi-Dimensional Case

For higher-dimensional inputs (e.g., shape (N,S,D) in sequence models), the normalized shape specifies which trailing dimensions are normalized. If the normalized shape is (D,), then normalization is computed over the last dimension independently for each position in the sequence:

$$\mu_{n,s} = \frac{1}{D} \sum_{d=1}^{D} x_{n,s,d}$$
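A sketch of the multi-dimensional case, assuming the tensor is stored as a flat row-major buffer so that each position (n, s) occupies a contiguous run of D values. The function name `layer_norm_last_dim` and this storage layout are assumptions made for illustration.

```rust
/// Normalize a row-major tensor over its last dimension of size `d`.
/// Leading dimensions (e.g. N and S) are treated as flattened: every contiguous
/// chunk of `d` values is one position with its own mean and variance.
/// Hypothetical helper for illustration only.
fn layer_norm_last_dim(
    data: &[f64],
    d: usize,
    gamma: &[f64],
    beta: &[f64],
    eps: f64,
) -> Vec<f64> {
    assert!(d > 0 && data.len() % d == 0);
    assert!(gamma.len() == d && beta.len() == d);
    let mut out = Vec::with_capacity(data.len());
    for chunk in data.chunks_exact(d) {
        // Statistics are computed per chunk, i.e. per (n, s) position.
        let mean = chunk.iter().sum::<f64>() / d as f64;
        let var = chunk.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / d as f64;
        let inv_std = 1.0 / (var + eps).sqrt();
        for (i, v) in chunk.iter().enumerate() {
            out.push(gamma[i] * (v - mean) * inv_std + beta[i]);
        }
    }
    out
}
```

Because γ and β have length D, the same parameters are shared by every position in the sequence, while the mean and variance differ from position to position.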

Comparison with Batch Normalization

| Property | Batch Normalization | Layer Normalization |
|---|---|---|
| Statistics computed over | Batch dimension | Feature dimensions |
| Depends on batch size | Yes | No |
| Running statistics needed | Yes | No |
| Training/inference difference | Yes | No |
| Primary use case | CNNs | Transformers, RNNs |
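The batch-dependence difference can be illustrated with a small sketch. Both functions below omit the affine parameters, and `batch_norm_train` models only training-mode batch normalization with no running averages; the names are hypothetical and the code is a simplification for comparison, not a library implementation.

```rust
/// Layer norm without affine parameters: statistics come from the sample itself.
fn layer_norm_row(x: &[f64], eps: f64) -> Vec<f64> {
    let d = x.len() as f64;
    let mean = x.iter().sum::<f64>() / d;
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / d;
    x.iter().map(|v| (v - mean) / (var + eps).sqrt()).collect()
}

/// Training-mode batch norm without affine parameters or running averages:
/// the statistics for each feature come from all samples in the batch.
fn batch_norm_train(batch: &[Vec<f64>], eps: f64) -> Vec<Vec<f64>> {
    assert!(!batch.is_empty());
    let n = batch.len() as f64;
    let d = batch[0].len();
    let mut out = vec![vec![0.0; d]; batch.len()];
    for j in 0..d {
        // Mean and variance of feature j across the batch.
        let mean = batch.iter().map(|row| row[j]).sum::<f64>() / n;
        let var = batch.iter().map(|row| (row[j] - mean).powi(2)).sum::<f64>() / n;
        for (i, row) in batch.iter().enumerate() {
            out[i][j] = (row[j] - mean) / (var + eps).sqrt();
        }
    }
    out
}
```

`layer_norm_row` takes a single sample, so its output cannot change with the rest of the batch. Calling `batch_norm_train` on two batches that share the same first sample but differ in the second generally gives different outputs for that shared sample.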

Pre-Norm vs Post-Norm

In transformer architectures, layer normalization placement affects training dynamics:

Post-norm: output = LayerNorm(x + SubLayer(x))

Pre-norm: output = x + SubLayer(LayerNorm(x))

Pre-norm tends to be more stable during training and often does not require learning rate warmup, while post-norm may achieve slightly better final performance with careful tuning.
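The two placements can be sketched as generic residual blocks. The sub-layer is passed in as a closure standing in for self-attention or a feed-forward network; the function names are hypothetical, and the layer norm here omits γ and β for brevity.

```rust
/// Layer norm without affine parameters, shared by both block variants.
fn layer_norm(x: &[f64], eps: f64) -> Vec<f64> {
    let d = x.len() as f64;
    let mean = x.iter().sum::<f64>() / d;
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / d;
    x.iter().map(|v| (v - mean) / (var + eps).sqrt()).collect()
}

/// Post-norm: output = LayerNorm(x + SubLayer(x)).
fn post_norm_block<F: Fn(&[f64]) -> Vec<f64>>(x: &[f64], sublayer: F, eps: f64) -> Vec<f64> {
    let sub = sublayer(x);
    let residual: Vec<f64> = x.iter().zip(sub.iter()).map(|(a, b)| a + b).collect();
    layer_norm(&residual, eps)
}

/// Pre-norm: output = x + SubLayer(LayerNorm(x)).
fn pre_norm_block<F: Fn(&[f64]) -> Vec<f64>>(x: &[f64], sublayer: F, eps: f64) -> Vec<f64> {
    let sub = sublayer(&layer_norm(x, eps));
    x.iter().zip(sub.iter()).map(|(a, b)| a + b).collect()
}
```

In the pre-norm block the input x reaches the output through the residual path without being normalized, whereas the post-norm block normalizes the sum, so its output always has near-zero mean across features.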
