Principle:LaurentMazare Tch rs Layer Normalization

Knowledge Sources	LaurentMazare_Tch_rs; Ba et al., 2016
Domains	Deep Learning, Normalization, Natural Language Processing
Last Updated	2026-02-08 00:00 GMT

Overview

Layer normalization computes normalization statistics across all features within a single sample, providing batch-independent normalization that has become the standard in transformer architectures.

Description

Layer normalization (LN) normalizes activations across the entire feature dimension of each individual sample, computing mean and variance over all hidden units within a layer. This contrasts with batch normalization, which computes statistics across the batch dimension for each feature.

The fundamental advantage of layer normalization is its complete independence from batch statistics. Each sample is normalized using only its own activations, which means:

The normalization behaves identically during training and inference
There are no running statistics to maintain
It works correctly with a batch size of 1
It is naturally suited to variable-length sequences where different samples may have different lengths

Layer normalization has become the de facto standard in transformer architectures, where it is applied before or after each sub-layer (self-attention and feed-forward). In the "pre-norm" variant, normalization is applied before the attention/feed-forward computation, while in the "post-norm" variant, it is applied after the residual addition.

After the normalized values are computed, learnable scale ($γ$) and shift ($β$) parameters, one per feature dimension, let the network rescale and recenter the normalized activations. The normalized shape parameter specifies which trailing dimensions of the input are normalized over.

Usage

Apply layer normalization when:

Building transformer or attention-based architectures
Working with recurrent networks processing variable-length sequences
Batch statistics are unavailable or unreliable (batch size 1, online learning)
Consistent behavior between training and inference is required

Theoretical Basis

Normalization Computation

For an input $x$ of shape $(N, D)$, where $N$ is the batch size and $D$ is the feature dimension, layer normalization computes for each sample $n$:

$μ_{n} = \frac{1}{D} \sum_{d = 1}^{D} x_{n, d}$

$σ_{n}^{2} = \frac{1}{D} \sum_{d = 1}^{D} (x_{n, d} - μ_{n})^{2}$

${\hat{x}}_{n, d} = γ_{d} \cdot \frac{x_{n, d} - μ_{n}}{\sqrt{σ_{n}^{2} + ϵ}} + β_{d}$

where $γ_{d}$ and $β_{d}$ are learnable per-feature parameters and $ϵ$ is a small constant for numerical stability.
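The three formulas above can be sketched in plain Rust as follows. This is a minimal illustration using only the standard library, not the tch-rs implementation; the function names `layer_norm` and `layer_norm_batch` and the use of `Vec<f64>` rows are assumptions made for the example.

```rust
/// Layer-normalize one sample (a slice of D features).
/// `gamma` and `beta` are the per-feature scale and shift; `eps` guards the division.
/// Hypothetical helper for illustration only.
fn layer_norm(x: &[f64], gamma: &[f64], beta: &[f64], eps: f64) -> Vec<f64> {
    assert_eq!(x.len(), gamma.len());
    assert_eq!(x.len(), beta.len());
    let d = x.len() as f64;
    // Mean over the feature dimension of this sample.
    let mean = x.iter().sum::<f64>() / d;
    // Biased variance (divide by D, not D - 1), matching the formula above.
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / d;
    let inv_std = 1.0 / (var + eps).sqrt();
    // Normalize, then apply the per-feature scale and shift.
    x.iter()
        .zip(gamma.iter().zip(beta.iter()))
        .map(|(v, (g, b))| g * (v - mean) * inv_std + b)
        .collect()
}

/// Apply `layer_norm` to every row of an (N, D) batch independently.
fn layer_norm_batch(
    batch: &[Vec<f64>],
    gamma: &[f64],
    beta: &[f64],
    eps: f64,
) -> Vec<Vec<f64>> {
    batch
        .iter()
        .map(|row| layer_norm(row, gamma, beta, eps))
        .collect()
}
```

Because each row is processed on its own, the result for a given sample does not change when other samples are added to or removed from the batch.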

Multi-Dimensional Case

For higher-dimensional inputs (e.g., shape $(N, S, D)$ in sequence models), the normalized shape specifies which trailing dimensions are normalized. If the normalized shape is $(D,)$, then normalization is computed over the last dimension independently for each position in the sequence:

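$μ_{n, s} = \frac{1}{D} \sum_{d = 1}^{D} x_{n, s, d}$

The variance $σ_{n, s}^{2}$ and the normalized output are computed in the same way as in the two-dimensional case, with the index pair $(n, s)$ in place of $n$; the parameters $γ_{d}$ and $β_{d}$ are shared across all positions.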
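As a rough sketch of the per-position case, the function below treats the input as a flat, row-major buffer of shape $(N, S, D)$ and normalizes each contiguous run of $D$ values. The name `layer_norm_last_dim` and the flat-buffer layout are assumptions made for this example; a tensor library would handle the shape bookkeeping itself.

```rust
/// Normalize a row-major buffer of shape (N, S, D) over its last dimension.
/// Each contiguous chunk of `d` values is one (sample, position) pair.
/// Hypothetical helper for illustration only.
fn layer_norm_last_dim(
    x: &[f64],
    d: usize,
    gamma: &[f64],
    beta: &[f64],
    eps: f64,
) -> Vec<f64> {
    assert!(d > 0 && x.len() % d == 0, "buffer length must be a multiple of D");
    assert_eq!(gamma.len(), d);
    assert_eq!(beta.len(), d);
    let mut out = Vec::with_capacity(x.len());
    for chunk in x.chunks(d) {
        // Statistics are computed per (n, s) position, over D features only.
        let mean = chunk.iter().sum::<f64>() / d as f64;
        let var = chunk.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / d as f64;
        let inv_std = 1.0 / (var + eps).sqrt();
        // gamma and beta are indexed by feature and shared across positions.
        for (i, v) in chunk.iter().enumerate() {
            out.push(gamma[i] * (v - mean) * inv_std + beta[i]);
        }
    }
    out
}
```

Note that $N$ and $S$ never appear in the computation: only the size of the normalized trailing dimension matters.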

Comparison with Batch Normalization

Property	Batch Normalization	Layer Normalization
Statistics computed over	Batch dimension	Feature dimensions
Depends on batch size	Yes	No
Running statistics needed	Yes	No
Training/inference difference	Yes	No
Primary use case	CNNs	Transformers, RNNs
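The first row of the table can be made concrete with a small sketch that computes only the means along each axis of an $(N, D)$ batch. The function names `batch_axis_means` and `feature_axis_means` are hypothetical, and the code assumes a non-empty, rectangular batch.

```rust
/// Per-feature means over the batch axis: the statistic batch normalization uses.
/// Returns D values, each of which depends on every sample in the batch.
/// Assumes `x` is non-empty and all rows have the same length.
fn batch_axis_means(x: &[Vec<f64>]) -> Vec<f64> {
    let n = x.len() as f64;
    let d = x[0].len();
    (0..d)
        .map(|j| x.iter().map(|row| row[j]).sum::<f64>() / n)
        .collect()
}

/// Per-sample means over the feature axis: the statistic layer normalization uses.
/// Returns N values, each of which depends on one sample only.
fn feature_axis_means(x: &[Vec<f64>]) -> Vec<f64> {
    x.iter()
        .map(|row| row.iter().sum::<f64>() / row.len() as f64)
        .collect()
}
```

Replacing one sample in the batch changes every entry returned by `batch_axis_means`, but leaves the `feature_axis_means` entries of the other samples untouched, which is why layer normalization needs no running statistics.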

Pre-Norm vs Post-Norm

In transformer architectures, layer normalization placement affects training dynamics:

Post-norm: $\text{output} = \text{LayerNorm}(x + \text{SubLayer}(x))$

Pre-norm: $\text{output} = x + \text{SubLayer}(\text{LayerNorm}(x))$

Pre-norm tends to be more stable during training and often does not require learning rate warmup, while post-norm may achieve slightly better final performance with careful tuning.
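The two placements differ only in where the normalization sits relative to the residual addition. The sketch below shows this with a generic sub-layer passed as a closure; `post_norm`, `pre_norm`, and the simplified `layer_norm` (without learnable scale and shift) are hypothetical names chosen for this example, and the sub-layer stands in for self-attention or a feed-forward block.

```rust
/// Plain layer normalization over one vector (no learnable scale or shift,
/// to keep the sketch short).
fn layer_norm(x: &[f64], eps: f64) -> Vec<f64> {
    let d = x.len() as f64;
    let mean = x.iter().sum::<f64>() / d;
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / d;
    let inv_std = 1.0 / (var + eps).sqrt();
    x.iter().map(|v| (v - mean) * inv_std).collect()
}

/// Post-norm: LayerNorm(x + SubLayer(x)).
fn post_norm<F: Fn(&[f64]) -> Vec<f64>>(x: &[f64], sublayer: F, eps: f64) -> Vec<f64> {
    let y = sublayer(x);
    // Residual addition first, normalization last.
    let summed: Vec<f64> = x.iter().zip(&y).map(|(a, b)| a + b).collect();
    layer_norm(&summed, eps)
}

/// Pre-norm: x + SubLayer(LayerNorm(x)).
fn pre_norm<F: Fn(&[f64]) -> Vec<f64>>(x: &[f64], sublayer: F, eps: f64) -> Vec<f64> {
    // Normalization first; the residual path carries x through unchanged.
    let y = sublayer(&layer_norm(x, eps));
    x.iter().zip(&y).map(|(a, b)| a + b).collect()
}
```

In the pre-norm form the output is not itself normalized, since the raw input $x$ is added back after the sub-layer; in the post-norm form every block's output passes through the normalization.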

Related Pages

Implementation:LaurentMazare_Tch_rs_Layer_Norm
