Principle:LaurentMazare Tch rs Layer Normalization
| Knowledge Sources | |
|---|---|
| Domains | Deep Learning, Normalization, Natural Language Processing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Layer normalization computes normalization statistics across all features within a single sample, providing batch-independent normalization that has become the standard in transformer architectures.
Description
Layer normalization (LN) normalizes activations across the entire feature dimension of each individual sample, computing mean and variance over all hidden units within a layer. This contrasts with batch normalization, which computes statistics across the batch dimension for each feature.
The fundamental advantage of layer normalization is its complete independence from batch statistics. Each sample is normalized using only its own activations, which means:
- The normalization behaves identically during training and inference
- There are no running statistics to maintain
- It works correctly with batch size of 1
- It is naturally suited to variable-length sequences, since no statistics are shared across samples of different lengths
Layer normalization has become the de facto standard in transformer architectures, where it is applied before or after each sub-layer (self-attention and feed-forward). In the "pre-norm" variant, normalization is applied before the attention/feed-forward computation, while in the "post-norm" variant, it is applied after the residual addition.
After normalization, learnable scale ($\gamma$) and shift ($\beta$) parameters (one per feature dimension) let the network rescale and re-center the normalized values. The normalized shape parameter specifies which trailing dimensions of the input are normalized over.
Usage
Apply layer normalization when:
- Building transformer or attention-based architectures
- Working with recurrent networks processing variable-length sequences
- Batch statistics are unavailable or unreliable (batch size 1, online learning)
- Consistent behavior between training and inference is required
Theoretical Basis
Normalization Computation
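For an input $x$ of shape $(N, D)$, where $N$ is the batch size and $D$ is the feature dimension, layer normalization computes for each sample $n$:
$$\mu_n = \frac{1}{D} \sum_{d=1}^{D} x_{n,d}$$
$$\sigma_n^2 = \frac{1}{D} \sum_{d=1}^{D} (x_{n,d} - \mu_n)^2$$
$$y_{n,d} = \gamma_d \, \frac{x_{n,d} - \mu_n}{\sqrt{\sigma_n^2 + \epsilon}} + \beta_d$$
where $\gamma$ and $\beta$ are learnable per-feature parameters and $\epsilon$ is a small constant for numerical stability.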
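The computation above can be sketched in plain Rust without any tensor library. This is a minimal illustration, not the tch-rs implementation: the function names `layer_norm_row` and `layer_norm` are hypothetical, the batch is represented as a slice of rows, and the variance is the biased estimate (divided by the number of features).

```rust
/// Normalizes one sample (a row of D features) and applies the
/// per-feature scale `gamma` and shift `beta`.
fn layer_norm_row(x: &[f64], gamma: &[f64], beta: &[f64], eps: f64) -> Vec<f64> {
    assert_eq!(x.len(), gamma.len());
    assert_eq!(x.len(), beta.len());
    let d = x.len() as f64;
    // Mean over the features of this single sample.
    let mean = x.iter().sum::<f64>() / d;
    // Biased variance: divide by D, not D - 1.
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / d;
    let inv_std = 1.0 / (var + eps).sqrt();
    x.iter()
        .zip(gamma.iter().zip(beta))
        .map(|(v, (g, b))| g * (v - mean) * inv_std + b)
        .collect()
}

/// Applies `layer_norm_row` to every sample independently.
/// No statistic is shared between rows.
fn layer_norm(batch: &[Vec<f64>], gamma: &[f64], beta: &[f64], eps: f64) -> Vec<Vec<f64>> {
    batch
        .iter()
        .map(|row| layer_norm_row(row, gamma, beta, eps))
        .collect()
}
```

With `gamma` set to all ones and `beta` to all zeros, each output row has a mean of approximately zero and a variance slightly below one (the small gap comes from `eps`).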
Multi-Dimensional Case
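For higher-dimensional inputs (e.g., shape $(N, T, D)$ in sequence models, where $T$ is the sequence length), the normalized shape specifies which trailing dimensions are normalized. If the normalized shape is $(D)$, then normalization is computed over the last dimension independently for each position $t$ in the sequence:
$$y_{n,t,d} = \gamma_d \, \frac{x_{n,t,d} - \mu_{n,t}}{\sqrt{\sigma_{n,t}^2 + \epsilon}} + \beta_d, \qquad \mu_{n,t} = \frac{1}{D} \sum_{d=1}^{D} x_{n,t,d}, \qquad \sigma_{n,t}^2 = \frac{1}{D} \sum_{d=1}^{D} (x_{n,t,d} - \mu_{n,t})^2$$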
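As a rough sketch of the multi-dimensional case, the following plain Rust function treats the input as a flat, row-major buffer whose last dimension has size `d` and normalizes every contiguous run of `d` values on its own. The name `layer_norm_last_dim` and the flat-buffer layout are assumptions made for illustration; a tensor library would handle shapes and strides itself.

```rust
/// Layer normalization over the last dimension of a row-major buffer.
/// For an input of logical shape (N, T, D), `x.len()` is N * T * D and
/// each chunk of `d` values is one (sample, position) pair.
fn layer_norm_last_dim(x: &[f64], d: usize, gamma: &[f64], beta: &[f64], eps: f64) -> Vec<f64> {
    assert!(d > 0 && x.len() % d == 0);
    assert!(gamma.len() == d && beta.len() == d);
    let mut out = Vec::with_capacity(x.len());
    for slice in x.chunks_exact(d) {
        // Statistics are computed per chunk, so positions never mix.
        let mean = slice.iter().sum::<f64>() / d as f64;
        let var = slice.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / d as f64;
        let inv_std = 1.0 / (var + eps).sqrt();
        for (i, v) in slice.iter().enumerate() {
            out.push(gamma[i] * (v - mean) * inv_std + beta[i]);
        }
    }
    out
}
```

Because each position is normalized separately, two positions whose feature vectors differ only by a positive scale factor produce nearly identical outputs (up to the effect of `eps`).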
Comparison with Batch Normalization
| Property | Batch Normalization | Layer Normalization |
|---|---|---|
| Statistics computed over | Batch dimension | Feature dimensions |
| Depends on batch size | Yes | No |
| Running statistics needed | Yes | No |
| Training/inference difference | Yes | No |
| Primary use case | CNNs | Transformers, RNNs |
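The difference in the normalization axis can be made concrete with a small plain-Rust sketch. Both functions below are hypothetical helpers written for this comparison: they omit the learnable scale and shift, and `batch_norm_cols` uses training-mode batch statistics only (no running averages).

```rust
/// Layer normalization: statistics per row (per sample), over features.
fn layer_norm_rows(batch: &[Vec<f64>], eps: f64) -> Vec<Vec<f64>> {
    batch
        .iter()
        .map(|row| {
            let d = row.len() as f64;
            let mean = row.iter().sum::<f64>() / d;
            let var = row.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / d;
            let inv_std = 1.0 / (var + eps).sqrt();
            row.iter().map(|v| (v - mean) * inv_std).collect()
        })
        .collect()
}

/// Batch normalization (training mode): statistics per column (per feature),
/// over the samples in the batch.
fn batch_norm_cols(batch: &[Vec<f64>], eps: f64) -> Vec<Vec<f64>> {
    let n = batch.len() as f64;
    let d = batch.first().map_or(0, |row| row.len());
    let mut out = vec![vec![0.0; d]; batch.len()];
    for j in 0..d {
        let mean = batch.iter().map(|row| row[j]).sum::<f64>() / n;
        let var = batch.iter().map(|row| (row[j] - mean).powi(2)).sum::<f64>() / n;
        let inv_std = 1.0 / (var + eps).sqrt();
        for (i, row) in batch.iter().enumerate() {
            out[i][j] = (row[j] - mean) * inv_std;
        }
    }
    out
}
```

If the same sample is placed in two batches with different companions, its `layer_norm_rows` output is unchanged, while its `batch_norm_cols` output depends on the other rows.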
Pre-Norm vs Post-Norm
In transformer architectures, layer normalization placement affects training dynamics:
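Post-norm: $x_{l+1} = \mathrm{LN}(x_l + \mathrm{Sublayer}(x_l))$
Pre-norm: $x_{l+1} = x_l + \mathrm{Sublayer}(\mathrm{LN}(x_l))$
Here $x_l$ is the input to sub-layer $l$, $\mathrm{Sublayer}$ is the self-attention or feed-forward computation, and $\mathrm{LN}$ is layer normalization.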
Pre-norm tends to be more stable during training and often does not require learning rate warmup, while post-norm may achieve slightly better final performance with careful tuning.
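The two placements can be sketched as residual blocks in plain Rust. This is an illustrative sketch under simplifying assumptions: `normalize` is layer normalization without the learnable scale and shift, the sub-layer is any function passed in by the caller, and the names `post_norm_block` and `pre_norm_block` are hypothetical.

```rust
/// Layer normalization of a single vector, without scale and shift.
fn normalize(x: &[f64], eps: f64) -> Vec<f64> {
    let d = x.len() as f64;
    let mean = x.iter().sum::<f64>() / d;
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / d;
    let inv_std = 1.0 / (var + eps).sqrt();
    x.iter().map(|v| (v - mean) * inv_std).collect()
}

/// Post-norm: normalize after the residual addition.
fn post_norm_block<F: Fn(&[f64]) -> Vec<f64>>(x: &[f64], sublayer: F, eps: f64) -> Vec<f64> {
    let summed: Vec<f64> = x.iter().zip(sublayer(x)).map(|(a, b)| a + b).collect();
    normalize(&summed, eps)
}

/// Pre-norm: normalize only the sub-layer input; the residual path
/// carries `x` through unchanged.
fn pre_norm_block<F: Fn(&[f64]) -> Vec<f64>>(x: &[f64], sublayer: F, eps: f64) -> Vec<f64> {
    let normed = normalize(x, eps);
    x.iter().zip(sublayer(&normed)).map(|(a, b)| a + b).collect()
}
```

In the post-norm block the output is always normalized, whereas in the pre-norm block the residual stream keeps the original scale of `x`, which is one commonly cited reason gradients pass through deep pre-norm stacks more easily.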