
Principle:NVIDIA TransformerEngine Fused LayerNorm MLP

From Leeroopedia


Overview

Fusing layer normalization and the entire MLP sub-layer (two linear transforms with activation) into a single optimized module.

Description

The Transformer feed-forward network (FFN) consists of a LayerNorm, a first linear projection (up-projection), an activation function, and a second linear projection (down-projection). In a naive implementation, each of these operations is a separate kernel launch with intermediate tensors written to and read from global memory.

LayerNormMLP combines all of these operations into a fused module. This eliminates multiple memory round-trips and enables FP8 quantization of the intermediate MLP tensors, which are typically the largest activations in a Transformer layer.

Key benefits of this fusion:

  • Eliminates intermediate memory traffic between LayerNorm, FC1, activation, and FC2.
  • Enables FP8 quantization of the large intermediate activation tensor (of size ffn_hidden_size, typically 4x the model hidden size).
  • Reduces kernel launch overhead by consolidating multiple operations.
  • Supports gated activations (SwiGLU, GeGLU) natively, where the first projection output is split into gate and value paths.

Theoretical Basis

The mathematical formulation of the fused MLP operation is:

MLP(x) = W2 * activation(W1 * LayerNorm(x) + b1) + b2

Step by step:

  1. LayerNorm: norm = (x - mean) / sqrt(var + eps) * gamma + beta
  2. FC1 (up-projection): h = W1 * norm + b1
  3. Activation: a = activation(h)
  4. FC2 (down-projection): y = W2 * a + b2
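The four steps above can be sketched as an unfused NumPy reference (a minimal illustration of the math, not the fused TE kernel; weight shapes and the tanh GELU approximation are illustrative choices):

```python
import numpy as np

def layernorm_mlp(x, gamma, beta, W1, b1, W2, b2, eps=1e-5):
    """Unfused reference for the four steps: LayerNorm -> FC1 -> GELU -> FC2."""
    # 1. LayerNorm over the feature axis
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    norm = (x - mean) / np.sqrt(var + eps) * gamma + beta
    # 2. FC1 (up-projection): hidden_size -> ffn_hidden_size
    h = norm @ W1 + b1
    # 3. Activation (GELU, tanh approximation)
    a = 0.5 * h * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (h + 0.044715 * h**3)))
    # 4. FC2 (down-projection): ffn_hidden_size -> hidden_size
    return a @ W2 + b2

rng = np.random.default_rng(0)
hidden, ffn = 8, 32                      # toy sizes; real models use e.g. 4x ratio
x = rng.standard_normal((4, hidden))
y = layernorm_mlp(
    x,
    gamma=np.ones(hidden), beta=np.zeros(hidden),
    W1=rng.standard_normal((hidden, ffn)), b1=np.zeros(ffn),
    W2=rng.standard_normal((ffn, hidden)), b2=np.zeros(hidden),
)
print(y.shape)  # (4, 8): the MLP maps back to hidden_size
```

In the fused module, the intermediate tensors norm, h, and a never round-trip through global memory; here they are materialized only to make each step visible.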

Supported activation functions:

Activation Functions

  Activation   Formula                         Gated?
  gelu         GELU(x)                         No
  geglu        GELU(x1) * x2 (GeGLU)           Yes
  silu         SiLU(x) (also known as Swish)   No
  swiglu       SiLU(x1) * x2 (SwiGLU)          Yes
  relu         ReLU(x)                         No
  srelu        Squared ReLU: ReLU(x)^2         No
  qgelu        Quick GELU approximation        No

For gated activations (SwiGLU, GeGLU), the FC1 output dimension is doubled: W1 projects to 2 * ffn_hidden_size, and the output is split into gate and value paths before the element-wise product.
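The doubled FC1 and split can be sketched in NumPy (a hedged illustration of the SwiGLU path described above; the split convention shown, gate half first, is an assumption for this sketch):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def gated_fc1(norm, W1, b1):
    """FC1 for a gated activation: W1 projects to 2 * ffn_hidden_size.
    The doubled output is split into gate and value halves (SwiGLU here)."""
    h = norm @ W1 + b1                    # shape (..., 2 * ffn_hidden_size)
    x1, x2 = np.split(h, 2, axis=-1)      # gate path, value path
    return silu(x1) * x2                  # shape (..., ffn_hidden_size)

rng = np.random.default_rng(1)
hidden, ffn = 8, 32
norm = rng.standard_normal((4, hidden))
a = gated_fc1(norm, rng.standard_normal((hidden, 2 * ffn)), np.zeros(2 * ffn))
print(a.shape)  # (4, 32): element-wise product collapses back to ffn_hidden_size
```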

Usage

Use the fused LayerNorm + MLP operation when:

  • The Transformer FFN layer follows a LayerNorm -- this is the standard architecture pattern in virtually all Transformer models.
  • You need the largest performance gain from fusion, since the MLP sub-layer typically accounts for the majority of FLOPs in a Transformer layer.
  • You want to use gated activations (SwiGLU, GeGLU) as used in LLaMA, PaLM, and other modern architectures.
  • You are enabling FP8 training and want the large intermediate activations quantized automatically.

This fusion provides the largest single-module performance improvement in the Transformer Engine (TE) optimization path.
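A minimal usage sketch of the fused module, assuming the transformer_engine package and an NVIDIA GPU are available (module and argument names follow TE's PyTorch API; the sizes are illustrative):

```python
import torch
import transformer_engine.pytorch as te

# LayerNorm + FC1 + activation + FC2 in a single fused module.
mlp = te.LayerNormMLP(
    hidden_size=1024,
    ffn_hidden_size=4096,   # typically 4x hidden_size
    activation="swiglu",    # gated: FC1 projects to 2 * ffn_hidden_size
).cuda()

x = torch.randn(128, 1024, device="cuda")

# fp8_autocast enables FP8 quantization of the large intermediate tensors.
with te.fp8_autocast(enabled=True):
    y = mlp(x)              # same shape as x
```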
