
Implementation:NVIDIA TransformerEngine TE LayerNorm

From Leeroopedia


Sources: TransformerEngine, Layer Normalization
Domains: Deep_Learning, Normalization
Last Updated: 2026-02-07 14:00 GMT

Overview

te.LayerNorm is a concrete tool for hardware-accelerated layer normalization provided by NVIDIA's TransformerEngine library. It replaces torch.nn.LayerNorm with a fused CUDA kernel implementation that supports zero-centered gamma initialization and optional FP8 output casting.

Description

te.LayerNorm applies layer normalization over a mini-batch of inputs as described in the Layer Normalization paper. The computation normalizes across the inner-most dimensions specified by normalized_shape:

y = (x - E[x]) / sqrt(Var[x] + eps) * gamma + beta
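As a sanity check, the formula can be evaluated directly in plain Python (no TransformerEngine required), taking gamma and beta at their default initialization of 1 and 0 and using the biased variance, as torch.nn.LayerNorm does:

```python
import math

def layer_norm(x, eps=1e-5, gamma=1.0, beta=0.0):
    # E[x] and Var[x] over the normalized dimension
    # (biased variance, matching torch.nn.LayerNorm)
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) * gamma + beta for v in x]

y = layer_norm([1.0, 2.0, 3.0, 4.0])
# mean = 2.5, var = 1.25, so y ≈ [-1.342, -0.447, 0.447, 1.342]
```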

The class inherits from the internal _LayerNormOp (located in transformer_engine.pytorch.ops), which implements the fused CUDA kernel dispatch. The public LayerNorm class adds:

  • Legacy parameter handling: Supports the deprecated hidden_size argument (renamed to normalized_shape) and the deprecated params_dtype argument (renamed to dtype) for backward compatibility with older Megatron-LM integration code.
  • Sequence parallelism flag: Sets a sequence_parallel attribute on the weight and bias parameters for custom Megatron-LM integration logic.
  • SM margin control: Configurable sm_margin parameter to reserve streaming multiprocessors for concurrent operations such as communication kernels.

Zero-Centered Gamma

When zero_centered_gamma=True, gamma is initialized to zero and the computation becomes:

y = (x - E[x]) / sqrt(Var[x] + eps) * (1 + gamma) + beta

This means the initial forward pass performs pure normalization (identity scaling), which can improve training stability for deep models.
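A short pure-Python sketch (independent of the library's kernels) confirms that, at initialization, the zero-centered parameterization produces exactly the same output as the standard one:

```python
import math

def normalize(x, eps=1e-5):
    # Pure normalization: zero mean, unit variance (no affine transform)
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

x = [0.5, 1.5, -2.0, 3.0]

# Standard parameterization: gamma initialized to 1, output = norm(x) * gamma
gamma_std = 1.0
y_std = [v * gamma_std for v in normalize(x)]

# Zero-centered parameterization: gamma initialized to 0,
# output = norm(x) * (1 + gamma)
gamma_zc = 0.0
y_zc = [v * (1.0 + gamma_zc) for v in normalize(x)]

# Both reduce to pure normalization at initialization
```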

Usage

Use te.LayerNorm in place of torch.nn.LayerNorm in models that will be trained under TransformerEngine's FP8 autocast, or whenever fused normalization kernels are desired for performance. It is a direct drop-in replacement with compatible constructor arguments.

Code Reference

Source Location

Repository: NVIDIA/TransformerEngine
File: transformer_engine/pytorch/module/layernorm.py
Class: LayerNorm
Lines: __init__ at L59--68

Signature

class LayerNorm(torch.nn.Module):
    def __init__(
        self,
        normalized_shape: Union[Iterable[int], int, None] = None,
        eps: float = 1e-5,
        sequence_parallel: Optional[bool] = None,  # legacy
        params_dtype: Optional[torch.dtype] = None,  # deprecated
        zero_centered_gamma: bool = False,
        hidden_size: Optional[int] = None,  # deprecated
        **kwargs,
    ) -> None:

Import

from transformer_engine.pytorch import LayerNorm

# or equivalently:
import transformer_engine.pytorch as te
te.LayerNorm

I/O Contract

Inputs

  • input (torch.Tensor, required): Input tensor of any shape; normalization is applied over the last D dimensions matching normalized_shape

Outputs

  • output (torch.Tensor): Normalized tensor of the same shape as the input
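To illustrate the contract (a pure-Python sketch, not the library's fused kernel): with normalized_shape=(4,) and an input of shape (2, 4), each length-4 row is normalized independently, and the output shape matches the input shape:

```python
import math

def layer_norm_last_dim(x, eps=1e-5):
    # Normalize each innermost row independently,
    # as te.LayerNorm does with normalized_shape=(4,)
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

x = [[1.0, 2.0, 3.0, 4.0],
     [10.0, 20.0, 30.0, 40.0]]
y = layer_norm_last_dim(x)
# Output shape (2, 4), same as the input; each row normalized on its own
```

Because each row is normalized with its own statistics, the second row (the first scaled by 10) normalizes to essentially the same values as the first, up to the effect of eps.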

Key Parameters

  • normalized_shape (int or iterable of int, required): Inner dimensions of the input tensor over which to normalize
  • eps (float, default 1e-5): Small constant added to the denominator for numerical stability
  • zero_centered_gamma (bool, default False): If True, gamma is initialized to zero and the formula uses (1 + gamma) scaling
  • device (torch.device, default CUDA device): Device on which to allocate the learnable parameters (passed via **kwargs)
  • dtype (torch.dtype, default dtype): Data type of the learnable parameters (passed via **kwargs)
  • sm_margin (int or dict, default 0): Number of SMs to exclude from kernel launches; accepts a dict with keys "forward", "backward", "inference" for fine-grained control (passed via **kwargs)

Usage Examples

Basic Drop-in Replacement

import torch
import transformer_engine.pytorch as te

# Before: standard PyTorch
# layer_norm = torch.nn.LayerNorm(768)

# After: TransformerEngine drop-in replacement
layer_norm = te.LayerNorm(768)

# Usage is identical
output = layer_norm(input_tensor)

With Zero-Centered Gamma

import transformer_engine.pytorch as te

# Zero-centered gamma for improved training stability
layer_norm = te.LayerNorm(768, zero_centered_gamma=True)

# gamma starts at zero, so initial output is pure normalization
output = layer_norm(input_tensor)

Inside FP8 Training

import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling

layer_norm = te.LayerNorm(768)
linear = te.Linear(768, 3072)

# LayerNorm output can be directly consumed by FP8 linear
with te.fp8_autocast(enabled=True, fp8_recipe=DelayedScaling()):
    normed = layer_norm(input_tensor)
    output = linear(normed)
