Implementation:NVIDIA TransformerEngine Normalization Common

Field	Value
Sources	TransformerEngine
Domains	Deep_Learning, Optimization
Last Updated	2026-02-07 14:00 GMT

Overview

Implements the shared normalization infrastructure used by both LayerNorm and RMSNorm, including execution plan construction, kernel dispatch, and cuDNN backend integration.

Description

normalization/common.cpp implements the TeNormalizationPlan class template and the CudnnNormalizationPlan class. Both derive from NormalizationPlanBase and encapsulate kernel selection, workspace management, and execution:

Composite key system: Encodes norm type, data types (wtype, itype, otype, ctype), batch/hidden sizes, scaling mode, and other flags into a tuple key for kernel registry lookup.
TE backend: Sets kernel launch parameters (rows, cols, epsilon, pointers) and dispatches through function pointers registered in KernelRegistry.
cuDNN backend: Builds cuDNN frontend execution graphs for normalization.
Plan registry: NormalizationPlanRegistry singleton caches plans by key to avoid repeated setup.
Type support: FP32, FP16, BF16 inputs with FP32 compute, and optional FP8 output.
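
The composite key and plan registry described above can be sketched as a singleton cache keyed by a tuple. This is a minimal illustration under stated assumptions, not the library's code: `PlanKey`, `Plan`, `PlanCache`, and the two enums are hypothetical stand-ins, and the key is cut down to four fields, whereas the real key also encodes data types, scaling mode, and other flags.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <tuple>
#include <utility>

// Hypothetical stand-ins for illustration; these are not TransformerEngine types.
enum class NormType { LayerNorm, RMSNorm };
enum class NormStage { Forward, Backward };

// Simplified composite key (assumption: only four of the real fields are kept).
using PlanKey = std::tuple<NormType, NormStage, std::size_t, std::size_t>;

struct Plan {
  PlanKey key;
  int id;  // Construction order, so that cache hits are observable.
};

class PlanCache {
 public:
  static PlanCache& instance() {
    static PlanCache cache;  // Constructed once, on first use.
    return cache;
  }

  // Returns the cached plan for this configuration, building it on first use.
  Plan* get(NormType type, NormStage stage, std::size_t batch, std::size_t hidden) {
    PlanKey key{type, stage, batch, hidden};
    auto it = plans_.find(key);
    if (it != plans_.end()) return it->second.get();
    auto plan = std::make_unique<Plan>(Plan{key, builds_});
    ++builds_;
    Plan* raw = plan.get();
    plans_.emplace(key, std::move(plan));
    return raw;
  }

  int builds() const { return builds_; }

 private:
  PlanCache() = default;
  std::map<PlanKey, std::unique_ptr<Plan>> plans_;
  int builds_ = 0;
};
```

Requesting the same configuration twice returns the same `Plan*`, so any expensive setup runs once per distinct key. Changing any key field, such as the norm type, produces a separate plan.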

Usage

This module is called internally by the LayerNorm and RMSNorm API entry points. It is not normally meant to be called directly.

Code Reference

Source Location

Repository: NVIDIA/TransformerEngine
File: transformer_engine/common/normalization/common.cpp
Lines: 1--558

Signature

namespace transformer_engine { namespace normalization {

template <typename KernelParamsType>
class TeNormalizationPlan : public NormalizationPlanBase {
public:
  TeNormalizationPlan(NVTE_Norm_Type, NVTE_Norm_Stage, DType wtype,
                      DType itype, DType otype, DType ctype,
                      size_t batch_size, size_t hidden_size,
                      size_t sm_count, bool zero_centered_gamma, bool is_tuned);
  void execute(Tensor* z, void* x_dptr, void* gamma_dptr, ...);
};

TupleKeyType get_key(NVTE_Norm_Backend, NVTE_Norm_Type, NVTE_Norm_Stage, ...);

}}  // namespace
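
As a rough illustration of how a plan templated on a parameter type can fill in launch parameters and dispatch through a registered function pointer, consider the host-only sketch below. All names (`ForwardParams`, `KernelTable`, `SketchPlan`, `double_rows`) are hypothetical and chosen for this example. The lookup by hidden size alone is an assumption made to keep the sketch short; it does not reflect the layout of the actual KernelRegistry.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>

// Hypothetical launch parameters; the real structs carry more fields.
struct ForwardParams {
  std::size_t rows = 0;
  std::size_t cols = 0;
  const float* x = nullptr;
  float* z = nullptr;
};

template <typename ParamsT>
using KernelFn = void (*)(const ParamsT&);

// Hypothetical registry mapping a hidden size to a launcher function pointer.
template <typename ParamsT>
class KernelTable {
 public:
  void add(std::size_t hidden_size, KernelFn<ParamsT> fn) { table_[hidden_size] = fn; }

  KernelFn<ParamsT> find(std::size_t hidden_size) const {
    auto it = table_.find(hidden_size);
    if (it == table_.end()) throw std::out_of_range("no kernel registered for this size");
    return it->second;
  }

 private:
  std::map<std::size_t, KernelFn<ParamsT>> table_;
};

// Stand-in "kernel" that runs on the host and doubles every element.
inline void double_rows(const ForwardParams& p) {
  for (std::size_t i = 0; i < p.rows * p.cols; ++i) p.z[i] = 2.0f * p.x[i];
}

template <typename ParamsT>
class SketchPlan {
 public:
  // Kernel selection happens once, at plan construction.
  SketchPlan(const KernelTable<ParamsT>& table, std::size_t rows, std::size_t cols)
      : launch_(table.find(cols)) {
    params_.rows = rows;
    params_.cols = cols;
  }

  // Execution only updates pointers and calls the stored function pointer.
  void execute(const float* x, float* z) {
    params_.x = x;
    params_.z = z;
    launch_(params_);
  }

 private:
  ParamsT params_{};
  KernelFn<ParamsT> launch_;
};
```

Resolving the function pointer in the constructor keeps `execute` cheap, which is the property that makes caching plans worthwhile.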

Import

#include "normalization/common.h"

I/O Contract

Inputs

Name	Type	Required	Description
`x`	`void*`	Yes	Input data pointer
`gamma`	`void*`	Yes	Gamma weight pointer
`epsilon`	`float`	Yes	Numerical stability epsilon added to the variance; passed to `execute` by pointer (`eps_dptr` in the usage example)
`stream`	`cudaStream_t`	Yes	CUDA stream

Outputs

Name	Type	Description
`z`	`Tensor*`	Normalized output tensor (optionally FP8)
`rsigma`	`void*`	Inverse standard deviation for backward pass
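
To make the meaning of `z` and `rsigma` concrete, the following host-side reference computes both for LayerNorm and RMSNorm in FP32. It is a sketch for checking values, not the library's kernel. `NormOutput` and `reference_norm` are hypothetical names, and the handling of `zero_centered_gamma` (adding 1 to the stored weight) is an assumption based on the usual meaning of that flag.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Host-side reference for illustration only; hypothetical names throughout.
struct NormOutput {
  std::vector<float> z;       // Normalized output, rows * cols elements.
  std::vector<float> rsigma;  // One inverse standard deviation (or inverse RMS) per row.
};

NormOutput reference_norm(const std::vector<float>& x, const std::vector<float>& gamma,
                          const std::vector<float>& beta, std::size_t rows, std::size_t cols,
                          float epsilon, bool rms_norm, bool zero_centered_gamma) {
  NormOutput out{std::vector<float>(rows * cols), std::vector<float>(rows)};
  for (std::size_t r = 0; r < rows; ++r) {
    const float* row = x.data() + r * cols;

    // RMSNorm skips mean subtraction, so the mean stays at zero.
    float mean = 0.0f;
    if (!rms_norm) {
      for (std::size_t c = 0; c < cols; ++c) mean += row[c];
      mean /= static_cast<float>(cols);
    }

    // Variance for LayerNorm; mean of squares for RMSNorm (mean == 0).
    float var = 0.0f;
    for (std::size_t c = 0; c < cols; ++c) {
      const float d = row[c] - mean;
      var += d * d;
    }
    var /= static_cast<float>(cols);

    const float rsigma = 1.0f / std::sqrt(var + epsilon);
    out.rsigma[r] = rsigma;

    for (std::size_t c = 0; c < cols; ++c) {
      // Assumption: a zero-centered weight stores gamma - 1.
      const float g = gamma[c] + (zero_centered_gamma ? 1.0f : 0.0f);
      const float b = rms_norm ? 0.0f : beta[c];  // RMSNorm has no bias term.
      out.z[r * cols + c] = (row[c] - mean) * rsigma * g + b;
    }
  }
  return out;
}
```

Saving `rsigma` per row lets the backward pass reuse it instead of recomputing the variance from the input.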

Usage Examples

// Internal usage via NormalizationPlanRegistry
auto plan = NormalizationPlanRegistry::getInstance().getNormalizationPlan(
    NVTE_Norm_Backend::Te, NVTE_Norm_Type::LayerNorm,
    NVTE_Norm_Stage::Forward, wtype, itype, otype, ctype,
    batch_size, hidden_size, sm_count, zero_centered_gamma,
    is_tuned, scaling_mode, training, gamma_in_weight_dtype);
plan->execute(z, x_dptr, gamma_dptr, beta_dptr, mean_dptr,
              eps_dptr, rsigma_dptr, workspace_dptr, stream);

Related Pages

Environment:NVIDIA_TransformerEngine_CUDA_Toolkit_Requirements
