Implementation:NVIDIA TransformerEngine Normalization Common
| Field | Value |
|---|---|
| Sources | TransformerEngine |
| Domains | Deep_Learning, Optimization |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Implements the shared normalization infrastructure used by both LayerNorm and RMSNorm, including execution plan construction, kernel dispatch, and cuDNN backend integration.
Description
normalization/common.cpp provides TeNormalizationPlan and CudnnNormalizationPlan template classes that encapsulate kernel selection, workspace management, and execution:
- Composite key system: Encodes norm type, data types (wtype, itype, otype, ctype), batch/hidden sizes, scaling mode, and other flags into a tuple key for kernel registry lookup.
- TE backend: Sets kernel launch parameters (rows, cols, epsilon, pointers) and dispatches through function pointers registered in KernelRegistry.
- cuDNN backend: Builds cuDNN frontend execution graphs for normalization.
- Plan registry: The NormalizationPlanRegistry singleton caches plans by key to avoid repeated setup.
- Type support: FP32, FP16, BF16 inputs with FP32 compute, and optional FP8 output.
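The composite-key and plan-registry ideas above can be illustrated with a minimal, self-contained sketch. Every name here (`NormType`, `NormStage`, `PlanKey`, `Plan`, `PlanRegistry`, `builds()`) is a hypothetical stand-in chosen for illustration, and the key carries far fewer fields than the real one; this is not TransformerEngine's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <tuple>

// Hypothetical stand-ins for illustration; not TransformerEngine's types.
enum class NormType { LayerNorm, RMSNorm };
enum class NormStage { Forward, Backward };

// Composite key: every field that influences kernel selection is part of the tuple.
using PlanKey = std::tuple<NormType, NormStage, std::size_t /*batch*/,
                           std::size_t /*hidden*/, bool /*zero_centered_gamma*/>;

// A plan would normally hold launch parameters and workspace sizes;
// here it only remembers the key it was built for.
struct Plan {
  explicit Plan(const PlanKey& k) : key(k) {}
  PlanKey key;
};

class PlanRegistry {
 public:
  // Singleton accessor: the registry is constructed once, on first use.
  static PlanRegistry& instance() {
    static PlanRegistry registry;
    return registry;
  }

  // Returns the cached plan for this configuration, building it on a miss.
  Plan* get(NormType type, NormStage stage, std::size_t batch,
            std::size_t hidden, bool zero_centered_gamma) {
    assert(hidden > 0 && "hidden size must be positive");
    PlanKey key{type, stage, batch, hidden, zero_centered_gamma};
    auto it = plans_.find(key);
    if (it == plans_.end()) {
      it = plans_.emplace(key, std::make_unique<Plan>(key)).first;
      ++builds_;  // counts how many plans were actually constructed
    }
    return it->second.get();
  }

  std::size_t builds() const { return builds_; }

 private:
  PlanRegistry() = default;
  std::map<PlanKey, std::unique_ptr<Plan>> plans_;
  std::size_t builds_ = 0;
};
```

Requesting the same configuration twice returns the same `Plan` pointer, while changing any key field (for example the norm type) builds a new plan. An ordered `std::map` is used here only because `std::tuple` already provides `operator<`; a hash map with a custom tuple hash would serve the same purpose.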
Usage
This module is called internally by the LayerNorm and RMSNorm API entry points. It should not typically be used directly.
Code Reference
Source Location
- Repository: NVIDIA/TransformerEngine
- File: transformer_engine/common/normalization/common.cpp
- Lines: 1-558
Signature
namespace transformer_engine { namespace normalization {
template <typename KernelParamsType>
class TeNormalizationPlan : public NormalizationPlanBase {
public:
TeNormalizationPlan(NVTE_Norm_Type, NVTE_Norm_Stage, DType wtype,
DType itype, DType otype, DType ctype,
size_t batch_size, size_t hidden_size,
size_t sm_count, bool zero_centered_gamma, bool is_tuned);
void execute(Tensor* z, void* x_dptr, void* gamma_dptr, ...);
};
TupleKeyType get_key(NVTE_Norm_Backend, NVTE_Norm_Type, NVTE_Norm_Stage, ...);
}} // namespace
Import
#include "normalization/common.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| x | void* | Yes | Input data pointer |
| gamma | void* | Yes | Gamma weight pointer |
| epsilon | float | Yes | Numerical stability epsilon |
| stream | cudaStream_t | Yes | CUDA stream |
Outputs
| Name | Type | Description |
|---|---|---|
| z | Tensor* | Normalized output tensor (optionally FP8) |
| rsigma | void* | Inverse standard deviation for backward pass |
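To make the relationship between the inputs and outputs concrete, the following is a minimal CPU reference sketch of a LayerNorm forward pass with FP32 compute. It assumes the standard LayerNorm definition (biased variance over the hidden dimension) and a row-major `rows x cols` layout; the function name `layernorm_forward_reference` and its `std::vector` interface are hypothetical and do not mirror the GPU kernels or the `execute` signature.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference, assuming the standard LayerNorm definition.
// x is row-major [rows x cols]; gamma and beta have length cols.
// Outputs: z (normalized), mean and rsigma (one value per row).
void layernorm_forward_reference(const std::vector<float>& x,
                                 const std::vector<float>& gamma,
                                 const std::vector<float>& beta,
                                 float epsilon, std::size_t rows, std::size_t cols,
                                 std::vector<float>& z,
                                 std::vector<float>& mean,
                                 std::vector<float>& rsigma) {
  assert(cols > 0);
  assert(x.size() == rows * cols);
  assert(gamma.size() == cols && beta.size() == cols);
  z.assign(rows * cols, 0.0f);
  mean.assign(rows, 0.0f);
  rsigma.assign(rows, 0.0f);

  for (std::size_t r = 0; r < rows; ++r) {
    const float* row = x.data() + r * cols;

    // Per-row mean.
    float sum = 0.0f;
    for (std::size_t c = 0; c < cols; ++c) sum += row[c];
    const float mu = sum / static_cast<float>(cols);

    // Per-row biased variance.
    float sq = 0.0f;
    for (std::size_t c = 0; c < cols; ++c) {
      const float d = row[c] - mu;
      sq += d * d;
    }
    const float var = sq / static_cast<float>(cols);

    // rsigma is the inverse standard deviation, saved for the backward pass.
    const float rs = 1.0f / std::sqrt(var + epsilon);
    mean[r] = mu;
    rsigma[r] = rs;

    // Normalize, then apply the affine parameters.
    for (std::size_t c = 0; c < cols; ++c) {
      z[r * cols + c] = (row[c] - mu) * rs * gamma[c] + beta[c];
    }
  }
}
```

Saving `mean` and `rsigma` per row is what lets a backward pass reuse the forward statistics instead of recomputing them.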
Usage Examples
// Internal usage via NormalizationPlanRegistry
auto plan = NormalizationPlanRegistry::getInstance().getNormalizationPlan(
NVTE_Norm_Backend::Te, NVTE_Norm_Type::LayerNorm,
NVTE_Norm_Stage::Forward, wtype, itype, otype, ctype,
batch_size, hidden_size, sm_count, zero_centered_gamma,
is_tuned, scaling_mode, training, gamma_in_weight_dtype);
plan->execute(z, x_dptr, gamma_dptr, beta_dptr, mean_dptr,
eps_dptr, rsigma_dptr, workspace_dptr, stream);