
Implementation:NVIDIA TransformerEngine Normalization Common



Field         Value
Sources       TransformerEngine
Domains       Deep_Learning, Optimization
Last Updated  2026-02-07 14:00 GMT

Overview

Implements the shared normalization infrastructure used by both LayerNorm and RMSNorm, including execution plan construction, kernel dispatch, and cuDNN backend integration.

Description

normalization/common.cpp provides the TeNormalizationPlan and CudnnNormalizationPlan plan classes, which encapsulate kernel selection, workspace management, and execution. Its main pieces are:

  • Composite key system: Encodes norm type, data types (wtype, itype, otype, ctype), batch/hidden sizes, scaling mode, and other flags into a tuple key for kernel registry lookup.
  • TE backend: Sets kernel launch parameters (rows, cols, epsilon, pointers) and dispatches through function pointers registered in KernelRegistry.
  • cuDNN backend: Builds cuDNN frontend execution graphs for normalization.
  • Plan registry: NormalizationPlanRegistry singleton caches plans by key to avoid repeated setup.
  • Type support: FP32, FP16, BF16 inputs with FP32 compute, and optional FP8 output.
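The composite key and plan registry described in the list above can be illustrated with a small sketch. This is a simplified model, not the TransformerEngine code: `PlanCache`, `PlanKey`, `Plan`, and the reduced set of key fields are hypothetical names chosen for illustration, and `std::map` is used instead of a hashed container only to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <tuple>

// Hypothetical stand-ins; these are NOT the TransformerEngine definitions.
enum class NormType { LayerNorm, RMSNorm };
enum class DType { FP32, FP16, BF16 };

// Composite key: every field that affects kernel selection takes part in lookup.
using PlanKey =
    std::tuple<NormType, DType, DType, std::size_t, std::size_t, bool>;

struct Plan {
  PlanKey key;
  // A real plan would also hold launch parameters and workspace sizes.
};

class PlanCache {
 public:
  static PlanCache& instance() {
    static PlanCache cache;  // constructed once, on first use
    return cache;
  }

  // Returns the cached plan for this configuration, building it only once.
  Plan* get(NormType type, DType itype, DType otype, std::size_t batch_size,
            std::size_t hidden_size, bool zero_centered_gamma) {
    PlanKey key{type, itype, otype, batch_size, hidden_size,
                zero_centered_gamma};
    auto it = plans_.find(key);
    if (it == plans_.end()) {
      ++builds_;  // count how often the expensive setup path runs
      it = plans_.emplace(key, std::make_unique<Plan>(Plan{key})).first;
    }
    return it->second.get();
  }

  std::size_t builds() const { return builds_; }

 private:
  PlanCache() = default;
  // std::tuple compares lexicographically, so std::map needs no custom hash.
  std::map<PlanKey, std::unique_ptr<Plan>> plans_;
  std::size_t builds_ = 0;
};
```

Requesting the same configuration twice returns the same `Plan` pointer and increments `builds()` only once, while changing any key field (for example the norm type) produces a separate plan.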

Usage

This module is called internally by the LayerNorm and RMSNorm API entry points and is not normally intended for direct use.

Code Reference

Source Location

Repository: NVIDIA/TransformerEngine
File: transformer_engine/common/normalization/common.cpp
Lines: 1-558

Signature

namespace transformer_engine { namespace normalization {

template <typename KernelParamsType>
class TeNormalizationPlan : public NormalizationPlanBase {
public:
  TeNormalizationPlan(NVTE_Norm_Type, NVTE_Norm_Stage, DType wtype,
                      DType itype, DType otype, DType ctype,
                      size_t batch_size, size_t hidden_size,
                      size_t sm_count, bool zero_centered_gamma, bool is_tuned);
  void execute(Tensor* z, void* x_dptr, void* gamma_dptr, ...);
};

TupleKeyType get_key(NVTE_Norm_Backend, NVTE_Norm_Type, NVTE_Norm_Stage, ...);

}}  // namespace

Import

#include "normalization/common.h"

I/O Contract

Inputs

Name     Type          Required  Description
x        void*         Yes       Input data pointer
gamma    void*         Yes       Gamma weight pointer
epsilon  float         Yes       Numerical stability epsilon (passed to execute through a pointer, eps_dptr)
stream   cudaStream_t  Yes       CUDA stream

Outputs

Name    Type     Description
z       Tensor*  Normalized output tensor (optionally FP8)
rsigma  void*    Inverse standard deviation for backward pass
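To make the meaning of these quantities concrete, here is a plain CPU reference for the LayerNorm forward pass in FP32. It is a sketch of the standard formula only, not the GPU kernels in this module; `layernorm_forward_ref` and `LayerNormResult` are hypothetical names, and the row-major layout is an assumption made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result holder; not a TransformerEngine type.
struct LayerNormResult {
  std::vector<float> z;       // normalized output, rows * cols values
  std::vector<float> mean;    // per-row mean
  std::vector<float> rsigma;  // per-row 1 / sqrt(variance + epsilon)
};

// Reference LayerNorm forward over a row-major [rows x cols] matrix.
LayerNormResult layernorm_forward_ref(const std::vector<float>& x,
                                      const std::vector<float>& gamma,
                                      const std::vector<float>& beta,
                                      std::size_t rows, std::size_t cols,
                                      float epsilon) {
  LayerNormResult out{std::vector<float>(rows * cols),
                      std::vector<float>(rows), std::vector<float>(rows)};
  const float n = static_cast<float>(cols);
  for (std::size_t r = 0; r < rows; ++r) {
    const float* row = x.data() + r * cols;

    float sum = 0.0f;
    for (std::size_t c = 0; c < cols; ++c) sum += row[c];
    const float mu = sum / n;

    float sq = 0.0f;  // sum of squared deviations from the mean
    for (std::size_t c = 0; c < cols; ++c) {
      const float d = row[c] - mu;
      sq += d * d;
    }
    // Saved so a backward pass can reuse it instead of recomputing.
    const float rs = 1.0f / std::sqrt(sq / n + epsilon);

    out.mean[r] = mu;
    out.rsigma[r] = rs;
    for (std::size_t c = 0; c < cols; ++c) {
      out.z[r * cols + c] = (row[c] - mu) * rs * gamma[c] + beta[c];
    }
  }
  return out;
}
```

For a single row `{1, 2, 3, 4}` with `gamma = 1`, `beta = 0`, and `epsilon = 0`, the mean is 2.5, the variance is 1.25, and the normalized outputs sum to zero.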

Usage Examples

// Internal usage via NormalizationPlanRegistry
auto plan = NormalizationPlanRegistry::getInstance().getNormalizationPlan(
    NVTE_Norm_Backend::Te, NVTE_Norm_Type::LayerNorm,
    NVTE_Norm_Stage::Forward, wtype, itype, otype, ctype,
    batch_size, hidden_size, sm_count, zero_centered_gamma,
    is_tuned, scaling_mode, training, gamma_in_weight_dtype);
plan->execute(z, x_dptr, gamma_dptr, beta_dptr, mean_dptr,
              eps_dptr, rsigma_dptr, workspace_dptr, stream);
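The Description notes that the TE backend sets launch parameters and dispatches through function pointers held in a kernel registry. A toy version of that dispatch step might look like the following. All names here (`ToyKernelRegistry`, `LaunchParams`, `copy_kernel`) are hypothetical, the key is reduced to two fields, and the "kernel" is an ordinary CPU function so the sketch runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <utility>

// Hypothetical launch parameters, loosely mirroring the fields the page lists
// (rows, cols, epsilon, pointers). Not the TransformerEngine struct.
struct LaunchParams {
  std::size_t rows = 0;
  std::size_t cols = 0;
  float epsilon = 0.0f;
  const float* x = nullptr;
  float* z = nullptr;
};

using KernelFn = void (*)(const LaunchParams&);

class ToyKernelRegistry {
 public:
  // Reduced key: (norm type id, hidden size). The real key has more fields.
  using Key = std::pair<int, std::size_t>;

  void add(int norm_type, std::size_t hidden_size, KernelFn fn) {
    table_[Key{norm_type, hidden_size}] = fn;
  }

  // Looks up the function pointer registered for this configuration.
  KernelFn find(int norm_type, std::size_t hidden_size) const {
    auto it = table_.find(Key{norm_type, hidden_size});
    if (it == table_.end()) {
      throw std::out_of_range("no kernel registered for this configuration");
    }
    return it->second;
  }

 private:
  std::map<Key, KernelFn> table_;
};

// Placeholder "kernel": copies input to output so the dispatch is observable.
void copy_kernel(const LaunchParams& p) {
  for (std::size_t i = 0; i < p.rows * p.cols; ++i) p.z[i] = p.x[i];
}
```

A caller would register `copy_kernel` under a key, fill in a `LaunchParams`, and invoke `registry.find(type, hidden_size)(params)`. Looking up a configuration that was never registered throws, which makes unsupported shapes fail loudly rather than silently.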
