Implementation: Microsoft Onnxruntime CPU LayerNormGrad
| Knowledge Sources | |
|---|---|
| Domains | Training, CPU_Kernels |
| Last Updated | 2026-02-10 04:00 GMT |
Overview
Concrete tool for computing layer normalization gradients on CPU in the ONNX Runtime training framework.
Description
This file implements three layer normalization gradient kernels: LayerNormGrad (standard), SimplifiedLayerNormalizationGrad (simplified, without mean subtraction or bias), and InvertibleLayerNormGrad (which reconstructs the normalized input from Y, scale, and bias, so the original input X does not need to be saved from the forward pass). All are registered under kMSDomain opset 1 for float and double types.
The standard LayerNormGrad computes gradients through three intermediate quantities: A = dY * (X - mean) * inv_std_var, B = dY * scale * inv_std_var, and C = B * (X - mean) * inv_std_var. The input gradient is dX = B - mean(B) - (X - mean) * inv_std_var * mean(C), where the means are taken over the normalized axis. The scale gradient is d_scale = sum(A) and the bias gradient is d_bias = sum(dY), both summed over the batch rows. The simplified variant omits the mean subtraction and produces no bias gradient. The invertible variant recovers the normalized input (X - mean) * inv_std_var as (Y - bias) / scale.
Usage
These kernels are invoked during the backward pass of layer normalization operations. They are commonly used in transformer architectures for training.
Code Reference
Source Location
- Repository: Microsoft_Onnxruntime
- File: orttraining/orttraining/training_ops/cpu/nn/layer_norm.cc
- Lines: 1-199
Signature
```cpp
template <typename T, bool simplified>
LayerNormGrad<T, simplified>::LayerNormGrad(const OpKernelInfo& op_kernel_info);

template <typename T, bool simplified>
Status LayerNormGrad<T, simplified>::Compute(OpKernelContext* op_kernel_context) const;

template <typename T>
InvertibleLayerNormGrad<T>::InvertibleLayerNormGrad(const OpKernelInfo& op_kernel_info);

template <typename T>
Status InvertibleLayerNormGrad<T>::Compute(OpKernelContext* op_kernel_context) const;
```
Import
```cpp
#include "orttraining/orttraining/training_ops/cpu/nn/layer_norm.h"
```
I/O Contract
Inputs (LayerNormGrad)
| Name | Type | Required | Description |
|---|---|---|---|
| Y_grad | Tensor(T) | Yes | Upstream gradient [N, M] |
| X | Tensor(T) | Yes | Input tensor from forward [N, M] |
| scale | Tensor(T) | Yes | Scale parameter [M] |
| mean | Tensor(float) | Yes (std) / No (simplified) | Saved mean [N] |
| inv_std_var | Tensor(float) | Yes | Saved inverse standard deviation [N] |
Outputs (LayerNormGrad)
| Name | Type | Description |
|---|---|---|
| X_grad | Tensor(T) | Gradient w.r.t. input X |
| scale_grad | Tensor(T) | Gradient w.r.t. scale |
| bias_grad | Tensor(T) | Gradient w.r.t. bias (not produced in simplified mode) |
Inputs (InvertibleLayerNormGrad)
| Name | Type | Required | Description |
|---|---|---|---|
| Y_grad | Tensor(T) | Yes | Upstream gradient |
| Y | Tensor(T) | Yes | Output from forward pass |
| scale | Tensor(T) | Yes | Scale parameter |
| bias | Tensor(T) | Yes | Bias parameter |
| inv_std_var | Tensor(float) | Yes | Saved inverse standard deviation |
Outputs (InvertibleLayerNormGrad)
| Name | Type | Description |
|---|---|---|
| X_grad | Tensor(T) | Gradient w.r.t. input X |
| scale_grad | Tensor(T) | Gradient w.r.t. scale |
| bias_grad | Tensor(T) | Gradient w.r.t. bias |
Usage Examples
```cpp
ONNX_OPERATOR_TYPED_KERNEL_EX(
    LayerNormalizationGrad, kMSDomain, 1, float, kCpuExecutionProvider,
    KernelDefBuilder()
        .TypeConstraint("T", DataTypeImpl::GetTensorType<float>())
        .TypeConstraint("U", DataTypeImpl::GetTensorType<float>())
        .TypeConstraint("V", DataTypeImpl::GetTensorType<float>()),
    LayerNormGrad<float, false>);
```