Implementation: Microsoft Onnxruntime CPU ActivationsGrad
| Knowledge Sources | |
|---|---|
| Domains | Training, CPU_Kernels |
| Last Updated | 2026-02-10 04:00 GMT |
Overview
CPU kernels for computing activation function gradients (GELU, FastGELU, BiasGELU, BiasFastGELU) in the ONNX Runtime training framework.
Description
This file implements four activation gradient kernels: GeluGrad, FastGeluGrad, BiasGeluGrad_dX, and BiasFastGeluGrad_dX. The implementation supports two GELU computation modes: the exact (Default) mode using erf and the Approximation mode using tanh.
For GeluGrad (Default mode), the gradient is dX = dY * (0.5 * (erf(X / sqrt(2)) + 1) + X * alpha * exp(-0.5 * X^2)), where alpha = 2/sqrt(pi) * 1/sqrt(2) * 0.5 = 1/sqrt(2*pi), i.e. the standard normal density evaluated at X.
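A minimal scalar sketch of this Default-mode gradient (illustrative names, not the actual ORT helpers; the real kernel operates on `gsl::span` inputs):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Default (erf-based) GELU gradient, per the formula above.
// kAlpha = 2/sqrt(pi) * 1/sqrt(2) * 0.5 = 1/sqrt(2*pi).
void GeluGradDefault(const std::vector<float>& dY, const std::vector<float>& X,
                     std::vector<float>& dX) {
  const float kAlpha = 0.3989423f;     // 1/sqrt(2*pi)
  const float kSqrt1_2 = 0.70710678f;  // 1/sqrt(2)
  for (std::size_t i = 0; i < X.size(); ++i) {
    const float x = X[i];
    dX[i] = dY[i] * (0.5f * (std::erf(x * kSqrt1_2) + 1.0f) +
                     x * kAlpha * std::exp(-0.5f * x * x));
  }
}
```

At X = 0 the gradient reduces to 0.5 * dY, which makes a handy sanity check.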
For FastGeluGrad (Approximation mode), a loop-based implementation is used instead of Eigen to work around an Eigen bug in Windows Release builds with GPU enabled. With u = alpha * X + alpha * gamma * X^3, it computes: dX = dY * 0.5 * (tanh(u) + sech^2(u) * (alpha * X + beta * X^3) + 1), where alpha = sqrt(2/pi), gamma = 0.044715, and beta = 3 * alpha * gamma.
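The loop can be sketched as follows (names are illustrative; sech^2(u) is evaluated as 1 - tanh^2(u)):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Approximation (tanh-based) GELU gradient, written as a plain loop rather
// than an Eigen expression, mirroring the Eigen workaround described above.
void GeluGradApprox(const std::vector<float>& dY, const std::vector<float>& X,
                    std::vector<float>& dX) {
  const float kAlpha = 0.7978845608f;  // sqrt(2/pi)
  const float kGamma = 0.044715f;
  const float kBeta = 3.0f * kAlpha * kGamma;
  for (std::size_t i = 0; i < X.size(); ++i) {
    const float x = X[i];
    const float u = kAlpha * x + kAlpha * kGamma * x * x * x;
    const float tanh_u = std::tanh(u);
    const float sech_sqr_u = 1.0f - tanh_u * tanh_u;  // sech^2(u)
    dX[i] = dY[i] * 0.5f *
            (tanh_u + sech_sqr_u * (kAlpha * x + kBeta * x * x * x) + 1.0f);
  }
}
```

As in the exact mode, the gradient at X = 0 is 0.5 * dY, and a finite-difference check against the tanh GELU forward function confirms the derivative.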
The BiasGeluGrad_dX variants first compute X + B, broadcasting the 1-D bias B across the leading dimensions of X, then apply the corresponding GELU gradient computation to the biased input.
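A sketch of the broadcast step (hypothetical helper name; the actual kernel fuses and parallelizes this work):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Adds the 1-D bias B to X, broadcasting B across X's leading dimensions
// (B matches the last dimension of X). The GELU gradient of the chosen mode
// is then applied to the result, as described above.
void AddBiasBroadcast(const std::vector<float>& X, const std::vector<float>& B,
                      std::vector<float>& XPlusB) {
  const std::size_t bias_size = B.size();
  for (std::size_t i = 0; i < X.size(); ++i) {
    // Row-major layout: the last dimension is contiguous, so the bias index
    // cycles with period bias_size.
    XPlusB[i] = X[i] + B[i % bias_size];
  }
}
```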
Usage
These kernels are invoked during the backward pass of GELU activation layers, which are commonly used in transformer models (BERT, GPT). The bias variants handle fused bias+activation patterns.
Code Reference
Source Location
- Repository: Microsoft_Onnxruntime
- File: orttraining/orttraining/training_ops/cpu/activation/activations_grad.cc
- Lines: 1-175
Signature
```cpp
template <typename T>
Status ComputeGeluGradDX(gsl::span<const T> dY, gsl::span<const T> X,
                         gsl::span<T> dX, gelu_computation_mode::Default);
template <typename T>
Status ComputeGeluGradDX(gsl::span<const T> dY, gsl::span<const T> X,
                         gsl::span<T> dX, gelu_computation_mode::Approximation);
template <typename T, typename GeluComputationMode>
Status GeluGrad<T, GeluComputationMode>::Compute(OpKernelContext* context) const;
template <typename T, typename GeluComputationMode>
Status BiasGeluGrad_dX<T, GeluComputationMode>::Compute(OpKernelContext* context) const;
```
Import
```cpp
#include "orttraining/orttraining/training_ops/cpu/activation/activations_grad.h"
```
I/O Contract
Inputs (GeluGrad)
| Name | Type | Required | Description |
|---|---|---|---|
| dY | Tensor(float) | Yes | Upstream gradient |
| X | Tensor(float) | Yes | Input tensor from forward pass |
Outputs (GeluGrad)
| Name | Type | Description |
|---|---|---|
| dX | Tensor(float) | Gradient w.r.t. input X |
Inputs (BiasGeluGrad_dX)
| Name | Type | Required | Description |
|---|---|---|---|
| dY | Tensor(float) | Yes | Upstream gradient |
| X | Tensor(float) | Yes | Input tensor |
| B | Tensor(float) | Yes | Bias (1D, matching last dimension of X) |
Outputs (BiasGeluGrad_dX)
| Name | Type | Description |
|---|---|---|
| dX | Tensor(float) | Gradient w.r.t. input X |
Usage Examples
```cpp
ONNX_OPERATOR_KERNEL_EX(
    GeluGrad, kMSDomain, 1, kCpuExecutionProvider,
    KernelDefBuilder().TypeConstraint("T", DataTypeImpl::GetTensorType<float>()),
    GeluGrad<float, gelu_computation_mode::Default>);
ONNX_OPERATOR_KERNEL_EX(
    FastGeluGrad, kMSDomain, 1, kCpuExecutionProvider,
    KernelDefBuilder().TypeConstraint("T", DataTypeImpl::GetTensorType<float>()),
    GeluGrad<float, gelu_computation_mode::Approximation>);
```