Implementation: Microsoft Onnxruntime CPU ActivationsGrad
| Knowledge Sources | |
|---|---|
| Domains | Training, CPU_Kernels |
| Last Updated | 2026-02-10 04:00 GMT |
Overview
CPU kernels for computing activation function gradients (GELU, FastGELU, BiasGELU, BiasFastGELU) in the ONNX Runtime training framework.
Description
This file implements four activation gradient kernels: GeluGrad, FastGeluGrad, BiasGeluGrad_dX, and BiasFastGeluGrad_dX. The implementation supports two GELU computation modes: the exact (Default) mode using erf and the Approximation mode using tanh.
For GeluGrad (Default mode), the gradient is dX = dY * (0.5 * (erf(X / sqrt(2)) + 1) + X * alpha * exp(-0.5 * X^2)), where alpha = 2/sqrt(pi) * 1/sqrt(2) * 0.5 = 1/sqrt(2*pi), i.e. the standard normal density evaluated at X.
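A minimal scalar sketch of this Default-mode gradient (illustrative names, not the actual ORT helpers; the real kernel operates on `gsl::span` inputs):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Default (erf-based) GELU gradient, per the formula above.
// kAlpha = 2/sqrt(pi) * 1/sqrt(2) * 0.5 = 1/sqrt(2*pi).
void GeluGradDefault(const std::vector<float>& dY, const std::vector<float>& X,
                     std::vector<float>& dX) {
  const float kAlpha = 0.3989423f;     // 1/sqrt(2*pi)
  const float kSqrt1_2 = 0.70710678f;  // 1/sqrt(2)
  for (std::size_t i = 0; i < X.size(); ++i) {
    const float x = X[i];
    dX[i] = dY[i] * (0.5f * (std::erf(x * kSqrt1_2) + 1.0f) +
                     x * kAlpha * std::exp(-0.5f * x * x));
  }
}
```

At X = 0 the gradient reduces to 0.5 * dY, which makes a handy sanity check.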
For FastGeluGrad (Approximation mode), a loop-based implementation is used instead of Eigen to work around an Eigen bug in Windows Release builds with GPU enabled. With u = alpha * X + alpha * gamma * X^3, it computes: dX = dY * 0.5 * (tanh(u) + sech^2(u) * (alpha * X + beta * X^3) + 1), where alpha = sqrt(2/pi), gamma = 0.044715, and beta = 3 * alpha * gamma.
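The loop can be sketched as follows (names are illustrative; sech^2(u) is evaluated as 1 - tanh^2(u)):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Approximation (tanh-based) GELU gradient, written as a plain loop rather
// than an Eigen expression, mirroring the Eigen workaround described above.
void GeluGradApprox(const std::vector<float>& dY, const std::vector<float>& X,
                    std::vector<float>& dX) {
  const float kAlpha = 0.7978845608f;  // sqrt(2/pi)
  const float kGamma = 0.044715f;
  const float kBeta = 3.0f * kAlpha * kGamma;
  for (std::size_t i = 0; i < X.size(); ++i) {
    const float x = X[i];
    const float u = kAlpha * x + kAlpha * kGamma * x * x * x;
    const float tanh_u = std::tanh(u);
    const float sech_sqr_u = 1.0f - tanh_u * tanh_u;  // sech^2(u)
    dX[i] = dY[i] * 0.5f *
            (tanh_u + sech_sqr_u * (kAlpha * x + kBeta * x * x * x) + 1.0f);
  }
}
```

As in the exact mode, the gradient at X = 0 is 0.5 * dY, and a finite-difference check against the tanh GELU forward function confirms the derivative.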
The BiasGeluGrad_dX variants first compute X + B, broadcasting the 1-D bias B across the leading dimensions of X, then apply the corresponding GELU gradient computation to the biased input.
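A sketch of the broadcast step (hypothetical helper name; the actual kernel fuses and parallelizes this work):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Adds the 1-D bias B to X, broadcasting B across X's leading dimensions
// (B matches the last dimension of X). The GELU gradient of the chosen mode
// is then applied to the result, as described above.
void AddBiasBroadcast(const std::vector<float>& X, const std::vector<float>& B,
                      std::vector<float>& XPlusB) {
  const std::size_t bias_size = B.size();
  for (std::size_t i = 0; i < X.size(); ++i) {
    // Row-major layout: the last dimension is contiguous, so the bias index
    // cycles with period bias_size.
    XPlusB[i] = X[i] + B[i % bias_size];
  }
}
```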
Usage
These kernels are invoked during the backward pass of GELU activation layers, which are commonly used in transformer models (BERT, GPT). The bias variants handle fused bias+activation patterns.
Code Reference
Source Location
- Repository: Microsoft_Onnxruntime
- File: orttraining/orttraining/training_ops/cpu/activation/activations_grad.cc
- Lines: 1-175
Signature
```cpp
template <typename T>
Status ComputeGeluGradDX(gsl::span<const T> dY, gsl::span<const T> X,
                         gsl::span<T> dX, gelu_computation_mode::Default);
template <typename T>
Status ComputeGeluGradDX(gsl::span<const T> dY, gsl::span<const T> X,
                         gsl::span<T> dX, gelu_computation_mode::Approximation);
template <typename T, typename GeluComputationMode>
Status GeluGrad<T, GeluComputationMode>::Compute(OpKernelContext* context) const;
template <typename T, typename GeluComputationMode>
Status BiasGeluGrad_dX<T, GeluComputationMode>::Compute(OpKernelContext* context) const;
```
Import
```cpp
#include "orttraining/orttraining/training_ops/cpu/activation/activations_grad.h"
```
I/O Contract
Inputs (GeluGrad)
| Name | Type | Required | Description |
|---|---|---|---|
| dY | Tensor(float) | Yes | Upstream gradient |
| X | Tensor(float) | Yes | Input tensor from forward pass |
Outputs (GeluGrad)
| Name | Type | Description |
|---|---|---|
| dX | Tensor(float) | Gradient w.r.t. input X |
Inputs (BiasGeluGrad_dX)
| Name | Type | Required | Description |
|---|---|---|---|
| dY | Tensor(float) | Yes | Upstream gradient |
| X | Tensor(float) | Yes | Input tensor |
| B | Tensor(float) | Yes | Bias (1D, matching last dimension of X) |
Outputs (BiasGeluGrad_dX)
| Name | Type | Description |
|---|---|---|
| dX | Tensor(float) | Gradient w.r.t. input X |
Usage Examples
```cpp
ONNX_OPERATOR_KERNEL_EX(
    GeluGrad, kMSDomain, 1, kCpuExecutionProvider,
    KernelDefBuilder().TypeConstraint("T", DataTypeImpl::GetTensorType<float>()),
    GeluGrad<float, gelu_computation_mode::Default>);
ONNX_OPERATOR_KERNEL_EX(
    FastGeluGrad, kMSDomain, 1, kCpuExecutionProvider,
    KernelDefBuilder().TypeConstraint("T", DataTypeImpl::GetTensorType<float>()),
    GeluGrad<float, gelu_computation_mode::Approximation>);
```