
Implementation:Microsoft Onnxruntime CPU ActivationsGrad

From Leeroopedia


Knowledge Sources
Domains: Training, CPU_Kernels
Last Updated: 2026-02-10 04:00 GMT

Overview

Concrete CPU kernels for computing activation-function gradients (GELU, FastGELU, BiasGELU, BiasFastGELU) in the ONNX Runtime training framework.

Description

This file implements four activation gradient kernels: GeluGrad, FastGeluGrad, BiasGeluGrad_dX, and BiasFastGeluGrad_dX. The implementation supports two GELU computation modes: the exact (Default) mode using erf and the Approximation mode using tanh.

For GeluGrad (Default mode), the gradient is: dX = dY * (0.5 * (erf(X/sqrt(2)) + 1) + X * alpha * exp(-0.5 * X^2)), where alpha = 2/sqrt(pi) * 1/sqrt(2) * 0.5 = 1/sqrt(2*pi). This is the derivative of X * Phi(X), where Phi is the standard normal CDF: Phi(X) plus X times the normal PDF.
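The exact-mode gradient above can be sketched as a plain loop. This is an illustrative sketch, not the ONNX Runtime source: the function name and constant names are hypothetical, and the real kernel operates on gsl::span inputs.

```cpp
#include <cmath>
#include <cstddef>

// Sketch of the exact (erf-based) GELU gradient:
// dX = dY * (0.5 * (erf(X/sqrt(2)) + 1) + X * alpha * exp(-0.5 * X^2))
void GeluGradDefault(const float* dY, const float* X, float* dX, std::size_t n) {
  const float kAlpha = 0.3989422804f;    // 2/sqrt(pi) * 1/sqrt(2) * 0.5 = 1/sqrt(2*pi)
  const float kSqrt1_2 = 0.7071067812f;  // 1/sqrt(2)
  for (std::size_t i = 0; i < n; ++i) {
    const float x = X[i];
    dX[i] = dY[i] * (0.5f * (std::erf(x * kSqrt1_2) + 1.0f) +
                     x * kAlpha * std::exp(-0.5f * x * x));
  }
}
```

At X = 0 the expression reduces to dY * 0.5, a convenient sanity check for the formula.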

For FastGeluGrad (Approximation mode), a loop-based implementation is used instead of Eigen to work around an Eigen bug in Windows Release builds with GPU enabled. It computes: dX = dY * 0.5 * (tanh(alpha * X + alpha * gamma * X^3) + sech^2(alpha * X + alpha * gamma * X^3) * (alpha * X + beta * X^3) + 1), where alpha = sqrt(2/pi), gamma = 0.044715 (the tanh-approximation constant), and beta = 3 * alpha * gamma.
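The approximation-mode gradient can likewise be sketched as a scalar loop; this mirrors the loop-based structure described above, but the function name is illustrative and the constants are the commonly used tanh-GELU values, stated here as assumptions.

```cpp
#include <cmath>
#include <cstddef>

// Sketch of the tanh-approximation GELU gradient:
// dX = dY * 0.5 * (tanh(u) + sech^2(u) * (alpha*X + beta*X^3) + 1),
// with u = alpha*X + alpha*gamma*X^3.
void FastGeluGradApprox(const float* dY, const float* X, float* dX, std::size_t n) {
  const float kAlpha = 0.7978845608f;          // assumed sqrt(2/pi)
  const float kGamma = 0.044715f;              // assumed approximation constant
  const float kBeta = 3.0f * kAlpha * kGamma;  // beta = 3 * alpha * gamma
  for (std::size_t i = 0; i < n; ++i) {
    const float x = X[i];
    const float u = kAlpha * x + kAlpha * kGamma * x * x * x;
    const float t = std::tanh(u);
    const float sech2 = 1.0f - t * t;          // sech^2(u) = 1 - tanh^2(u)
    dX[i] = dY[i] * 0.5f * (t + sech2 * (kAlpha * x + kBeta * x * x * x) + 1.0f);
  }
}
```

Computing sech^2 as 1 - tanh^2 reuses the tanh result, avoiding a second transcendental evaluation per element.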

The BiasGeluGrad_dX variants first compute X + B (with broadcasting), then apply the corresponding GELU gradient computation on the biased input.
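The bias+gradient fusion can be sketched as follows, assuming the simple case of a 1-D bias broadcast over the last dimension; the function name and broadcast indexing are illustrative, and the real kernel handles broadcasting through ONNX Runtime's shape machinery.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical sketch of BiasGeluGrad_dX (exact mode): broadcast-add the
// 1-D bias B over the last dimension of X, then apply the GELU gradient
// to the biased input X + B.
void BiasGeluGradDX(const float* dY, const float* X, const float* B,
                    float* dX, std::size_t n, std::size_t bias_len) {
  const float kAlpha = 0.3989422804f;    // 1/sqrt(2*pi)
  const float kSqrt1_2 = 0.7071067812f;  // 1/sqrt(2)
  for (std::size_t i = 0; i < n; ++i) {
    const float xb = X[i] + B[i % bias_len];  // B broadcast across leading dims
    dX[i] = dY[i] * (0.5f * (std::erf(xb * kSqrt1_2) + 1.0f) +
                     xb * kAlpha * std::exp(-0.5f * xb * xb));
  }
}
```

Note that the gradient is taken with respect to the biased input, which equals the gradient with respect to X since d(X + B)/dX = 1.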

Usage

These kernels are invoked during the backward pass of GELU activation layers, which are commonly used in transformer models (BERT, GPT). The bias variants handle fused bias+activation patterns.

Code Reference

Source Location

Signature

template <typename T>
Status ComputeGeluGradDX(gsl::span<const T> dY, gsl::span<const T> X,
                         gsl::span<T> dX, gelu_computation_mode::Default);

template <typename T>
Status ComputeGeluGradDX(gsl::span<const T> dY, gsl::span<const T> X,
                         gsl::span<T> dX, gelu_computation_mode::Approximation);

template <typename T, typename GeluComputationMode>
Status GeluGrad<T, GeluComputationMode>::Compute(OpKernelContext* context) const;

template <typename T, typename GeluComputationMode>
Status BiasGeluGrad_dX<T, GeluComputationMode>::Compute(OpKernelContext* context) const;

Import

#include "orttraining/orttraining/training_ops/cpu/activation/activations_grad.h"

I/O Contract

Inputs (GeluGrad)

Name | Type | Required | Description
dY | Tensor(float) | Yes | Upstream gradient
X | Tensor(float) | Yes | Input tensor from the forward pass

Outputs (GeluGrad)

Name | Type | Description
dX | Tensor(float) | Gradient w.r.t. input X

Inputs (BiasGeluGrad_dX)

Name | Type | Required | Description
dY | Tensor(float) | Yes | Upstream gradient
X | Tensor(float) | Yes | Input tensor
B | Tensor(float) | Yes | Bias (1-D, matching the last dimension of X)

Outputs (BiasGeluGrad_dX)

Name | Type | Description
dX | Tensor(float) | Gradient w.r.t. input X

Usage Examples

ONNX_OPERATOR_KERNEL_EX(
    GeluGrad, kMSDomain, 1, kCpuExecutionProvider,
    KernelDefBuilder().TypeConstraint("T", DataTypeImpl::GetTensorType<float>()),
    GeluGrad<float, gelu_computation_mode::Default>);

ONNX_OPERATOR_KERNEL_EX(
    FastGeluGrad, kMSDomain, 1, kCpuExecutionProvider,
    KernelDefBuilder().TypeConstraint("T", DataTypeImpl::GetTensorType<float>()),
    GeluGrad<float, gelu_computation_mode::Approximation>);
