Implementation:Microsoft Onnxruntime CUDA BiasGeluGrad

Knowledge Sources	Microsoft_Onnxruntime
Domains	Training, CUDA_Kernels
Last Updated	2026-02-10 04:00 GMT

Overview

CUDA training kernels in ONNX Runtime that compute the input gradient of the fused BiasGelu and BiasFastGelu activation functions.

Description

Implements two CUDA gradient operators: BiasGeluGrad_dX (exact GELU gradient) and BiasFastGeluGrad_dX (approximate GELU gradient). Both are instantiations of the same class template, BiasGeluGrad_dX<GeluComputationMode>, with the mode set to Default for the exact form and Approximation for the fast form. ComputeInternal validates that dY and X have the same shape and that B is 1-dimensional with a length equal to the last dimension of X. It then dispatches on the element type to KernelLaunchDispatcher, which calls LaunchBiasGeluGradDxKernel. The fused kernel computes the gradient of GELU(X + B) with respect to X in a single pass, combining the bias addition and the GELU derivative. Both operators are registered with MayInplace(0, 0), so the output dX may reuse the dY buffer. Supported element types are MLFloat16, float, double, and BFloat16.
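To make the fused computation concrete, the following is a minimal CPU reference sketch of what the kernel is described as computing: dX[i] = dY[i] * GELU'(X[i] + B[i mod bias_size]). It is not the ONNX Runtime implementation. The names GeluMode, GeluGradExact, GeluGradApprox, and BiasGeluGradDxReference are hypothetical, and the approximate mode is assumed to be the common tanh-based form of GELU.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical names for illustration only; not part of ONNX Runtime.
enum class GeluMode { Default, Approximation };

// Derivative of exact GELU(x) = x * Phi(x): Phi(x) + x * phi(x),
// where Phi and phi are the standard normal CDF and PDF.
inline double GeluGradExact(double x) {
  const double kInvSqrt2 = 0.70710678118654752440;    // 1 / sqrt(2)
  const double kInvSqrt2Pi = 0.39894228040143267794;  // 1 / sqrt(2 * pi)
  const double cdf = 0.5 * (1.0 + std::erf(x * kInvSqrt2));
  const double pdf = kInvSqrt2Pi * std::exp(-0.5 * x * x);
  return cdf + x * pdf;
}

// Derivative of the tanh approximation (assumed form):
// 0.5 * x * (1 + tanh(u)), with u = sqrt(2/pi) * (x + 0.044715 * x^3).
inline double GeluGradApprox(double x) {
  const double kAlpha = 0.79788456080286535588;  // sqrt(2 / pi)
  const double kGamma = 0.044715;
  const double u = kAlpha * (x + kGamma * x * x * x);
  const double t = std::tanh(u);
  const double du_dx = kAlpha * (1.0 + 3.0 * kGamma * x * x);
  return 0.5 * (1.0 + t) + 0.5 * x * (1.0 - t * t) * du_dx;
}

// CPU reference: dX[i] = dY[i] * gelu'(X[i] + B[i % bias_size]).
// X is treated as a flat row-major buffer whose last dimension has
// bias_size elements. dY and dX may alias, because element i reads
// only dY[i] before writing dX[i].
inline void BiasGeluGradDxReference(GeluMode mode, int64_t input_size,
                                    int64_t bias_size, const double* dY,
                                    const double* X, const double* B,
                                    double* dX) {
  for (int64_t i = 0; i < input_size; ++i) {
    const double z = X[i] + B[i % bias_size];
    const double g = (mode == GeluMode::Default) ? GeluGradExact(z)
                                                 : GeluGradApprox(z);
    dX[i] = dY[i] * g;
  }
}
```

Because each output element depends only on the same-index element of dY, passing the same pointer for dY and dX in this sketch is safe, which mirrors the buffer reuse that MayInplace(0, 0) permits.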

Usage

Invoked during the backward pass when the model uses fused BiasGelu or BiasFastGelu activation layers, commonly found in transformer architectures.

Code Reference

Source Location

Repository: Microsoft_Onnxruntime
File: orttraining/orttraining/training_ops/cuda/activation/bias_gelu_grad.cc
Lines: 1-80

Signature

template <typename GeluComputationMode>
class BiasGeluGrad_dX : public CudaKernel {
  template <typename T>
  struct KernelLaunchDispatcher {
    void operator()(cudaStream_t stream, int64_t input_size, int64_t bias_size,
                    const Tensor& dY, const Tensor& X, const Tensor& B, Tensor& dX) const;
  };
  Status ComputeInternal(OpKernelContext* context) const;
};

Import

#include "orttraining/training_ops/cuda/activation/bias_gelu_grad.h"

I/O Contract

Inputs

Name	Type	Required	Description
dY	Tensor(T)	Yes	Upstream gradient with same shape as X
X	Tensor(T)	Yes	Original input to BiasGelu
B	Tensor(T)	Yes	Bias vector (1D, matching last dim of X)

Outputs

Name	Type	Description
dX	Tensor(T)	Gradient with respect to input (may reuse dY buffer)
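The I/O contract above can be expressed as a small shape check. This is a sketch under stated assumptions, not the validation code in ComputeInternal: the function name CheckBiasGeluGradShapes is hypothetical, and shapes are modeled as plain vectors of dimensions rather than ONNX Runtime tensor shapes. On success it also derives the two sizes that appear in the KernelLaunchDispatcher signature, input_size and bias_size.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper for illustration; not part of ONNX Runtime.
// Returns true when the shapes satisfy the contract:
//   - dY has the same shape as X,
//   - B is 1-D and its length equals the last dimension of X.
// On success, writes the flat element count of X and the bias length.
inline bool CheckBiasGeluGradShapes(const std::vector<int64_t>& dy_shape,
                                    const std::vector<int64_t>& x_shape,
                                    const std::vector<int64_t>& b_shape,
                                    int64_t* input_size, int64_t* bias_size) {
  if (x_shape.empty()) return false;   // X must have a last dimension
  if (dy_shape != x_shape) return false;
  if (b_shape.size() != 1) return false;
  if (b_shape[0] != x_shape.back()) return false;

  int64_t count = 1;
  for (int64_t d : x_shape) count *= d;  // total number of elements in X
  *input_size = count;
  *bias_size = b_shape[0];
  return true;
}
```

For example, X and dY of shape {2, 3, 4} with B of shape {4} pass the check and give input_size = 24 and bias_size = 4, while B of shape {3} or {1, 4} is rejected.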

Usage Examples

// Exact GELU gradient
ONNX_OPERATOR_KERNEL_EX(BiasGeluGrad_dX, kMSDomain, 1, kCudaExecutionProvider,
    (*KernelDefBuilder::Create())
        .TypeConstraint("T", BuildKernelDefConstraints<MLFloat16, float, double, BFloat16>())
        .MayInplace(0, 0),
    BiasGeluGrad_dX<gelu_computation_mode::Default>);

Related Pages

Environment:Microsoft_Onnxruntime_CUDA_GPU_Environment
