Implementation:Microsoft Onnxruntime CrossEntropy Declarations

Knowledge Sources	Microsoft_Onnxruntime
Domains	Training, Operators, Loss
Last Updated	2026-02-10 04:00 GMT

Overview

Declares CPU kernel classes for cross-entropy loss functions and their gradients used in ORT Training, including SoftmaxCrossEntropy and SparseSoftmaxCrossEntropy variants.

Description

The `cross_entropy.h` header declares the CPU kernels for the cross-entropy loss operators used in ONNX Runtime training (namespace `onnxruntime::contrib`). The kernels compute the loss in the forward pass and the gradient with respect to the logits in the backward pass.

LossBase: Abstract base class extending `OpKernel`. Extracts the `reduction` attribute (mean, sum, or none) from the operator info and stores it as a `ReductionType` enum. All loss kernels inherit from this.
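As a rough illustration of the attribute handling described above, the sketch below maps a reduction string to an enum value. It is standalone and does not use ONNX Runtime types: `ReductionKind` and `ParseReduction` are hypothetical names standing in for `ReductionType` and whatever conversion the real constructor performs.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the ReductionType enum named in the header.
enum class ReductionKind { kNone, kMean, kSum };

// Maps the string value of the `reduction` attribute to an enum value.
// Throws for anything other than the three documented values.
inline ReductionKind ParseReduction(const std::string& value) {
  if (value == "none") return ReductionKind::kNone;
  if (value == "mean") return ReductionKind::kMean;
  if (value == "sum") return ReductionKind::kSum;
  throw std::invalid_argument("unsupported reduction: " + value);
}
```

Parsing once at construction time means `Compute` can branch on an enum rather than compare strings on every call.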

ComputeShareSoftmaxCrossEntropyCPU<T>: A free function template that performs the softmax and log-probability computation shared by the loss kernels. It takes raw logit data, shifts the logits for numerical stability, and writes the log probabilities. Parameters: `n` (batch size), `d` (class count), `nd` (total number of elements), `logit_data` (input logits), and the pre-allocated output buffers `shifted_logit` and `log_prob_data`.
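The shifted-logit technique can be sketched in plain C++ as follows. This is an illustration under stated assumptions, not the ONNX Runtime implementation: `LogSoftmaxRows` is a hypothetical name, the buffers are assumed to be row-major with `n * d` elements, and plain loops are used instead of Eigen.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Illustrative sketch, not the ONNX Runtime implementation.
// Computes row-wise log-softmax over an n x d row-major buffer (d must be > 0).
// shifted_logit and log_prob must each have room for n * d elements.
template <typename T>
void LogSoftmaxRows(int n, int d, const T* logit, T* shifted_logit, T* log_prob) {
  for (int i = 0; i < n; ++i) {
    const std::size_t base = static_cast<std::size_t>(i) * static_cast<std::size_t>(d);
    // Subtracting the row maximum keeps every exponent <= 0, so exp() cannot overflow.
    const T row_max = *std::max_element(logit + base, logit + base + d);
    T sum_exp = T(0);
    for (int j = 0; j < d; ++j) {
      shifted_logit[base + j] = logit[base + j] - row_max;
      sum_exp += std::exp(shifted_logit[base + j]);
    }
    // log softmax(x)_j = (x_j - max) - log(sum_k exp(x_k - max))
    const T log_sum = std::log(sum_exp);
    for (int j = 0; j < d; ++j) {
      log_prob[base + j] = shifted_logit[base + j] - log_sum;
    }
  }
}
```

Without the shift, a logit around 1000 would overflow `exp` in single or double precision; with it, the largest exponent in each row is exactly zero.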

SoftmaxCrossEntropy<T>: Computes softmax cross-entropy loss where both the predictions (logits) and targets are dense tensors. The `Compute` method applies softmax to logits and computes cross-entropy with the target distribution. Non-copyable, non-movable.

SoftmaxCrossEntropyGrad<T>: Computes the gradient of the softmax cross-entropy loss with respect to the logits. Non-copyable, non-movable.
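For reference, when each target row sums to 1, the derivative of softmax cross-entropy with respect to a logit reduces to `softmax - target`, scaled by the upstream gradient. The sketch below illustrates that identity; `DenseCrossEntropyGrad` is a hypothetical name, and the scaling for "mean" (dividing by the batch size) is an assumption about the reduction rather than a statement about the actual kernel.

```cpp
#include <cmath>
#include <cstddef>

// Illustrative sketch, not the ONNX Runtime implementation.
// Assumes every target row sums to 1, so d(loss)/d(logit) = softmax - target.
template <typename T>
void DenseCrossEntropyGrad(int n, int d, T upstream, bool mean,
                           const T* log_prob, const T* target, T* grad_logit) {
  // Assumed scaling: "mean" spreads the upstream gradient over the batch.
  const T scale = (mean && n > 0) ? upstream / static_cast<T>(n) : upstream;
  const std::size_t count = static_cast<std::size_t>(n) * static_cast<std::size_t>(d);
  for (std::size_t k = 0; k < count; ++k) {
    // exp(log_prob) recovers the softmax probability saved by the forward pass.
    grad_logit[k] = (std::exp(log_prob[k]) - target[k]) * scale;
  }
}
```

Reusing the forward pass's `log_prob` output avoids recomputing the softmax in the backward pass.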

SparseSoftmaxCrossEntropy<T>: Computes softmax cross-entropy loss where the targets are sparse (class indices rather than one-hot vectors). More memory-efficient for classification tasks with many classes. Non-copyable, non-movable.

SparseSoftmaxCrossEntropyGrad<T>: Computes the gradient of the sparse softmax cross-entropy loss. Non-copyable, non-movable.
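The sparse gradient follows the same `softmax - target` form, with the one-hot target generated on the fly from the label index. The sketch below is illustrative: `SparseCrossEntropyGrad` is a hypothetical name, labels are assumed to be valid 64-bit indices, and dividing by the batch size for "mean" is an assumption.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Illustrative sketch, not the ONNX Runtime implementation.
// Assumes every label[i] lies in [0, d).
template <typename T>
void SparseCrossEntropyGrad(int n, int d, T upstream, bool mean, const T* log_prob,
                            const std::int64_t* label, T* grad_logit) {
  const T scale = (mean && n > 0) ? upstream / static_cast<T>(n) : upstream;
  for (int i = 0; i < n; ++i) {
    const std::size_t base = static_cast<std::size_t>(i) * static_cast<std::size_t>(d);
    for (int j = 0; j < d; ++j) {
      // The one-hot target is 1 at the labelled class and 0 elsewhere.
      const T one_hot = (label[i] == j) ? T(1) : T(0);
      grad_logit[base + j] = (std::exp(log_prob[base + j]) - one_hot) * scale;
    }
  }
}
```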

All kernel classes are templated on the data type `T` (typically `float`) and implement the `Compute(OpKernelContext*)` method.

Usage

These kernels are registered as ORT contrib operators. They are not called directly; the runtime invokes them when it executes a training graph that contains SoftmaxCrossEntropy or SparseSoftmaxCrossEntropy nodes.

Code Reference

Source Location

Repository: Microsoft_Onnxruntime
File: orttraining/orttraining/training_ops/cpu/loss/cross_entropy.h
Lines: 1-81

Signature

namespace onnxruntime::contrib {

class LossBase : public OpKernel {
 public:
  explicit LossBase(const OpKernelInfo& info);
 protected:
  ReductionType reduction_;
};

template <typename T>
void ComputeShareSoftmaxCrossEntropyCPU(const int n, const int d,
    const Eigen::Index nd, const T* logit_data,
    T* shifted_logit, T* log_prob_data);

template <typename T>
class SoftmaxCrossEntropy final : public LossBase {
 public:
  explicit SoftmaxCrossEntropy(const OpKernelInfo& info);
  Status Compute(OpKernelContext* context) const override;
};

template <typename T>
class SoftmaxCrossEntropyGrad final : public LossBase {
 public:
  explicit SoftmaxCrossEntropyGrad(const OpKernelInfo& info);
  Status Compute(OpKernelContext* context) const override;
};

template <typename T>
class SparseSoftmaxCrossEntropy final : public LossBase {
 public:
  explicit SparseSoftmaxCrossEntropy(const OpKernelInfo& info);
  Status Compute(OpKernelContext* context) const override;
};

template <typename T>
class SparseSoftmaxCrossEntropyGrad final : public LossBase {
 public:
  explicit SparseSoftmaxCrossEntropyGrad(const OpKernelInfo& info);
  Status Compute(OpKernelContext* context) const override;
};

}  // namespace onnxruntime::contrib

Import

#include "orttraining/training_ops/cpu/loss/cross_entropy.h"

I/O Contract

Kernel	Inputs	Outputs	Description
SoftmaxCrossEntropy	logits (N,D), targets (N,D)	loss (scalar or N), log_prob (N,D)	Computes softmax CE loss with dense targets
SoftmaxCrossEntropyGrad	grad_output, log_prob, targets	grad_logits (N,D)	Gradient of softmax CE w.r.t. logits
SparseSoftmaxCrossEntropy	logits (N,D), labels (N)	loss (scalar or N), log_prob (N,D)	Computes softmax CE loss with sparse (index) targets
SparseSoftmaxCrossEntropyGrad	grad_output, log_prob, labels	grad_logits (N,D)	Gradient of sparse softmax CE w.r.t. logits

Attribute	Type	Values	Description
reduction	string	"mean", "sum", "none"	How to reduce the loss over the batch dimension

Usage Examples

// These kernels are registered as contrib operators and invoked automatically.
// The comments below summarize the inputs and outputs of the two forward kernels.

// SoftmaxCrossEntropy is used for dense label targets:
//   Input 0: logits [batch_size, num_classes]
//   Input 1: targets [batch_size, num_classes]  (probability distribution)
//   Output 0: loss [1] (if reduction="mean" or "sum") or [batch_size]
//   Output 1: log_prob [batch_size, num_classes]

// SparseSoftmaxCrossEntropy is used for integer class labels:
//   Input 0: logits [batch_size, num_classes]
//   Input 1: labels [batch_size] (integer class indices)
//   Output 0: loss [1] (if reduction="mean" or "sum") or [batch_size]
//   Output 1: log_prob [batch_size, num_classes]
