Implementation:Microsoft Onnxruntime CUDA BatchScale

Knowledge Sources	Microsoft_Onnxruntime
Domains	Training, CUDA_Kernels
Last Updated	2026-02-10 04:00 GMT

Overview

CUDA kernel in the ONNX Runtime training operators that scales a single tensor by two or three different factors in one kernel launch.

Description

Implements the BatchScale operator for CUDA. The operator takes a single input tensor and produces two or three output tensors in one kernel launch, each equal to the input multiplied by a different float factor (scale0_, scale1_, and optionally scale2_). The BatchScaleFunctor template dispatches to BatchScaleImpl, which performs the type-specific scaling on the GPU. Fusing the work this way avoids running a separate scale operation per consumer when the same tensor must be scaled differently for each. Supported element types are MLFloat16, float, double, and BFloat16.
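The semantics described above can be illustrated with a small CPU reference. This is only a sketch under stated assumptions, not the CUDA kernel: the function name BatchScaleReference, the use of std::vector for tensors, and std::optional for the third factor are all hypothetical choices made for illustration. It shows each input element being read once and written to two or three outputs.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical CPU reference for the BatchScale semantics (illustration only;
// the real operator runs as a CUDA kernel via BatchScaleImpl).
// Returns 2 outputs, or 3 when scale2 is provided.
template <typename T>
std::vector<std::vector<T>> BatchScaleReference(
    const std::vector<T>& input, float scale0, float scale1,
    std::optional<float> scale2 = std::nullopt) {
  const std::size_t num_outputs = scale2.has_value() ? 3 : 2;
  std::vector<std::vector<T>> outputs(num_outputs,
                                      std::vector<T>(input.size()));
  for (std::size_t i = 0; i < input.size(); ++i) {
    const T x = input[i];  // each element is read once for all outputs
    outputs[0][i] = static_cast<T>(x * scale0);
    outputs[1][i] = static_cast<T>(x * scale1);
    if (scale2.has_value()) {
      outputs[2][i] = static_cast<T>(x * *scale2);
    }
  }
  return outputs;
}
```

Calling BatchScaleReference<float>(input, 0.5f, 2.0f) yields two vectors the same size as the input; passing a third factor yields three. The point of the fused form is that the input is traversed once, whatever the number of outputs.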

Usage

Used during training when a single tensor needs to be distributed to multiple consumers with different scaling factors, such as in gradient scaling or loss weighting scenarios.

Code Reference

Source Location

Repository: Microsoft_Onnxruntime
File: orttraining/orttraining/training_ops/cuda/math/batch_scale.cc
Lines: 1-68

Signature

class BatchScale : public CudaKernel {
  Status ComputeInternal(OpKernelContext* context) const;
};

Import

#include "orttraining/training_ops/cuda/math/batch_scale.h"

I/O Contract

Inputs

Name	Type	Required	Description
input	Tensor(T)	Yes	Input tensor to scale

Outputs

Name	Type	Description
output_0	Tensor(T)	Input scaled by scale0_
output_1	Tensor(T)	Input scaled by scale1_
output_2	Tensor(T)	Input scaled by scale2_ (optional, only if scale2_ is set)

Usage Examples

ONNX_OPERATOR_KERNEL_EX(
    BatchScale, kMSDomain, 1, kCudaExecutionProvider,
    (*KernelDefBuilder::Create())
        .TypeConstraint("T", BuildKernelDefConstraints<MLFloat16, float, double, BFloat16>()),
    BatchScale);

Related Pages

Environment:Microsoft_Onnxruntime_CUDA_GPU_Environment
