Implementation:Microsoft Onnxruntime CUDA BatchScale
| Knowledge Sources | |
|---|---|
| Domains | Training, CUDA_Kernels |
| Last Updated | 2026-02-10 04:00 GMT |
Overview
CUDA kernel from the ONNX Runtime training operators that scales one tensor by several scale factors in a single pass.
Description
Implements the BatchScale operator for CUDA, which produces two or three scaled copies of a single input tensor in one kernel launch. Each output is the input multiplied by its own float factor: scale0_, scale1_, and, when set, scale2_. The BatchScaleFunctor template forwards to BatchScaleImpl, which performs the type-specific scaling on the GPU. Doing all the scaling in one launch avoids running a separate scale operation for every consumer that needs the same tensor at a different scale. Supported element types are MLFloat16, float, double, and BFloat16.
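The semantics described above can be illustrated with a small CPU reference. This is a minimal sketch, not the ONNX Runtime kernel: the function name BatchScaleReference is hypothetical, the tensor is modeled as a flat std::vector, and the choice to multiply in the promoted type of T and float is an assumption (the actual CUDA implementation may use a different intermediate precision).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for the BatchScale semantics; NOT the ORT CUDA kernel.
// Each input element is read once and written to every output with its own scale.
template <typename T>
std::vector<std::vector<T>> BatchScaleReference(const std::vector<T>& input,
                                                const std::vector<float>& scales) {
  // The operator described above produces either 2 or 3 outputs.
  assert(scales.size() == 2 || scales.size() == 3);

  std::vector<std::vector<T>> outputs(scales.size(), std::vector<T>(input.size()));
  for (std::size_t i = 0; i < input.size(); ++i) {
    const T x = input[i];  // single read of the input element
    for (std::size_t k = 0; k < scales.size(); ++k) {
      // Assumption: multiply in the promoted type, then convert back to T.
      outputs[k][i] = static_cast<T>(x * scales[k]);
    }
  }
  return outputs;
}
```

Calling BatchScaleReference<float> with input {1, 2, 4} and scales {0.5, 2.0, 0.25} yields three vectors of the same length as the input. The point of the single-pass structure is that the input is traversed once regardless of how many outputs are requested.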
Usage
Used during training when a single tensor needs to be distributed to multiple consumers with different scaling factors, such as in gradient scaling or loss weighting scenarios.
Code Reference
Source Location
- Repository: Microsoft_Onnxruntime
- File: orttraining/orttraining/training_ops/cuda/math/batch_scale.cc
- Lines: 1-68
Signature
class BatchScale : public CudaKernel {
  Status ComputeInternal(OpKernelContext* context) const;
};
Import
#include "orttraining/training_ops/cuda/math/batch_scale.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input | Tensor(T) | Yes | Input tensor to scale |
Outputs
| Name | Type | Description |
|---|---|---|
| output_0 | Tensor(T) | Input scaled by scale0_ |
| output_1 | Tensor(T) | Input scaled by scale1_ |
| output_2 | Tensor(T) | Input scaled by scale2_ (optional; produced only when scale2_ is set) |
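The contract above implies that the number of outputs depends on whether the third scale is set. A minimal sketch of that rule follows, assuming the optional factor is modeled with std::optional (requires C++17). The struct BatchScaleAttrs and the helpers ActiveScales and OutputCount are hypothetical names for illustration, not part of the ONNX Runtime API.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-in for the kernel's scale attributes. The fields mirror
// scale0_, scale1_, and scale2_ from the description; the struct is illustrative.
struct BatchScaleAttrs {
  float scale0;
  float scale1;
  std::optional<float> scale2;  // unset means only two outputs are produced
};

// Collects the active scale factors in output order: output_0, output_1, [output_2].
inline std::vector<float> ActiveScales(const BatchScaleAttrs& attrs) {
  std::vector<float> scales{attrs.scale0, attrs.scale1};
  if (attrs.scale2.has_value()) {
    scales.push_back(*attrs.scale2);
  }
  return scales;
}

// Number of outputs implied by the attributes: 2, or 3 when scale2 is set.
inline std::size_t OutputCount(const BatchScaleAttrs& attrs) {
  const std::size_t count = ActiveScales(attrs).size();
  assert(count == 2 || count == 3);
  return count;
}
```

With scale2 left unset, OutputCount returns 2 and output_2 does not exist; setting it adds the third output at index 2.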
Usage Examples
ONNX_OPERATOR_KERNEL_EX(
    BatchScale, kMSDomain, 1, kCudaExecutionProvider,
    (*KernelDefBuilder::Create())
        .TypeConstraint("T", BuildKernelDefConstraints<MLFloat16, float, double, BFloat16>()),
    BatchScale);