Implementation:Microsoft Onnxruntime CUDA GradientControl
| Knowledge Sources | |
|---|---|
| Domains | Training, CUDA_Kernels |
| Last Updated | 2026-02-10 04:00 GMT |
Overview
Concrete tool for gradient accumulation and zeroing operations in the ONNX Runtime CUDA training framework.
Description
Implements three gradient control operators for CUDA:
- ZeroGradient zeros gradient tensors in place using cudaMemsetAsync. It is registered for float and MLFloat16.
- InPlaceAccumulator adds a gradient tensor to an existing buffer in place using InPlaceAccumulatorImpl. An optional do_update boolean, held in CPU memory, lets the operator skip accumulation during gradient accumulation steps. The supported mixed-precision type pairs are float/float, float/MLFloat16, MLFloat16/MLFloat16, MLFloat16/float, float/BFloat16, BFloat16/BFloat16, and BFloat16/float.
- InPlaceAccumulatorV2 extends accumulation with an overwrite mode that either replaces the buffer contents or adds to them. It casts the gradient when the two types differ (Impl_Cast) and outputs an updated flag in CPU memory.
Both accumulator variants alias their output to the input buffer for efficiency.
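The semantics described above can be illustrated with a small CPU-side reference. This is only a sketch under stated assumptions, not the CUDA implementation: the function names zero_gradient, accumulate_in_place, and accumulate_in_place_v2 are hypothetical, std::vector stands in for device tensors, static_cast stands in for the mixed-precision cast, and the V2 sketch assumes the updated flag is always true because both modes write to the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for ZeroGradient: set every element to zero.
template <typename T>
void zero_gradient(std::vector<T>& grad) {
  for (T& g : grad) g = T(0);
}

// Hypothetical CPU reference for InPlaceAccumulator.
// Assumption: when do_update is false the buffer is left untouched.
template <typename T, typename T_GRAD>
void accumulate_in_place(std::vector<T>& buffer,
                         const std::vector<T_GRAD>& grad,
                         bool do_update = true) {
  assert(buffer.size() == grad.size());
  if (!do_update) return;
  for (std::size_t i = 0; i < buffer.size(); ++i) {
    buffer[i] += static_cast<T>(grad[i]);  // cast T_GRAD to T, then add
  }
}

// Hypothetical CPU reference for InPlaceAccumulatorV2.
// overwrite == true replaces the buffer contents; false adds to them.
// Assumption: the returned "updated" flag is always true in this sketch.
template <typename T, typename T_GRAD>
bool accumulate_in_place_v2(std::vector<T>& buffer,
                            const std::vector<T_GRAD>& grad,
                            bool overwrite) {
  assert(buffer.size() == grad.size());
  for (std::size_t i = 0; i < buffer.size(); ++i) {
    const T value = static_cast<T>(grad[i]);
    buffer[i] = overwrite ? value : buffer[i] + value;
  }
  return true;
}
```

In this sketch the buffer type T and the gradient type T_GRAD are deduced independently, which mirrors the two template parameters of the accumulator kernels without claiming anything about how the real kernels perform the cast.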
Usage
Used during gradient accumulation in distributed training to aggregate gradients across micro-batches before applying the optimizer step.
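As a rough illustration of that workflow, the sketch below accumulates gradients from several micro-batches into one buffer and then applies a single update. Everything here is an assumption made for illustration: train_step is a hypothetical helper, the optimizer is plain SGD, and the accumulated gradient is averaged over the number of micro-batches. None of this is taken from the ONNX Runtime source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: one optimizer step over several micro-batches.
// weights is taken by value so the caller's copy is not modified.
std::vector<float> train_step(
    std::vector<float> weights,
    const std::vector<std::vector<float>>& micro_batch_grads,
    float learning_rate) {
  assert(!micro_batch_grads.empty());

  // Start from a zeroed buffer (the role ZeroGradient plays).
  std::vector<float> accum(weights.size(), 0.0f);

  // Add each micro-batch gradient into the same buffer
  // (the role InPlaceAccumulator plays).
  for (const auto& grad : micro_batch_grads) {
    assert(grad.size() == accum.size());
    for (std::size_t i = 0; i < accum.size(); ++i) {
      accum[i] += grad[i];
    }
  }

  // Assumed update rule: SGD on the averaged gradient.
  const float scale =
      learning_rate / static_cast<float>(micro_batch_grads.size());
  for (std::size_t i = 0; i < weights.size(); ++i) {
    weights[i] -= scale * accum[i];
  }
  return weights;
}
```

The point of the pattern is that the optimizer runs once per group of micro-batches rather than once per micro-batch, so the effective batch size grows without a matching increase in activation memory.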
Code Reference
Source Location
- Repository: Microsoft_Onnxruntime
- File: orttraining/orttraining/training_ops/cuda/optimizer/gradient_control.cc
- Lines: 1-157
Signature
template <typename T>
class ZeroGradient : public CudaKernel {
  Status ComputeInternal(OpKernelContext* ctx) const;
};
template <typename T, typename T_GRAD>
class InPlaceAccumulator : public CudaKernel {
  Status ComputeInternal(OpKernelContext* ctx) const;
};
template <typename T, typename T_GRAD>
class InPlaceAccumulatorV2 : public CudaKernel {
  Status ComputeInternal(OpKernelContext* ctx) const;
};
Import
#include "orttraining/training_ops/cuda/optimizer/gradient_control.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| old_gradient/left_addee | Tensor(T) | Yes | Existing gradient buffer (aliased to output) |
| right_addee | Tensor(T_GRAD) | Yes | Gradient to accumulate (InPlaceAccumulator only) |
| do_update/overwrite | Tensor(bool) | No | Control flag (CPU memory) |
Outputs
| Name | Type | Description |
|---|---|---|
| output | Tensor(T) | Zeroed or accumulated gradient (in-place) |
| updated_flag | Tensor(bool) | Whether update occurred (InPlaceAccumulatorV2 only, CPU) |
Usage Examples
// ZeroGradient: zeros gradient buffer in-place
REGISTER_ZERO_GRADIENT_TYPED(float)
// InPlaceAccumulator: accumulates gradients with optional update control
REGISTER_IN_PLACE_TENSOR_ACCUMULATOR_TYPED(float, MLFloat16)
// InPlaceAccumulatorV2: accumulates with overwrite mode support
REGISTER_IN_PLACE_TENSOR_ACCUMULATORV2_TYPED(float, float)