Implementation:Microsoft Onnxruntime CUDA GradientControl

Knowledge Sources	Microsoft_Onnxruntime
Domains	Training, CUDA_Kernels
Last Updated	2026-02-10 04:00 GMT

Overview

Concrete tool for gradient accumulation and zeroing operations in the ONNX Runtime CUDA training framework.

Description

Implements three gradient control operators for CUDA: (1) ZeroGradient zeros gradient tensors in place using cudaMemsetAsync and is registered for float and MLFloat16; (2) InPlaceAccumulator adds a gradient tensor to an existing buffer in place via InPlaceAccumulatorImpl, accepts an optional do_update boolean held in CPU memory that lets a step skip the accumulation, and supports the mixed-precision T/T_GRAD pairs float/float, float/MLFloat16, MLFloat16/MLFloat16, MLFloat16/float, float/BFloat16, BFloat16/BFloat16, and BFloat16/float; (3) InPlaceAccumulatorV2 extends accumulation with an overwrite mode that either replaces the buffer contents or adds to them, casts when the two types differ (Impl_Cast), and writes an updated flag to CPU memory. Both accumulator variants alias the output to the input buffer for efficiency.
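The semantics of the first two operators can be sketched on the CPU. This is a minimal reference sketch, not the ONNX Runtime kernels: the function names zero_gradient_ref and in_place_accumulate_ref are hypothetical, std::vector stands in for device tensors, and a float buffer with a double gradient stands in for a mixed-precision pair because standard C++ has no half type. It assumes that a false do_update leaves the buffer unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU stand-in for ZeroGradient: clear the buffer in place.
template <typename T>
void zero_gradient_ref(std::vector<T>& grad) {
  for (T& g : grad) g = T(0);
}

// Hypothetical CPU stand-in for InPlaceAccumulator.
// T is the buffer element type, T_GRAD the incoming gradient element type.
template <typename T, typename T_GRAD>
void in_place_accumulate_ref(std::vector<T>& buffer,
                             const std::vector<T_GRAD>& grad,
                             bool do_update = true) {
  // Assumption: when do_update is false the buffer is passed through untouched.
  if (!do_update) return;
  assert(buffer.size() == grad.size());
  for (std::size_t i = 0; i < buffer.size(); ++i) {
    // Cast the gradient to the buffer type before adding.
    buffer[i] += static_cast<T>(grad[i]);
  }
}
```

Because the buffer is modified through a reference, the sketch mirrors the in-place aliasing described above: no second output buffer is allocated.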

Usage

Used during gradient accumulation in distributed training to aggregate gradients across micro-batches before applying the optimizer step.
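As an illustration of where these operators sit in a training step, the sketch below accumulates several micro-batch gradients into one buffer, applies an update, and clears the buffer. It is a CPU-only sketch under stated assumptions: accumulate_and_step is a hypothetical helper, the update rule is plain SGD chosen only for brevity, and the real optimizer and graph wiring in ONNX Runtime are not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical training step: sum all micro-batch gradients, apply a plain
// SGD update (an assumption for illustration), then reset the buffer.
inline void accumulate_and_step(std::vector<float>& weights,
                                std::vector<float>& grad_buffer,
                                const std::vector<std::vector<float>>& micro_batch_grads,
                                float learning_rate) {
  assert(weights.size() == grad_buffer.size());
  for (const auto& g : micro_batch_grads) {
    assert(g.size() == grad_buffer.size());
    for (std::size_t i = 0; i < g.size(); ++i) {
      grad_buffer[i] += g[i];  // role played by InPlaceAccumulator
    }
  }
  for (std::size_t i = 0; i < weights.size(); ++i) {
    weights[i] -= learning_rate * grad_buffer[i];  // optimizer step
  }
  // Role played by ZeroGradient: clear the buffer for the next step.
  std::fill(grad_buffer.begin(), grad_buffer.end(), 0.0f);
}
```

The optimizer runs once per group of micro-batches rather than once per micro-batch, which is what allows a larger effective batch size without holding all samples in memory at once.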

Code Reference

Source Location

Repository: Microsoft_Onnxruntime
File: orttraining/orttraining/training_ops/cuda/optimizer/gradient_control.cc
Lines: 1-157

Signature

template <typename T>
class ZeroGradient : public CudaKernel {
  Status ComputeInternal(OpKernelContext* ctx) const;
};

template <typename T, typename T_GRAD>
class InPlaceAccumulator : public CudaKernel {
  Status ComputeInternal(OpKernelContext* ctx) const;
};

template <typename T, typename T_GRAD>
class InPlaceAccumulatorV2 : public CudaKernel {
  Status ComputeInternal(OpKernelContext* ctx) const;
};

Import

#include "orttraining/training_ops/cuda/optimizer/gradient_control.h"

I/O Contract

Inputs

Name	Type	Required	Description
old_gradient/left_addee	Tensor(T)	Yes	Existing gradient buffer (aliased to output)
right_addee	Tensor(T_GRAD)	Yes	Gradient to accumulate (accumulator operators only)
do_update/overwrite	Tensor(bool)	No	Control flag in CPU memory: do_update for InPlaceAccumulator, overwrite for InPlaceAccumulatorV2

Outputs

Name	Type	Description
output	Tensor(T)	Zeroed or accumulated gradient (in-place)
updated_flag	Tensor(bool)	Whether update occurred (InPlaceAccumulatorV2 only, CPU)
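The InPlaceAccumulatorV2 contract (buffer in, gradient in, optional overwrite flag, updated flag out) can be sketched as a CPU reference function. The name in_place_accumulate_v2_ref is hypothetical, std::vector stands in for device tensors, and the return value being true whenever the buffer was written is an assumption about what the updated flag reports.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU stand-in for InPlaceAccumulatorV2.
// Returns the "updated" flag; the buffer is modified in place.
template <typename T, typename T_GRAD>
bool in_place_accumulate_v2_ref(std::vector<T>& buffer,
                                const std::vector<T_GRAD>& grad,
                                bool overwrite = false) {
  assert(buffer.size() == grad.size());
  for (std::size_t i = 0; i < buffer.size(); ++i) {
    // Cast to the buffer type when T and T_GRAD differ.
    const T value = static_cast<T>(grad[i]);
    // overwrite == true replaces the element; otherwise it is added to.
    buffer[i] = overwrite ? value : buffer[i] + value;
  }
  // Assumption: the flag reports that the buffer was written.
  return true;
}
```

Overwrite mode lets the first micro-batch of a step initialize the buffer directly, so a separate zeroing pass is not needed before accumulation starts.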

Usage Examples

// ZeroGradient: zeros gradient buffer in-place
REGISTER_ZERO_GRADIENT_TYPED(float)

// InPlaceAccumulator: accumulates gradients with optional update control
REGISTER_IN_PLACE_TENSOR_ACCUMULATOR_TYPED(float, MLFloat16)

// InPlaceAccumulatorV2: accumulates with overwrite mode support
REGISTER_IN_PLACE_TENSOR_ACCUMULATORV2_TYPED(float, float)

Related Pages

Environment:Microsoft_Onnxruntime_CUDA_GPU_Environment
