Implementation:Microsoft Onnxruntime CUDA GradientControl

From Leeroopedia


Knowledge Sources
Domains Training, CUDA_Kernels
Last Updated 2026-02-10 04:00 GMT

Overview

Concrete tool for gradient accumulation and zeroing operations in the ONNX Runtime CUDA training framework.

Description

Implements three gradient control operators for CUDA:

(1) ZeroGradient zeros a gradient tensor in place using cudaMemsetAsync. It is registered for float and MLFloat16.

(2) InPlaceAccumulator adds a gradient tensor to an existing buffer in place via InPlaceAccumulatorImpl. An optional do_update boolean, held in CPU memory, lets a step skip the accumulation and leave the buffer unchanged. Supported mixed-precision pairs: float/float, float/MLFloat16, MLFloat16/MLFloat16, MLFloat16/float, float/BFloat16, BFloat16/BFloat16, BFloat16/float.

(3) InPlaceAccumulatorV2 extends accumulation with an overwrite mode that either replaces the buffer contents or adds to them, casts when the two types differ (Impl_Cast), and outputs an updated flag in CPU memory.

Both accumulator variants alias the output to the input buffer for efficiency.
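The semantics of the three operators can be sketched as a host-side reference model. This is an illustration only, not the CUDA implementation: the function names are hypothetical, double and float stand in for a mixed-precision pair, and the sketch assumes the V2 updated flag is set whenever the operator runs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference model; the real kernels run on the GPU.
// All function names here are hypothetical.

// Models ZeroGradient: every element of the buffer becomes zero.
template <typename T>
void zero_gradient_ref(std::vector<T>& grad) {
  for (auto& g : grad) g = T(0);
}

// Models InPlaceAccumulator: buffer[i] += grad[i], cast to the buffer type.
// When do_update is false the buffer is left unchanged.
template <typename T, typename T_GRAD>
void in_place_accumulate_ref(std::vector<T>& buffer,
                             const std::vector<T_GRAD>& grad,
                             bool do_update = true) {
  assert(buffer.size() == grad.size());
  if (!do_update) return;
  for (std::size_t i = 0; i < buffer.size(); ++i) {
    buffer[i] += static_cast<T>(grad[i]);
  }
}

// Models InPlaceAccumulatorV2: overwrite replaces the buffer contents,
// otherwise the gradient is added. Returns the "updated" flag
// (assumed here to be true whenever the operator runs).
template <typename T, typename T_GRAD>
bool in_place_accumulate_v2_ref(std::vector<T>& buffer,
                                const std::vector<T_GRAD>& grad,
                                bool overwrite) {
  assert(buffer.size() == grad.size());
  for (std::size_t i = 0; i < buffer.size(); ++i) {
    const T value = static_cast<T>(grad[i]);
    buffer[i] = overwrite ? value : buffer[i] + value;
  }
  return true;
}
```

Calling in_place_accumulate_ref with do_update set to false is a no-op on the buffer, which mirrors the skip behaviour described above.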

Usage

Used during gradient accumulation in distributed training to aggregate gradients across micro-batches before applying the optimizer step.
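A minimal host-side sketch of that pattern follows. It is not ONNX Runtime code: accumulate_micro_batches and sgd_step are hypothetical helpers, and the plain SGD update is an assumption standing in for whatever optimizer the training graph actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: sums the gradients of several micro-batches into one
// buffer. Zero-initialising the buffer plays the role of ZeroGradient and the
// inner loop plays the role of InPlaceAccumulator.
std::vector<float> accumulate_micro_batches(
    const std::vector<std::vector<float>>& micro_grads, std::size_t n) {
  std::vector<float> buffer(n, 0.0f);
  for (const auto& grad : micro_grads) {
    assert(grad.size() == n);
    for (std::size_t i = 0; i < n; ++i) {
      buffer[i] += grad[i];
    }
  }
  return buffer;
}

// Hypothetical optimizer step (plain SGD, assumed for illustration):
// applied once, after all micro-batches have been accumulated.
void sgd_step(std::vector<float>& weights,
              const std::vector<float>& accumulated_grad, float lr) {
  assert(weights.size() == accumulated_grad.size());
  for (std::size_t i = 0; i < weights.size(); ++i) {
    weights[i] -= lr * accumulated_grad[i];
  }
}
```

The point of the design is that the optimizer runs once per group of micro-batches rather than once per micro-batch, so the effective batch size grows without holding every micro-batch in memory at the same time.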

Code Reference

Source Location

Signature

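// Zeros a gradient tensor in place.
template <typename T>
class ZeroGradient : public CudaKernel {
 public:
  Status ComputeInternal(OpKernelContext* ctx) const override;
};

// Adds a gradient of type T_GRAD to a buffer of type T in place.
template <typename T, typename T_GRAD>
class InPlaceAccumulator : public CudaKernel {
 public:
  Status ComputeInternal(OpKernelContext* ctx) const override;
};

// Accumulator variant with an overwrite mode and an updated flag output.
template <typename T, typename T_GRAD>
class InPlaceAccumulatorV2 : public CudaKernel {
 public:
  Status ComputeInternal(OpKernelContext* ctx) const override;
};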

Import

#include "orttraining/training_ops/cuda/optimizer/gradient_control.h"

I/O Contract

Inputs

Name | Type | Required | Description
old_gradient/left_addee | Tensor(T) | Yes | Existing gradient buffer (aliased to output)
right_addee | Tensor(T_GRAD) | Yes | Gradient to accumulate (InPlaceAccumulator only)
do_update/overwrite | Tensor(bool) | No | Control flag (CPU memory)

Outputs

Name | Type | Description
output | Tensor(T) | Zeroed or accumulated gradient (in-place)
updated_flag | Tensor(bool) | Whether update occurred (InPlaceAccumulatorV2 only, CPU)
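The contract above (an optional control flag and an output that aliases the input buffer) can be illustrated with a small host-side sketch. The function accumulate_in_place is hypothetical; a null pointer is used here to model an omitted optional input, which is an assumption of this sketch rather than how the kernel reads its inputs.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical host-side model of the accumulator I/O contract.
// - buffer:    existing gradient buffer, also the output (in-place aliasing)
// - grad:      gradient to accumulate
// - do_update: optional control flag; nullptr models "input not provided"
// Returns the output pointer, which is the same memory as `buffer`.
float* accumulate_in_place(float* buffer, const float* grad, std::size_t n,
                           const bool* do_update) {
  assert(buffer != nullptr && grad != nullptr);
  // A provided flag set to false skips the accumulation entirely.
  if (do_update != nullptr && !*do_update) {
    return buffer;
  }
  for (std::size_t i = 0; i < n; ++i) {
    buffer[i] += grad[i];
  }
  return buffer;
}
```

Because the returned pointer equals the input pointer, a caller never needs a second buffer for the result.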

Usage Examples

// Registers the ZeroGradient kernel (zeros the gradient buffer in place) for float
REGISTER_ZERO_GRADIENT_TYPED(float)

// Registers the InPlaceAccumulator kernel (accumulation with optional do_update control)
// for the float/MLFloat16 type pair
REGISTER_IN_PLACE_TENSOR_ACCUMULATOR_TYPED(float, MLFloat16)

// Registers the InPlaceAccumulatorV2 kernel (accumulation with overwrite mode)
// for the float/float type pair
REGISTER_IN_PLACE_TENSOR_ACCUMULATORV2_TYPED(float, float)
