Implementation:Microsoft Onnxruntime CUDA SliceGrad
| Knowledge Sources | |
|---|---|
| Domains | Training, CUDA_Kernels |
| Last Updated | 2026-02-10 04:00 GMT |
Overview
CUDA kernel that computes the gradient of the Slice operator in the ONNX Runtime training framework.
Description
Implements the SliceGrad operator for CUDA, which routes the upstream gradient back into a tensor with the full input shape of the original Slice operation. The gradient logic mirrors the forward Slice with the assignment direction reversed: instead of copying from the input into the sliced output, it copies from the upstream gradient into the corresponding region of the output gradient. The GetSlicedOrUnslicedTensor method creates the output tensor with the original data shape. That tensor is zero-initialized, and SliceImplGrad then scatters the upstream gradient into the sliced positions, so every position outside the slice keeps a zero gradient. The slice parameters (starts, ends, axes, steps) are read from inputs held in CPU memory.
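The zero-then-scatter behavior described above can be illustrated with a small CPU reference. This is only a sketch under stated assumptions, not the ONNX Runtime kernel: the function name `SliceGradReference` and its parameter layout are hypothetical, the element type is fixed to `float`, and `starts` and `steps` are assumed to hold one already-normalized entry per dimension.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference for the SliceGrad semantics (not ORT code).
// dY          : upstream gradient, row-major, with shape dy_shape
// data_shape  : shape of the original (unsliced) input
// starts/steps: one normalized entry per dimension (assumption of this sketch)
std::vector<float> SliceGradReference(const std::vector<float>& dY,
                                      const std::vector<int64_t>& dy_shape,
                                      const std::vector<int64_t>& data_shape,
                                      const std::vector<int64_t>& starts,
                                      const std::vector<int64_t>& steps) {
  const std::size_t rank = data_shape.size();

  // Row-major strides of the full (unsliced) tensor.
  std::vector<int64_t> data_strides(rank, 1);
  for (std::size_t d = rank; d-- > 1;) {
    data_strides[d - 1] = data_strides[d] * data_shape[d];
  }

  int64_t data_size = 1;
  for (int64_t dim : data_shape) data_size *= dim;

  // Step 1: zero-initialize the output gradient dX.
  std::vector<float> dX(static_cast<std::size_t>(data_size), 0.0f);

  // Step 2: scatter every dY element to its position in dX.
  std::vector<int64_t> idx(rank, 0);  // multi-index into dY
  for (std::size_t flat = 0; flat < dY.size(); ++flat) {
    int64_t offset = 0;
    for (std::size_t d = 0; d < rank; ++d) {
      offset += (starts[d] + idx[d] * steps[d]) * data_strides[d];
    }
    dX[static_cast<std::size_t>(offset)] = dY[flat];

    // Advance the multi-index in row-major order (last dimension fastest).
    for (std::size_t d = rank; d-- > 0;) {
      if (++idx[d] < dy_shape[d]) break;
      idx[d] = 0;
    }
  }
  return dX;
}
```

For example, with a 2x4 input sliced as `[0:2, 1:4:2]`, the upstream gradient has shape 2x2. Calling `SliceGradReference({1, 2, 3, 4}, {2, 2}, {2, 4}, {0, 1}, {1, 2})` yields `{0, 1, 0, 2, 0, 3, 0, 4}`: the four gradient values land in columns 1 and 3 of each row, and all other entries stay zero.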
Usage
Invoked during the backward pass when the model uses Slice operations.
Code Reference
Source Location
- Repository: Microsoft_Onnxruntime
- File: orttraining/orttraining/training_ops/cuda/tensor/slice_grad.cc
- Lines: 1-69
Signature
class SliceGrad : public CudaKernel {
  const Tensor* GetSlicedOrUnslicedTensor(OpKernelContext* ctx) const;
  Status FillInputVectors(OpKernelContext* ctx, TensorShapeVector& input_starts,
                          TensorShapeVector& input_ends, TensorShapeVector& input_axes,
                          TensorShapeVector& input_steps) const;
  Status CallSliceImp(size_t element_size, size_t dimension_count,
                      const TArray<int64_t>& starts_buffer, const TArray<int64_t>& steps_buffer,
                      const TArray<int64_t>& input_strides, const TArray<fast_divmod>& output_strides,
                      OpKernelContext* ctx, const TensorShape& output_shape) const;
};
Import
#include "orttraining/training_ops/cuda/tensor/slice_grad.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dY | Tensor(T) | Yes | Upstream gradient (sliced region shape) |
| shape | Tensor(int64_t) | Yes | Original data shape (CPU memory) |
| starts | Tensor(Tind) | Yes | Slice start indices (CPU memory) |
| ends | Tensor(Tind) | Yes | Slice end indices (CPU memory) |
| axes | Tensor(Tind) | No | Axes to slice (CPU memory) |
| steps | Tensor(Tind) | No | Step sizes (CPU memory) |
Outputs
| Name | Type | Description |
|---|---|---|
| dX | Tensor(T) | Gradient with respect to the full input: has the shape given by `shape`, is zero-initialized, and has the sliced region filled from dY |
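The relationship between `shape`, the slice parameters, and the extent of dY along each axis can be sketched as follows. The helper follows the clamping rules documented for the ONNX Slice operator. It is an illustrative assumption and not the ONNX Runtime implementation, and the names `AxisSlice` and `NormalizeAxisSlice` are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>

// Result of normalizing one axis of a slice (hypothetical type).
struct AxisSlice {
  int64_t start;   // normalized first index into the original axis
  int64_t length;  // number of selected elements, i.e. the dY extent on this axis
};

// Normalizes (start, end, step) for an axis of size `dim`, following the
// documented ONNX Slice rules. Assumes step != 0.
AxisSlice NormalizeAxisSlice(int64_t dim, int64_t start, int64_t end, int64_t step) {
  if (dim == 0) return {0, 0};  // empty axis: nothing to select

  // Negative indices count from the end of the axis.
  if (start < 0) start += dim;
  if (end < 0) end += dim;

  if (step > 0) {
    // Forward slicing: clamp both bounds to [0, dim].
    start = std::max<int64_t>(0, std::min<int64_t>(start, dim));
    end = std::max<int64_t>(0, std::min<int64_t>(end, dim));
  } else {
    // Backward slicing: start in [0, dim - 1], end in [-1, dim - 1].
    start = std::max<int64_t>(0, std::min<int64_t>(start, dim - 1));
    end = std::max<int64_t>(-1, std::min<int64_t>(end, dim - 1));
  }

  // Equals ceil((end - start) / step) when the quotient is positive;
  // any non-positive result means the slice is empty.
  const int64_t span = end - start;
  const int64_t length = span / step + (span % step != 0 ? 1 : 0);
  return {start, std::max<int64_t>(length, 0)};
}
```

For an axis of size 4, `NormalizeAxisSlice(4, 1, 4, 2)` gives `start = 1` and `length = 2`, so dY has extent 2 on that axis while dX keeps extent 4. Axes that are not listed in `axes` are left unsliced, so their dY and dX extents are equal.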
Usage Examples
ONNX_OPERATOR_KERNEL_EX(SliceGrad, kMSDomain, 1, kCudaExecutionProvider,
                        (*KernelDefBuilder::Create())
                            .InputMemoryType(OrtMemTypeCPUInput, 1)  // shape
                            .InputMemoryType(OrtMemTypeCPUInput, 2)  // starts
                            .InputMemoryType(OrtMemTypeCPUInput, 3)  // ends
                            .InputMemoryType(OrtMemTypeCPUInput, 4)  // axes
                            .InputMemoryType(OrtMemTypeCPUInput, 5)  // steps
                            .TypeConstraint("T", DataTypeImpl::AllFixedSizeTensorTypes()),
                        SliceGrad);