Implementation:Microsoft Onnxruntime CUDA SliceGrad

From Leeroopedia
Revision as of 15:45, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Microsoft_Onnxruntime_CUDA_SliceGrad.md)


Knowledge Sources
Domains: Training, CUDA_Kernels
Last Updated: 2026-02-10 04:00 GMT

Overview

CUDA kernel that computes the gradient of the Slice operator in the ONNX Runtime training framework.

Description

Implements the SliceGrad operator for CUDA, which routes the upstream gradient back into a tensor with the full input shape of the original Slice operation. The slice parameters (starts, ends, axes, steps) are read from CPU-memory inputs. The GetSlicedOrUnslicedTensor method creates the output tensor dX with the original data shape. The output is first zero-initialized, and SliceImplGrad then scatters the upstream gradient dY (which has the shape of the sliced region) into the positions that the forward Slice read from. The computation therefore reverses the assignment direction of the standard Slice: the forward operator copies input[slice] into its output, whereas the gradient writes dY into output[slice] and leaves every other element at zero.
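The zero-then-scatter behaviour described above can be sketched on the CPU as follows. This is a minimal illustration, not the ONNX Runtime kernel: the function name SliceGradReference is hypothetical, and it assumes row-major float tensors whose starts and steps have already been normalized to one entry per dimension (non-negative starts, non-zero steps).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference for the SliceGrad semantics (not ORT code).
// dy       : upstream gradient, row-major, with shape dy_shape
// x_shape  : shape of the original Slice input (and of the returned dX)
// starts   : per-dimension start index, already normalized (assumption)
// steps    : per-dimension step, already normalized (assumption)
std::vector<float> SliceGradReference(const std::vector<float>& dy,
                                      const std::vector<int64_t>& dy_shape,
                                      const std::vector<int64_t>& x_shape,
                                      const std::vector<int64_t>& starts,
                                      const std::vector<int64_t>& steps) {
  const size_t rank = x_shape.size();

  // Row-major strides of the full (unsliced) tensor.
  std::vector<int64_t> x_strides(rank, 1);
  for (size_t d = rank; d-- > 1;) {
    x_strides[d - 1] = x_strides[d] * x_shape[d];
  }

  int64_t x_size = 1;
  for (int64_t dim : x_shape) x_size *= dim;

  // Step 1: zero-initialize dX.
  std::vector<float> dx(static_cast<size_t>(x_size), 0.0f);

  // Step 2: scatter each dY element to the position Slice read it from.
  for (size_t i = 0; i < dy.size(); ++i) {
    int64_t remaining = static_cast<int64_t>(i);
    int64_t x_offset = 0;
    for (size_t d = rank; d-- > 0;) {
      const int64_t coord = remaining % dy_shape[d];  // coordinate in dY
      remaining /= dy_shape[d];
      x_offset += (starts[d] + coord * steps[d]) * x_strides[d];
    }
    dx[static_cast<size_t>(x_offset)] = dy[i];
  }
  return dx;
}
```

For a 3x4 input sliced as rows 0:3:2 and columns 1:3, a 2x2 upstream gradient lands at flat offsets 1, 2, 9 and 10 of dX, and the remaining eight elements stay zero.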

Usage

Invoked during the backward pass when the model uses Slice operations.

Code Reference

Source Location

Signature

class SliceGrad : public CudaKernel {
  // Creates the output tensor (dX) with the original, unsliced data shape.
  const Tensor* GetSlicedOrUnslicedTensor(OpKernelContext* ctx) const;
  // Reads starts, ends, axes, and steps from the CPU-memory inputs.
  Status FillInputVectors(OpKernelContext* ctx, TensorShapeVector& input_starts,
                          TensorShapeVector& input_ends, TensorShapeVector& input_axes,
                          TensorShapeVector& input_steps) const;
  // Zero-initializes the output, then scatters dY into it via SliceImplGrad.
  Status CallSliceImp(size_t element_size, size_t dimension_count,
                      const TArray<int64_t>& starts_buffer, const TArray<int64_t>& steps_buffer,
                      const TArray<int64_t>& input_strides, const TArray<fast_divmod>& output_strides,
                      OpKernelContext* ctx, const TensorShape& output_shape) const;
};

Import

#include "orttraining/training_ops/cuda/tensor/slice_grad.h"

I/O Contract

Inputs

Name    Type             Required  Description
dY      Tensor(T)        Yes       Upstream gradient (shape of the sliced region)
shape   Tensor(int64_t)  Yes       Original data shape (CPU memory)
starts  Tensor(Tind)     Yes       Slice start indices (CPU memory)
ends    Tensor(Tind)     Yes       Slice end indices (CPU memory)
axes    Tensor(Tind)     No        Axes to slice (CPU memory)
steps   Tensor(Tind)     No        Step sizes (CPU memory)

Outputs

Name  Type       Description
dX    Tensor(T)  Gradient with respect to the full input (zero-initialized, then the sliced region is filled)
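The relationship between the shape of dY and the shape, starts, ends, and steps inputs can be illustrated with a small helper. It is a sketch under the assumption that SliceGrad follows the standard ONNX Slice rules for negative indices and clamping; the name SlicedExtent is hypothetical and does not appear in ONNX Runtime. The helper returns how many elements a slice selects along one axis, which is the extent dY is expected to have on that axis.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper (not ORT code): number of elements selected along one
// axis of size `dim` by start/end/step, following ONNX Slice conventions.
// Assumes step != 0 and step != INT64_MIN.
int64_t SlicedExtent(int64_t dim, int64_t start, int64_t end, int64_t step) {
  assert(step != 0);
  if (dim == 0) return 0;

  // Negative indices count from the end of the axis.
  if (start < 0) start += dim;
  if (end < 0) end += dim;

  // Clamp into the valid range, which depends on the step direction.
  if (step > 0) {
    start = std::clamp<int64_t>(start, 0, dim);
    end = std::clamp<int64_t>(end, 0, dim);
  } else {
    start = std::clamp<int64_t>(start, 0, dim - 1);
    end = std::clamp<int64_t>(end, -1, dim - 1);
  }

  const int64_t span = end - start;
  // An empty slice results when the span is zero or points against the step.
  if (span == 0 || (span > 0) != (step > 0)) return 0;

  // ceil(|span| / |step|)
  const int64_t n = std::abs(span);
  const int64_t s = std::abs(step);
  return (n + s - 1) / s;
}
```

For example, on an axis of size 10, start 1, end 8, and step 3 select indices 1, 4, and 7, so dY would have extent 3 on that axis while dX keeps extent 10.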

Usage Examples

ONNX_OPERATOR_KERNEL_EX(SliceGrad, kMSDomain, 1, kCudaExecutionProvider,
    (*KernelDefBuilder::Create())
        .InputMemoryType(OrtMemTypeCPUInput, 1)  // shape
        .InputMemoryType(OrtMemTypeCPUInput, 2)  // starts
        .InputMemoryType(OrtMemTypeCPUInput, 3)  // ends
        .InputMemoryType(OrtMemTypeCPUInput, 4)  // axes
        .InputMemoryType(OrtMemTypeCPUInput, 5)  // steps
        .TypeConstraint("T", DataTypeImpl::AllFixedSizeTensorTypes()),
    SliceGrad);

Related Pages
