Implementation:Ggml org Ggml Sycl element wise
| Knowledge Sources | |
|---|---|
| Domains | ML_Infrastructure, GPU_Compute |
| Last Updated | 2025-05-15 12:00 GMT |
Overview
SYCL element-wise operation kernels implementing the full set of activation functions, unary operations, GLU variants, and miscellaneous per-element transformations for neural network inference.
Description
element_wise.cpp is one of the largest kernel files in the SYCL backend, implementing the broad family of element-wise operations required for transformer inference. The architecture uses templated operation functions dispatched through generic kernel launchers:
- Activation functions: op_gelu (GELU with tanh approximation), op_gelu_erf (exact GELU using error function), op_gelu_quick, op_silu (SiLU/Swish), op_relu, op_sigmoid, op_hardsigmoid, op_hardswish, op_softplus, op_tanh, op_elu, op_leaky_relu
- Unary math operations: op_sgn, op_abs, op_neg, op_step, op_sqrt, op_sqr, op_sin, op_cos, op_exp, op_log, op_floor, op_ceil, op_round, op_trunc
- GLU (Gated Linear Unit) variants: geglu, geglu_erf, geglu_quick, reglu, swiglu, swiglu_oai -- each combining a gate activation with element-wise multiplication
- Other operations: acc_f32 (accumulate with offset), upscale, clamp, arange
The implementation uses two key macros -- SYCL_GLOBAL_ID_LOOP for work-item iteration and SYCL_LOCAL_ID_CALC for index computation -- and dispatches operations through sycl::parallel_for with configurable block sizes from the presets module.
Usage
Called from the main SYCL backend dispatcher for all unary and element-wise operations in the compute graph. Each public function (e.g., ggml_sycl_gelu, ggml_sycl_silu) extracts tensor data pointers, determines launch configuration, and invokes the appropriate templated kernel.
Code Reference
Source Location
- Repository: GGML
- File: src/ggml-sycl/element_wise.cpp
- Lines: 1216
Signatures
// Representative activation function templates
template<typename T> static __dpct_inline__ T op_gelu(T x);
template<typename T> static __dpct_inline__ T op_silu(T x);
template<typename T> static __dpct_inline__ T op_relu(T x);
template<typename T> static __dpct_inline__ T op_sigmoid(T x);
// Public entry points (selected subset)
void ggml_sycl_gelu(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_silu(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_relu(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_sigmoid(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_tanh(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_exp(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_log(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_sqrt(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_sqr(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_acc(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_upscale(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_clamp(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_arange(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
// GLU variants
void ggml_sycl_geglu(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_reglu(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_swiglu(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| ctx | ggml_backend_sycl_context & | Yes | SYCL backend context providing the device queue |
| dst | ggml_tensor * | Yes | Destination tensor; source data accessed via dst->src[0] (and dst->src[1] for binary ops) |
Outputs
| Name | Type | Description |
|---|---|---|
| dst->data | void * | Result tensor with the element-wise operation applied |
Usage Examples
// Applying GELU activation to a tensor:
ggml_sycl_gelu(sycl_ctx, gelu_output_tensor);
// Applying SiLU (Swish) activation:
ggml_sycl_silu(sycl_ctx, silu_output_tensor);
// Clamping values to a range:
ggml_sycl_clamp(sycl_ctx, clamp_output_tensor);