Implementation:Ggml org Ggml Sycl norm
| Knowledge Sources | |
|---|---|
| Domains | ML_Infrastructure, GPU_Compute |
| Last Updated | 2025-05-15 12:00 GMT |
Overview
SYCL normalization kernels implementing LayerNorm, RMSNorm, GroupNorm, L2Norm, and RMSNorm backward pass for transformer inference.
Description
norm.cpp implements the essential normalization operations used in transformer layers for the SYCL backend. The file provides five normalization variants:
- norm_f32 (LayerNorm): Computes per-row mean and variance using a two-pass approach (accumulate sum and sum-of-squares in a single pass via sycl::float2), then normalizes each element as (x - mean) * rsqrt(var + eps). Supports strided (non-contiguous) inputs via calculate_offset helpers.
- group_norm_f32: Computes statistics over groups of elements spanning multiple rows, used in convolutional architectures. Accumulates group mean and variance, then normalizes.
- rms_norm_f32 (RMSNorm): Computes root-mean-square normalization without mean subtraction: x * rsqrt(mean(x^2) + eps). This is the normalization used by LLaMA and most modern LLMs.
- rms_norm_back_f32: Gradient computation for RMSNorm backward pass, used during fine-tuning.
- l2_norm_f32: L2 normalization that divides each element by the L2 norm of the row, with epsilon for numerical stability.
All kernels use warp-level reductions (warp_reduce_sum) for efficient parallel reduction within subgroups, with optional shared memory cross-warp reduction for work-groups larger than one warp. Block sizes are dynamically selected from the presets module based on the number of columns.
Usage
Called from the main SYCL backend when the compute graph contains normalization operations (GGML_OP_NORM, GGML_OP_RMS_NORM, GGML_OP_GROUP_NORM, GGML_OP_L2_NORM, GGML_OP_RMS_NORM_BACK).
Code Reference
Source Location
- Repository: GGML
- File: src/ggml-sycl/norm.cpp
- Lines: 654
Signatures
// Core kernel functions
static void norm_f32(const float* x, float* dst, const int ncols,
const int64_t stride_row, const int64_t stride_channel, const int64_t stride_sample,
const float eps, const sycl::nd_item<3>& item_ct1, sycl::float2* s_sum, int block_size);
static void group_norm_f32(const float* x, float* dst, const int group_size,
const int ne_elements, const float eps,
const sycl::nd_item<3>& item_ct1, float* s_sum, int block_size);
static void rms_norm_f32(const float* x, float* dst, const int ncols, const float eps, ...);
// Public dispatch functions
void ggml_sycl_op_norm(ggml_backend_sycl_context& ctx, ggml_tensor* dst);
void ggml_sycl_op_group_norm(ggml_backend_sycl_context& ctx, ggml_tensor* dst);
void ggml_sycl_op_rms_norm(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_op_rms_norm_back(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_op_l2_norm(ggml_backend_sycl_context& ctx, ggml_tensor* dst);
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| ctx | ggml_backend_sycl_context & | Yes | SYCL backend context providing the device queue |
| dst | ggml_tensor * | Yes | Destination tensor; source accessed via dst->src[0], epsilon from dst->op_params |
Outputs
| Name | Type | Description |
|---|---|---|
| dst->data | void * | Normalized output tensor (f32) |
Usage Examples
// RMSNorm for transformer layer normalization:
ggml_sycl_op_rms_norm(sycl_ctx, rms_norm_output);
// LayerNorm:
ggml_sycl_op_norm(sycl_ctx, norm_output);
// Group normalization for convolutional layers:
ggml_sycl_op_group_norm(sycl_ctx, group_norm_output);