Implementation:Ggml org Ggml Sycl norm

Knowledge Sources	GGML
Domains	ML_Infrastructure, GPU_Compute
Last Updated	2025-05-15 12:00 GMT

Overview

SYCL normalization kernels implementing LayerNorm, RMSNorm, GroupNorm, L2Norm, and RMSNorm backward pass for transformer inference.

Description

norm.cpp implements the essential normalization operations used in transformer layers for the SYCL backend. The file provides five normalization variants:

norm_f32 (LayerNorm): Computes per-row mean and variance using a two-pass approach (accumulate sum and sum-of-squares in a single pass via sycl::float2), then normalizes each element as (x - mean) * rsqrt(var + eps). Supports strided (non-contiguous) inputs via calculate_offset helpers.

group_norm_f32: Computes statistics over groups of elements spanning multiple rows, used in convolutional architectures. Accumulates group mean and variance, then normalizes.

rms_norm_f32 (RMSNorm): Computes root-mean-square normalization without mean subtraction: x * rsqrt(mean(x^2) + eps). This is the normalization used by LLaMA and most modern LLMs.

rms_norm_back_f32: Gradient computation for RMSNorm backward pass, used during fine-tuning.

l2_norm_f32: L2 normalization that divides each element by the L2 norm of the row, with epsilon for numerical stability.

All kernels use warp-level reductions (warp_reduce_sum) for efficient parallel reduction within subgroups, with optional shared memory cross-warp reduction for work-groups larger than one warp. Block sizes are dynamically selected from the presets module based on the number of columns.

Usage

Called from the main SYCL backend when the compute graph contains normalization operations (GGML_OP_NORM, GGML_OP_RMS_NORM, GGML_OP_GROUP_NORM, GGML_OP_L2_NORM, GGML_OP_RMS_NORM_BACK).

Code Reference

Source Location

Repository: GGML
File: src/ggml-sycl/norm.cpp
Lines: 654

Signatures

// Core kernel functions
static void norm_f32(const float* x, float* dst, const int ncols,
    const int64_t stride_row, const int64_t stride_channel, const int64_t stride_sample,
    const float eps, const sycl::nd_item<3>& item_ct1, sycl::float2* s_sum, int block_size);

static void group_norm_f32(const float* x, float* dst, const int group_size,
    const int ne_elements, const float eps,
    const sycl::nd_item<3>& item_ct1, float* s_sum, int block_size);

static void rms_norm_f32(const float* x, float* dst, const int ncols, const float eps, ...);

// Public dispatch functions
void ggml_sycl_op_norm(ggml_backend_sycl_context& ctx, ggml_tensor* dst);
void ggml_sycl_op_group_norm(ggml_backend_sycl_context& ctx, ggml_tensor* dst);
void ggml_sycl_op_rms_norm(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_op_rms_norm_back(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
void ggml_sycl_op_l2_norm(ggml_backend_sycl_context& ctx, ggml_tensor* dst);

I/O Contract

Inputs

Name	Type	Required	Description
ctx	ggml_backend_sycl_context &	Yes	SYCL backend context providing the device queue
dst	ggml_tensor *	Yes	Destination tensor; source accessed via dst->src[0], epsilon from dst->op_params

Outputs

Name	Type	Description
dst->data	void *	Normalized output tensor (f32)

Usage Examples

// RMSNorm for transformer layer normalization:
ggml_sycl_op_rms_norm(sycl_ctx, rms_norm_output);

// LayerNorm:
ggml_sycl_op_norm(sycl_ctx, norm_output);

// Group normalization for convolutional layers:
ggml_sycl_op_group_norm(sycl_ctx, group_norm_output);

Related Pages

Implements Principle

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment