Principle:Ggml org Ggml Vectorized Math Operations
| Attribute | Value |
|---|---|
| Page Type | Principle |
| Full Name | Ggml_org_Ggml_Vectorized_Math_Operations |
| Short Name | Vectorized_Math_Operations |
| Domain Tags | SIMD, Math, Performance |
| Knowledge Source | GGML |
| Last Updated | 2026-02-10 |
Overview
SIMD-optimized implementations of fundamental vector operations -- dot products, element-wise arithmetic, activation functions, and softmax -- that form the computational building blocks for all higher-level CPU tensor operations.
Description
Vectorized Math Operations is the principle of implementing the lowest-level numerical routines (dot products, element-wise add/sub/mul/div, activation functions, and statistical reductions) using explicit SIMD intrinsics for maximum throughput on CPU hardware. These functions, defined in vec.h and vec.cpp, are the innermost computational primitives that all higher-level CPU tensor operations ultimately invoke.
The vectorized operations span several categories:
Dot products are the most performance-critical primitives, as they form the inner loop of matrix multiplication:
- ggml_vec_dot_f32 -- float32 dot product with AVX2-accelerated 8-wide accumulation
- ggml_vec_dot_f16 -- float16 dot product with FP16-to-FP32 conversion in the loop
- ggml_vec_dot_bf16 -- bfloat16 dot product
Element-wise arithmetic provides vectorized binary operations:
- ggml_vec_add_f32 -- uses _mm256_add_ps on AVX2 for 8-wide addition
- ggml_vec_sub_f32, ggml_vec_mul_f32, ggml_vec_div_f32 -- similarly vectorized
- ggml_vec_mad_f32 -- multiply-and-add with GGML_VEC_MAD_UNROLL factor of 32 for deep loop unrolling
- FP16 variants (ggml_vec_add_f16) that convert to FP32, compute, and convert back
Activation functions use lookup-table acceleration for FP16 inputs:
- ggml_vec_silu_f32 -- SiLU (Swish) activation: x * sigmoid(x)
- Precomputed 64K-entry lookup tables (ggml_table_gelu_f16, ggml_table_gelu_quick_f16) for GELU and Quick GELU, providing O(1) evaluation per element
Statistical operations implement numerically stable reductions:
- ggml_vec_soft_max_f32 -- softmax with max-subtraction for numerical stability
- ggml_vec_log_soft_max_f32 -- log-softmax combining log and softmax for reduced overflow risk
- ggml_vec_cvar_f32 -- centered variance computation that simultaneously centers the data
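The centered variance idea can be illustrated with a minimal scalar sketch. The function name, argument order, and return convention below are assumptions made for illustration and are not taken from vec.h; the point is only that one pass can both write the centered values and accumulate their squared sum in double precision.

```c
#include <stddef.h>

/* Hypothetical helper (not the GGML signature): writes y[i] = x[i] - mean
 * and returns the population variance of x about the supplied mean. */
double cvar_f32_sketch(size_t n, float *y, const float *x, float mean) {
    double sum_sq = 0.0;                  /* double accumulator, in the spirit of ggml_float */
    for (size_t i = 0; i < n; ++i) {
        float d = x[i] - mean;            /* center the element */
        y[i] = d;                         /* store the centered value */
        sum_sq += (double)d * (double)d;  /* accumulate the squared deviation */
    }
    return n > 0 ? sum_sq / (double)n : 0.0;
}
```

Fusing the two steps means a layer-norm style caller reads the input only once before scaling the centered values.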
Type set/copy utilities provide efficient bulk initialization and copying for all supported element types (i8, i16, i32, f16, bf16, f32).
Usage
Vectorized math operations are used as building blocks across the entire CPU backend:
- Matrix multiplication inner loop: ggml_vec_dot_f32 and its quantized variants are the innermost loop of ggml_compute_forward_mul_mat, executing billions of times per inference.
- Normalization layers: RMS norm and layer norm use ggml_vec_dot_f32 (for sum-of-squares), ggml_vec_scale_f32, and element-wise multiply to compute normalized outputs.
- Activation layers: SiLU, GELU, and other activations call the vectorized activation functions for each tensor element.
- Softmax computation: The attention mechanism's softmax step uses ggml_vec_soft_max_f32 with GGML_SOFT_MAX_UNROLL factor of 4 for loop unrolling.
- Training backward passes: Gradient computation reuses the same vectorized primitives (add, multiply, scale) for backward-pass accumulations.
Theoretical Basis
Loop Vectorization with Intrinsics
While compilers can auto-vectorize simple loops, explicit SIMD intrinsics guarantee vectorization regardless of compiler optimization level or loop complexity. GGML's vectorized operations use a two-phase pattern: a vectorized loop body that processes N elements per iteration (where N matches the SIMD register width -- 8 for AVX2 float32), followed by a scalar cleanup loop for remaining elements. For example, ggml_vec_add_f32 processes 8 floats per iteration with _mm256_add_ps on AVX2, then handles the tail element-by-element.
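The two-phase pattern can be sketched as follows. This is an illustrative version with a hypothetical function name, not a copy of the GGML source; the intrinsic path is compiled only when the compiler defines __AVX2__, and otherwise the scalar loop handles every element.

```c
#include <stddef.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* z[i] = x[i] + y[i] for i in [0, n). Hypothetical name for illustration. */
void vec_add_f32_sketch(size_t n, float *z, const float *x, const float *y) {
    size_t i = 0;
#if defined(__AVX2__)
    /* Phase 1: process 8 floats per iteration in 256-bit registers. */
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);   /* unaligned load of 8 floats */
        __m256 vy = _mm256_loadu_ps(y + i);
        _mm256_storeu_ps(z + i, _mm256_add_ps(vx, vy));
    }
#endif
    /* Phase 2: scalar cleanup for the remaining n % 8 elements
     * (or for all elements when the AVX2 path is not compiled in). */
    for (; i < n; ++i) {
        z[i] = x[i] + y[i];
    }
}
```

Because both phases share the index i, the cleanup loop picks up exactly where the vector loop stopped, so any length n is handled correctly.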
Lookup Table Acceleration
For expensive transcendental functions (GELU, Quick GELU), GGML precomputes the function output for every possible FP16 input value: 65536 entries of 2 bytes each, or 128 KB of memory per table. Since model activations are often computed in or converted through FP16 precision, a single table lookup per element yields a result at FP16 precision, which is significantly faster than evaluating the transcendental function directly for each element. The tables are computed once at initialization time.
Unrolling for Pipeline Utilization
The unroll factors (GGML_VEC_DOT_UNROLL = 2, GGML_VEC_MAD_UNROLL = 32, GGML_SOFT_MAX_UNROLL = 4) are tuned to keep the CPU's execution pipeline full. Modern out-of-order processors can execute multiple independent SIMD instructions in parallel, but only if there are enough independent instructions in flight. Unrolling the loop body by a factor of 2-32 provides multiple independent accumulator variables, breaking the data dependency chain that would otherwise serialize the additions. The optimal unroll factor depends on the number of physical SIMD execution units and the latency of the accumulation operation.
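The effect of independent accumulators can be shown with a scalar sketch. The unroll factor of 4 and the function name are illustrative assumptions, not GGML's constants or code; a SIMD version would hold a vector register in each accumulator instead of a single float.

```c
#include <stddef.h>

/* Dot product with four independent partial sums (illustrative unroll of 4). */
float dot_f32_unrolled_sketch(size_t n, const float *x, const float *y) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;
    /* Each accumulator depends only on its own previous value, so the four
     * multiply-adds in one iteration have no dependency on each other. */
    for (; i + 4 <= n; i += 4) {
        acc0 += x[i + 0] * y[i + 0];
        acc1 += x[i + 1] * y[i + 1];
        acc2 += x[i + 2] * y[i + 2];
        acc3 += x[i + 3] * y[i + 3];
    }
    /* Combine the partial sums, then handle the leftover elements. */
    float sum = (acc0 + acc1) + (acc2 + acc3);
    for (; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Note that splitting the sum changes the order of floating-point additions, so the result can differ in the last bits from a strictly sequential loop.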
Numerical Stability in Reductions
The softmax implementations use the standard max-subtraction trick: before exponentiating, the maximum value is subtracted from all elements, ensuring that the largest exponent is 0 and preventing floating-point overflow. The ggml_float type (a typedef for double) is used for accumulation in critical reductions to minimize rounding error when summing many small values, which matters when computing variance or normalizing over long sequences.