
Principle:Ggml org Ggml Vectorized Math Operations

From Leeroopedia


Attribute          Value
Page Type          Principle
Full Name          Ggml_org_Ggml_Vectorized_Math_Operations
Short Name         Vectorized_Math_Operations
Domain Tags        SIMD, Math, Performance
Knowledge Source   GGML
Last Updated       2026-02-10

Overview

SIMD-optimized implementations of fundamental vector operations -- dot products, element-wise arithmetic, activation functions, and softmax -- that form the computational building blocks for all higher-level CPU tensor operations.

Description

Vectorized Math Operations is the principle of implementing the lowest-level numerical routines (dot products, element-wise add/sub/mul/div, activation functions, and statistical reductions) using explicit SIMD intrinsics for maximum throughput on CPU hardware. These functions, defined in vec.h and vec.cpp, are the innermost computational primitives that all higher-level CPU tensor operations ultimately invoke.

The vectorized operations span several categories:

Dot products are the most performance-critical primitives, as they form the inner loop of matrix multiplication:

  • ggml_vec_dot_f32 -- float32 dot product with AVX2-accelerated 8-wide accumulation
  • ggml_vec_dot_f16 -- float16 dot product with FP16-to-FP32 conversion in the loop
  • ggml_vec_dot_bf16 -- bfloat16 dot product

Element-wise arithmetic provides vectorized binary operations:

  • ggml_vec_add_f32 -- uses _mm256_add_ps on AVX2 for 8-wide addition
  • ggml_vec_sub_f32, ggml_vec_mul_f32, ggml_vec_div_f32 -- similarly vectorized
  • ggml_vec_mad_f32 -- multiply-and-add (y += x * v), with a GGML_VEC_MAD_UNROLL factor of 32 for deep loop unrolling
  • FP16 variants (ggml_vec_add_f16) that convert to FP32, compute, and convert back

Activation functions are either vectorized directly or, for FP16 inputs, accelerated with lookup tables:

  • ggml_vec_silu_f32 -- SiLU (Swish) activation: x * sigmoid(x)
  • Precomputed 64K-entry lookup tables (ggml_table_gelu_f16, ggml_table_gelu_quick_f16) for GELU and Quick GELU, which replace each function evaluation with a single table read per element

Statistical operations implement numerically stable reductions:

  • ggml_vec_soft_max_f32 -- softmax with max-subtraction for numerical stability
  • ggml_vec_log_soft_max_f32 -- log-softmax computed as a single fused operation, which avoids taking the logarithm of a softmax output that has already underflowed to zero
  • ggml_vec_cvar_f32 -- centered variance computation that simultaneously centers the data
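The idea of centering and accumulating variance in a single pass can be sketched as follows. This is a minimal illustration, not GGML's source: the function name centered_variance_sketch and its signature are assumptions made for this example, and the mean is assumed to have been computed by the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not GGML's function): writes y[i] = x[i] - mean and
 * returns the population variance, i.e. the mean of the squared deviations. */
double centered_variance_sketch(size_t n, float *y, const float *x, float mean) {
    assert(n > 0);
    double sum = 0.0;              /* double accumulator, in the spirit of ggml_float */
    for (size_t i = 0; i < n; ++i) {
        float d = x[i] - mean;     /* center the element */
        y[i] = d;                  /* keep the centered value for later reuse */
        sum += (double)d * d;      /* accumulate the squared deviation */
    }
    return sum / (double)n;
}
```

Producing the centered data and the variance together means a normalization layer reads the input once instead of twice.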

Type set/copy utilities provide efficient bulk initialization and copying for all supported element types (i8, i16, i32, f16, bf16, f32).

Usage

Vectorized math operations are used as building blocks across the entire CPU backend:

  • Matrix multiplication inner loop: ggml_vec_dot_f32 and its quantized variants are the innermost loop of ggml_compute_forward_mul_mat, executing billions of times per inference.
  • Normalization layers: RMS norm and layer norm use ggml_vec_dot_f32 (for sum-of-squares), ggml_vec_scale_f32, and element-wise multiply to compute normalized outputs.
  • Activation layers: SiLU, GELU, and other activations call the vectorized activation functions for each tensor element.
  • Softmax computation: The attention mechanism's softmax step uses ggml_vec_soft_max_f32, with a GGML_SOFT_MAX_UNROLL factor of 4 for loop unrolling.
  • Training backward passes: Gradient computation reuses the same vectorized primitives (add, multiply, scale) for backward-pass accumulations.

Theoretical Basis

Loop Vectorization with Intrinsics

While compilers can auto-vectorize simple loops, explicit SIMD intrinsics guarantee vectorization regardless of compiler optimization level or loop complexity. GGML's vectorized operations use a two-phase pattern: a vectorized loop body that processes N elements per iteration (where N matches the SIMD register width -- 8 for AVX2 float32), followed by a scalar cleanup loop for remaining elements. For example, ggml_vec_add_f32 processes 8 floats per iteration with _mm256_add_ps on AVX2, then handles the tail element-by-element.
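The two-phase pattern described above can be sketched as follows. This is a minimal illustration under assumptions, not GGML's actual source: the name vec_add_f32_sketch is hypothetical, unaligned loads are used so the caller need not align its buffers, and the vector phase is compiled only when the compiler defines __AVX2__.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical sketch of the two-phase pattern: z = x + y. */
void vec_add_f32_sketch(size_t n, float *z, const float *x, const float *y) {
    assert(x != NULL && y != NULL && z != NULL);
    size_t i = 0;
#if defined(__AVX2__)
    /* Phase 1: eight floats per iteration in one 256-bit register. */
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);   /* unaligned load of x[i..i+7] */
        __m256 vy = _mm256_loadu_ps(y + i);   /* unaligned load of y[i..i+7] */
        _mm256_storeu_ps(z + i, _mm256_add_ps(vx, vy));
    }
#endif
    /* Phase 2: scalar cleanup for the tail (or for everything without AVX2). */
    for (; i < n; ++i) {
        z[i] = x[i] + y[i];
    }
}
```

With n = 11 and AVX2 enabled, the vector loop handles elements 0 to 7 and the scalar loop handles the remaining three.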

Lookup Table Acceleration

For expensive transcendental functions (GELU, Quick GELU), GGML precomputes the function output for every possible FP16 input value (65536 entries of 2 bytes each, or 128 KB of memory per table). Because FP16 has only 65536 distinct bit patterns, the table covers the entire input domain: every result is as accurate as FP16 can represent, and each element costs a single table read indexed by the input's bit pattern. This is significantly faster than evaluating the underlying tanh or exp for every element. The tables are computed once at initialization time.

Unrolling for Pipeline Utilization

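The unroll factors (GGML_VEC_DOT_UNROLL = 2, GGML_VEC_MAD_UNROLL = 32, GGML_SOFT_MAX_UNROLL = 4) are tuned to keep the CPU's execution pipeline full. Modern out-of-order processors can execute multiple independent SIMD instructions in parallel, but only if there are enough independent instructions in flight. In a reduction with a single accumulator, each addition must wait for the previous one to finish, so throughput is limited by the latency of the add rather than by the number of execution units. Unrolling the loop body by a factor of 2-32 provides multiple independent accumulator variables, breaking the data dependency chain that would otherwise serialize the additions. The optimal unroll factor depends on the number of physical SIMD execution units and the latency of the accumulation operation: as a rule of thumb, the number of independent chains should be at least the latency in cycles multiplied by the number of such operations the core can issue per cycle.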
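The effect of independent accumulators can be sketched in scalar C. This is an illustration under assumptions rather than GGML's code: the name dot_f32_unrolled_sketch and the unroll factor of 4 are chosen for the example, and a real implementation would hold each accumulator in a SIMD register.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: dot product with four independent accumulators. */
float dot_f32_unrolled_sketch(size_t n, const float *x, const float *y) {
    assert(x != NULL && y != NULL);
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;
    /* The four updates do not depend on one another, so the CPU can
     * overlap them instead of waiting for each add to complete. */
    for (; i + 4 <= n; i += 4) {
        acc0 += x[i + 0] * y[i + 0];
        acc1 += x[i + 1] * y[i + 1];
        acc2 += x[i + 2] * y[i + 2];
        acc3 += x[i + 3] * y[i + 3];
    }
    /* Combine the partial sums once, after the loop. */
    float sum = (acc0 + acc1) + (acc2 + acc3);
    /* Scalar cleanup for the remaining 0 to 3 elements. */
    for (; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Note that splitting the sum across accumulators changes the order of floating-point additions, so results can differ in the last bits from a strictly sequential loop.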

Numerical Stability in Reductions

The softmax implementations use the standard max-subtraction trick: before exponentiating, the maximum value is subtracted from all elements, so the largest exponent is 0, every exponential lies in (0, 1], and floating-point overflow cannot occur. The result is mathematically unchanged because the common factor exp(-max) cancels between numerator and denominator. The ggml_float type (a typedef for double) is used for accumulation in critical reductions to minimize rounding error when summing many small values, a concern when computing variance or normalizing over long sequences.

Related Pages

Implemented By
