Principle: .NET Machine Learning SIMD Vector Math
| Knowledge Sources | |
|---|---|
| Domains | Linear Algebra, SIMD Optimization, High-Performance Computing |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
SIMD (Single Instruction, Multiple Data) vector math is a hardware-level parallelism technique that executes a single arithmetic operation on multiple data elements simultaneously, enabling significant speedups for linear algebra operations in machine learning.
Description
Modern CPUs include SIMD instruction sets that operate on wide registers, allowing a single instruction to process multiple floating-point values in parallel. For machine learning workloads, which are dominated by linear algebra operations on large float arrays, SIMD provides a critical performance advantage without requiring GPU hardware or multi-threading.
The primary x86 SIMD instruction sets relevant to ML.NET are:
- SSE (Streaming SIMD Extensions): Operates on 128-bit __m128 registers, processing 4 single-precision floats simultaneously. Available on virtually all x86-64 processors. Includes SSE, SSE2, SSE3, SSSE3, SSE4.1, and SSE4.2.
- AVX (Advanced Vector Extensions): Operates on 256-bit __m256 registers, processing 8 single-precision floats simultaneously. Available on Intel Sandy Bridge (2011) and later, AMD Bulldozer and later.
- AVX-512: Operates on 512-bit __m512 registers, processing 16 single-precision floats simultaneously. Available on select Intel Xeon and consumer processors.
SIMD vectorization applies to any operation that can be expressed as an element-wise or reduction pattern over contiguous (or gather-addressable) float arrays. In ML.NET, SIMD is used for:
- Matrix-vector multiplication (the computational core of neural network forward and backward passes)
- Vector scaling, addition, and element-wise products (gradient updates, normalization)
- Reductions (dot products, norms, sums) that combine an entire vector into a scalar
- Proximal operators (L1 soft-thresholding in SDCA)
Usage
SIMD vectorization should be applied whenever:
- The operation processes contiguous or regularly-strided float arrays
- The array length is large enough to amortize SIMD setup overhead (typically >16 elements)
- The same arithmetic operation applies uniformly to all elements
- The target platform supports the required instruction set (SSE is universally available on x86-64; AVX requires runtime detection)
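The runtime-detection requirement above can be sketched in C. This is a simplified illustration, not ML.NET's actual managed-code check; `__builtin_cpu_supports` is a GCC/Clang-specific builtin, and the dispatch function name `pick_simd_path` is invented for this example.

```c
#include <stddef.h>

/* Sketch of runtime SIMD feature detection using GCC/Clang builtins.
   SSE2 is part of the x86-64 baseline ABI, so only wider instruction
   sets need a runtime check before use. */
const char *pick_simd_path(void) {
#if defined(__x86_64__) || defined(_M_X64)
    __builtin_cpu_init();                      /* initialize CPU feature data */
    if (__builtin_cpu_supports("avx512f")) return "avx512";
    if (__builtin_cpu_supports("avx"))     return "avx";
    return "sse";                              /* always available on x86-64 */
#else
    return "scalar";                           /* non-x86 fallback */
#endif
}
```

A real dispatcher would select function pointers per instruction set once at startup, rather than re-checking per call.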
Theoretical Basis
SIMD Execution Model
A SIMD register of width W bits can hold W / 32 single-precision floats. A single SIMD instruction performs the same operation on all lanes simultaneously:
Scalar: c[0] = a[0] + b[0] (1 addition per instruction)
SSE: c[0:3] = a[0:3] + b[0:3] (4 additions per instruction, 128-bit)
AVX: c[0:7] = a[0:7] + b[0:7] (8 additions per instruction, 256-bit)
For a vector of n elements, scalar code requires n operations, while SSE code requires floor(n/4) SIMD operations plus a scalar tail loop for the remaining n mod 4 elements.
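The main-loop-plus-tail structure can be sketched as follows. This is an illustrative example, not ML.NET's implementation; the function name `add_f32` is invented here.

```c
#include <immintrin.h>
#include <stddef.h>

/* c[i] = a[i] + b[i]: SSE processes 4 floats per iteration in the main
   loop; a scalar tail loop handles the remaining n mod 4 elements. */
void add_f32(const float *a, const float *b, float *c, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
    for (; i < n; ++i)                     /* scalar tail */
        c[i] = a[i] + b[i];
}
```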
Theoretical Speedup
The maximum theoretical speedup from SIMD is equal to the lane count:
Speedup_SSE = 4x (128-bit / 32-bit per float)
Speedup_AVX = 8x (256-bit / 32-bit per float)
In practice, speedup is limited by:
- Memory bandwidth: SIMD increases arithmetic throughput but not memory bandwidth. For memory-bound operations (e.g., streaming through a large array with a single pass), the bottleneck is cache/memory access, not ALU throughput.
- Alignment penalties: Misaligned loads and stores may incur latency penalties on some microarchitectures. Aligned operations (_mm_load_ps vs _mm_loadu_ps) require 16-byte alignment for SSE and 32-byte alignment for AVX.
- Horizontal operations: Reductions (sum, max) require cross-lane operations (_mm_hadd_ps, shuffles) that are slower than vertical (element-wise) operations.
- Gather/scatter overhead: Sparse vector operations require gathering non-contiguous elements into SIMD registers, which cannot be done with a single load instruction in SSE (though AVX2 adds _mm256_i32gather_ps).
Key Operation Patterns
Vertical Operations
Element-wise operations where output lane i depends only on input lane i:
pd[i] = ps[i] * a (Scale)
pd[i] = pa[i] + pb[i] (Add)
pd[i] = ps1[i] * ps2[i] (MulElementWise)
pd[i] += a * ps[i] (AddScale: fused multiply-add pattern)
These map directly to _mm_mul_ps, _mm_add_ps, and sequences thereof.
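As a concrete instance, the AddScale pattern pd[i] += a * ps[i] can be sketched with those intrinsics. This is an illustrative example (the function name `add_scale_f32` is invented); SSE has no fused multiply-add, so the pattern is the two-instruction _mm_mul_ps + _mm_add_ps sequence (FMA3, where available, would fuse them).

```c
#include <immintrin.h>
#include <stddef.h>

/* AddScale: pd[i] += a * ps[i], four lanes at a time. */
void add_scale_f32(float a, const float *ps, float *pd, size_t n) {
    __m128 va = _mm_set1_ps(a);            /* broadcast scalar to all 4 lanes */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 prod = _mm_mul_ps(va, _mm_loadu_ps(ps + i));
        _mm_storeu_ps(pd + i, _mm_add_ps(_mm_loadu_ps(pd + i), prod));
    }
    for (; i < n; ++i)                     /* scalar tail */
        pd[i] += a * ps[i];
}
```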
Horizontal Reductions
Operations that collapse a vector into a single scalar:
Sum: result = sum(ps[i]) => accumulate in __m128, then _mm_hadd_ps twice
SumSq: result = sum(ps[i]^2) => multiply, accumulate, hadd
Dot: result = sum(pa[i] * pb[i]) => multiply, accumulate, hadd
MaxAbs: result = max(|ps[i]|) => mask sign bit, _mm_max_ps, then shuffle-reduce
The horizontal add pattern in SSE3:
__m128 res = [a, b, c, d]
_mm_hadd_ps(res, res) => [a+b, c+d, a+b, c+d]
_mm_hadd_ps(res, res) => [a+b+c+d, a+b+c+d, a+b+c+d, a+b+c+d]
_mm_cvtss_f32(res) => a+b+c+d as scalar float
An alternative reduction without SSE3 _mm_hadd_ps uses shuffles:
res = _mm_add_ps(res, _mm_movehl_ps(res, res)); // [a+c, b+d, ...]
res = _mm_add_ps(res, _mm_shuffle_ps(res, res, 0xB1)); // [a+b+c+d, ...]
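Both reduction patterns can be written out as small C helpers. This is a sketch (the function names are invented); the `target("sse3")` attribute lets the _mm_hadd_ps version compile even when the translation unit's baseline is SSE2, and SSE3 itself is present on virtually all x86-64 processors made since 2004.

```c
#include <immintrin.h>

/* Horizontal sum of [a, b, c, d] via the SSE3 hadd pattern. */
__attribute__((target("sse3")))
float hsum_hadd(__m128 v) {
    v = _mm_hadd_ps(v, v);   /* [a+b, c+d, a+b, c+d]         */
    v = _mm_hadd_ps(v, v);   /* [a+b+c+d, a+b+c+d, ...]      */
    return _mm_cvtss_f32(v); /* extract lane 0 as scalar     */
}

/* Same reduction using only SSE shuffles, no SSE3 required. */
float hsum_shuffle(__m128 v) {
    v = _mm_add_ps(v, _mm_movehl_ps(v, v));         /* [a+c, b+d, ...] */
    v = _mm_add_ps(v, _mm_shuffle_ps(v, v, 0xB1));  /* [a+b+c+d, ...]  */
    return _mm_cvtss_f32(v);
}
```

On many microarchitectures the shuffle sequence is actually no slower than haddps, which decodes into multiple shuffle and add micro-ops internally.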
Matrix-Vector Multiplication
For a matrix M of size R x C and vector v of length C, the product Mv is computed by processing 4 rows at a time. Each iteration loads 4 elements from 4 consecutive rows and multiplies by 4 elements of v, accumulating partial sums in 4 __m128 registers. After iterating over all columns, horizontal adds combine the partial sums:
for each group of 4 rows:
    res0 = res1 = res2 = res3 = 0
    for each group of 4 columns:
        x = load 4 elements of v
        res0 += M[row0, col:col+4] * x
        res1 += M[row1, col:col+4] * x
        res2 += M[row2, col:col+4] * x
        res3 += M[row3, col:col+4] * x
    hadd to reduce each res to a single float
    store [res0, res1, res2, res3] to output
This achieves high arithmetic intensity by reusing the loaded source vector across 4 row computations.
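A minimal C realization of this scheme is sketched below, simplified by assuming row-major storage and dimensions that are multiples of 4 (a production kernel would add tail handling and blocking; the function names are invented for this example).

```c
#include <immintrin.h>
#include <stddef.h>

/* SSE-only shuffle reduction: collapse [a, b, c, d] to a+b+c+d. */
static float hsum(__m128 v) {
    v = _mm_add_ps(v, _mm_movehl_ps(v, v));
    v = _mm_add_ps(v, _mm_shuffle_ps(v, v, 0xB1));
    return _mm_cvtss_f32(v);
}

/* out = m * v for a row-major rows x cols matrix; rows and cols are
   assumed to be multiples of 4.  Four accumulators hold partial sums
   for four consecutive rows, reusing each loaded chunk of v. */
void matvec_f32(const float *m, const float *v, float *out,
                size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; r += 4) {
        __m128 res0 = _mm_setzero_ps(), res1 = _mm_setzero_ps();
        __m128 res2 = _mm_setzero_ps(), res3 = _mm_setzero_ps();
        for (size_t c = 0; c < cols; c += 4) {
            __m128 x = _mm_loadu_ps(v + c);   /* reused across 4 rows */
            res0 = _mm_add_ps(res0, _mm_mul_ps(_mm_loadu_ps(m + (r+0)*cols + c), x));
            res1 = _mm_add_ps(res1, _mm_mul_ps(_mm_loadu_ps(m + (r+1)*cols + c), x));
            res2 = _mm_add_ps(res2, _mm_mul_ps(_mm_loadu_ps(m + (r+2)*cols + c), x));
            res3 = _mm_add_ps(res3, _mm_mul_ps(_mm_loadu_ps(m + (r+3)*cols + c), x));
        }
        out[r+0] = hsum(res0);  out[r+1] = hsum(res1);
        out[r+2] = hsum(res2);  out[r+3] = hsum(res3);
    }
}
```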
Branchless Soft-Thresholding
The L1 proximal operator sign(x) * max(0, |x| - threshold) is implemented entirely with bitwise SIMD operations to avoid branch misprediction:
sign = x AND 0x80000000 (extract sign bit)
abs_x = x XOR sign (clear sign bit)
cond = cmpgt(abs_x, threshold) (comparison mask: all 1s if true)
signed_t = sign XOR threshold (negate threshold if x was negative)
result = (x - signed_t) AND cond (zero if |x| <= threshold)
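The five steps above translate directly into SSE bitwise intrinsics. This is a sketch (the function name `soft_threshold_f32` is invented); note that `_mm_set1_ps(-0.0f)` broadcasts the 0x80000000 sign-bit mask.

```c
#include <immintrin.h>
#include <stddef.h>

/* Branchless soft-thresholding: out[i] = sign(x[i]) * max(0, |x[i]| - t). */
void soft_threshold_f32(const float *x, float *out, float t, size_t n) {
    const __m128 sign_mask = _mm_set1_ps(-0.0f);  /* 0x80000000 per lane */
    const __m128 vt = _mm_set1_ps(t);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vx   = _mm_loadu_ps(x + i);
        __m128 sign = _mm_and_ps(vx, sign_mask);   /* extract sign bit     */
        __m128 absx = _mm_xor_ps(vx, sign);        /* clear sign bit       */
        __m128 cond = _mm_cmpgt_ps(absx, vt);      /* all-1s mask if |x|>t */
        __m128 st   = _mm_xor_ps(sign, vt);        /* threshold with x's sign */
        _mm_storeu_ps(out + i, _mm_and_ps(_mm_sub_ps(vx, st), cond));
    }
    for (; i < n; ++i) {                           /* scalar tail */
        float a = x[i] < 0 ? -x[i] : x[i];
        out[i] = a > t ? (x[i] < 0 ? -(a - t) : a - t) : 0.0f;
    }
}
```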
Memory Alignment
SSE aligned loads (_mm_load_ps) require 16-byte alignment and are faster on older microarchitectures. When data is not aligned, the code must either:
- Use unaligned loads (_mm_loadu_ps) throughout, which is simpler but may be slightly slower on some processors
- Handle the misaligned prefix and suffix with masked operations, then use aligned loads for the middle portion
ML.NET's Scale and Sum functions implement the second strategy, using LeadingAlignmentMask and TrailingAlignmentMask lookup tables to selectively process boundary elements.
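The align-the-middle strategy can be sketched in simplified form. Unlike ML.NET's masked-load approach, this sketch handles the misaligned prefix and suffix with plain scalar loops (the function name `scale_f32` is invented); only the structure of the technique is the same.

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* In-place scale pd[i] *= a: scalar prefix until the pointer reaches
   16-byte alignment, aligned SSE loads/stores for the middle, scalar
   suffix for the leftover elements. */
void scale_f32(float *pd, float a, size_t n) {
    size_t i = 0;
    while (i < n && ((uintptr_t)(pd + i) & 15) != 0)  /* misaligned prefix */
        pd[i++] *= a;
    __m128 va = _mm_set1_ps(a);
    for (; i + 4 <= n; i += 4)                        /* aligned middle */
        _mm_store_ps(pd + i, _mm_mul_ps(_mm_load_ps(pd + i), va));
    for (; i < n; ++i)                                /* scalar suffix */
        pd[i] *= a;
}
```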
Sparse Vector Operations
When vectors are sparse (most elements are zero), only the nonzero elements need to be processed. SIMD sparse operations use an index array to gather elements from a dense array:
// Gather: load 4 non-contiguous elements into a SIMD register
__m128 x = _mm_setr_ps(dense[idx[0]], dense[idx[1]], dense[idx[2]], dense[idx[3]]);
// Scatter: store 4 elements back to non-contiguous positions using rotate+store
_mm_store_ss(dense + idx[0], x);
x = _mm_shuffle_ps(x, x, 0x39); // rotate
_mm_store_ss(dense + idx[1], x);
// ... repeat for idx[2] and idx[3]
This is less efficient than contiguous SIMD access but still significantly faster than scalar loops for large sparse vectors, because the multiply-accumulate portion still benefits from SIMD parallelism.
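A sparse dot product built on the gather idiom above can be sketched as follows (illustrative only; the function name `sparse_dot_f32` and the CSR-style `values`/`idx` layout are assumptions for this example).

```c
#include <immintrin.h>
#include <stddef.h>

/* SSE-only shuffle reduction: collapse [a, b, c, d] to a+b+c+d. */
static float hsum(__m128 v) {
    v = _mm_add_ps(v, _mm_movehl_ps(v, v));
    v = _mm_add_ps(v, _mm_shuffle_ps(v, v, 0xB1));
    return _mm_cvtss_f32(v);
}

/* Dot product of a sparse vector (nnz nonzeros in values[], positions in
   idx[]) with a dense vector.  Four dense elements are gathered per
   iteration with _mm_setr_ps; the multiply-accumulate stays vectorized. */
float sparse_dot_f32(const float *values, const int *idx, size_t nnz,
                     const float *dense) {
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= nnz; i += 4) {
        __m128 g = _mm_setr_ps(dense[idx[i]],   dense[idx[i+1]],
                               dense[idx[i+2]], dense[idx[i+3]]);
        acc = _mm_add_ps(acc, _mm_mul_ps(g, _mm_loadu_ps(values + i)));
    }
    float result = hsum(acc);
    for (; i < nnz; ++i)                   /* scalar tail */
        result += values[i] * dense[idx[i]];
    return result;
}
```

On AVX2 targets the four scalar loads in `_mm_setr_ps` could be replaced by a single `_mm256_i32gather_ps`, at the cost of requiring runtime detection.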