Implementation:Dotnet Machinelearning CpuMathNative Sse
| Knowledge Sources | |
|---|---|
| Domains | Linear Algebra, SIMD Optimization, Native Interop |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
SIMD-accelerated CPU math operations using SSE and AVX intrinsics for high-performance vector and matrix computations in the ML.NET native layer.
Description
Sse.cpp is the core native math kernel for ML.NET, providing C++ implementations of linear algebra primitives that are invoked from managed C# code via P/Invoke. Every exported function uses SSE (Streaming SIMD Extensions) intrinsics to process four single-precision floats simultaneously through 128-bit __m128 registers. The file implements the full set of vector and matrix operations required by ML.NET trainers, including matrix-vector products, element-wise arithmetic, reductions (sum, dot product, norms), and specialized routines for the SDCA (Stochastic Dual Coordinate Ascent) optimizer.
Functions follow a naming convention with suffixes that indicate memory layout assumptions:
- U suffix: unaligned and unpadded data (most general case)
- S suffix: sparse vector with an index array
- P suffix: partial sparse vector (a slice of a larger sparse vector)
- Tran suffix: the matrix is transposed (column-major interpretation)
- A suffix: aligned and padded for SSE (16-byte alignment)
- X suffix: aligned and padded for AVX (32-byte alignment)
Each function processes elements in blocks of 4 (the SSE lane width), with a scalar tail loop for any remaining elements that do not fill a full SIMD register. Several functions, such as Scale and Sum, include alignment-aware code paths that use leading and trailing masks to handle the misaligned elements at array boundaries.
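This block-of-4 structure with a scalar tail can be sketched in portable scalar C++. Here the 4-wide body stands in for a single __m128 operation, using the semantics of AddScalarU as an example; `AddScalarU_ref` is a hypothetical reference name, not an export of Sse.cpp:

```cpp
// Scalar sketch of the block-of-4 loop structure used by the exported kernels.
// In Sse.cpp the 4-wide body is one __m128 operation; here it is unrolled
// scalar code so the remainder handling is easy to follow.
// Semantics shown are those of AddScalarU: pd[i] += a for i in [0, c).
void AddScalarU_ref(float a, float* pd, int c)
{
    int i = 0;
    // Main loop: whole blocks of 4 (one SSE register of floats).
    for (; i + 4 <= c; i += 4)
    {
        pd[i + 0] += a;
        pd[i + 1] += a;
        pd[i + 2] += a;
        pd[i + 3] += a;
    }
    // Scalar tail: the 0-3 leftover elements.
    for (; i < c; ++i)
        pd[i] += a;
}
```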
Usage
These functions are called internally by ML.NET trainers whenever hardware-accelerated math is needed. On .NET Core, the managed C# implementations in SseIntrinsics.cs and AvxIntrinsics.cs, built on System.Runtime.Intrinsics, may be preferred; on .NET Framework and .NET Standard, the P/Invoke path through Thunk.cs into this native library is the primary execution path.
Code Reference
Source Location
- Repository: Dotnet_Machinelearning
- File: src/Native/CpuMathNative/Sse.cpp
- Lines: 1-887
Signature
// Matrix operations
EXPORT_API(void) MatMul(const float * pmat, const float * psrc, float * pdst, int crow, int ccol);
EXPORT_API(void) MatMulP(const float * pmat, const int * pposSrc, const float * psrc,
int posMin, int iposMin, int iposLim, float * pdst, int crow, int ccol);
EXPORT_API(void) MatMulTran(const float * pmat, const float * psrc, float * pdst, int crow, int ccol);
// Scalar and scaling operations
EXPORT_API(void) AddScalarU(float a, float * pd, int c);
EXPORT_API(void) Scale(float a, float * pd, int c);
EXPORT_API(void) ScaleSrcU(float a, const float * ps, float * pd, int c);
EXPORT_API(void) ScaleAddU(float a, float b, float * pd, int c);
EXPORT_API(void) AddScaleU(float a, const float * ps, float * pd, int c);
EXPORT_API(void) AddScaleCopyU(float a, const float * ps, const float * pd, float * pr, int c);
EXPORT_API(void) AddScaleSU(float a, const float * ps, const int * pi, float * pd, int c);
// Element-wise operations
EXPORT_API(void) AddU(const float * ps, float * pd, int c);
EXPORT_API(void) AddSU(const float * ps, const int * pi, float * pd, int c);
EXPORT_API(void) MulElementWiseU(const float * ps1, const float * ps2, float * pd, int c);
// Reduction operations
EXPORT_API(float) Sum(const float * pValues, int length);
EXPORT_API(float) SumSqU(const float * ps, int c);
EXPORT_API(float) SumSqDiffU(float mean, const float * ps, int c);
EXPORT_API(float) SumAbsU(const float * ps, int c);
EXPORT_API(float) SumAbsDiffU(float mean, const float * ps, int c);
EXPORT_API(float) MaxAbsU(const float * ps, int c);
EXPORT_API(float) MaxAbsDiffU(float mean, const float * ps, int c);
// Dot product and distance
EXPORT_API(float) DotU(const float * pa, const float * pb, int c);
EXPORT_API(float) DotSU(const float * pa, const float * pb, const int * pi, int c);
EXPORT_API(float) Dist2(const float * px, const float * py, int c);
// Zeroing operations
EXPORT_API(void) ZeroItemsU(float * pd, int c, const int * pindices, int cindices);
EXPORT_API(void) ZeroMatrixItemsCore(float * pd, int c, int ccol, int cfltRow,
const int * pindices, int cindices);
// SDCA L1 regularization
EXPORT_API(void) SdcaL1UpdateU(float primalUpdate, const float * ps, float threshold,
float * pd1, float * pd2, int c);
EXPORT_API(void) SdcaL1UpdateSU(float primalUpdate, const float * ps, const int * pi,
float threshold, float * pd1, float * pd2, int c);
Import
// P/Invoke declarations from src/Microsoft.ML.CpuMath/Thunk.cs
using System.Runtime.InteropServices;
using System.Security;
internal static unsafe class Thunk
{
internal const string NativePath = "CpuMathNative";
[DllImport(NativePath), SuppressUnmanagedCodeSecurity]
public static extern void MatMul(float* pmat, float* psrc, float* pdst, int crow, int ccol);
[DllImport(NativePath), SuppressUnmanagedCodeSecurity]
public static extern void MatMulP(float* pmat, int* pposSrc, float* psrc,
int posMin, int iposMin, int iposLim, float* pdst, int crow, int ccol);
[DllImport(NativePath), SuppressUnmanagedCodeSecurity]
public static extern void MatMulTran(float* pmat, float* psrc, float* pdst, int crow, int ccol);
[DllImport(NativePath), SuppressUnmanagedCodeSecurity]
public static extern void Scale(float a, float* pd, int c);
[DllImport(NativePath), SuppressUnmanagedCodeSecurity]
public static extern float DotU(float* pa, float* pb, int c);
[DllImport(NativePath), SuppressUnmanagedCodeSecurity]
public static extern void SdcaL1UpdateU(float primalUpdate, float* ps,
float threshold, float* pd1, float* pd2, int c);
// ... additional P/Invoke declarations for all exported functions
}
I/O Contract
Inputs
MatMul
| Name | Type | Required | Description |
|---|---|---|---|
| pmat | const float* | Yes | Pointer to row-major matrix of dimensions crow x ccol, aligned and padded to 16 bytes |
| psrc | const float* | Yes | Pointer to source vector of length ccol, aligned and padded to 16 bytes |
| pdst | float* | Yes | Pointer to destination vector of length crow (output buffer, overwritten) |
| crow | int | Yes | Number of rows in the matrix (must be a multiple of 4) |
| ccol | int | Yes | Number of columns in the matrix (must be a multiple of 4) |
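The contract in this table can be expressed as a scalar reference, ignoring the SIMD alignment and multiple-of-4 requirements. This is a sketch of the documented semantics, not the SSE kernel; `MatMul_ref` is a hypothetical name:

```cpp
// Scalar reference for the documented MatMul contract: pmat is row-major
// with dimensions crow x ccol, and pdst is overwritten with mat * src.
void MatMul_ref(const float* pmat, const float* psrc, float* pdst,
                int crow, int ccol)
{
    for (int r = 0; r < crow; ++r)
    {
        float acc = 0.0f;
        for (int k = 0; k < ccol; ++k)
            acc += pmat[r * ccol + k] * psrc[k];
        pdst[r] = acc;  // destination is overwritten, not accumulated
    }
}
```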
MatMulP (Partial Sparse)
| Name | Type | Required | Description |
|---|---|---|---|
| pmat | const float* | Yes | Pointer to row-major matrix |
| pposSrc | const int* | Yes | Array of column indices representing nonzero positions in the sparse vector |
| psrc | const float* | Yes | Sparse source values corresponding to pposSrc positions |
| posMin | int | Yes | Minimum position offset for indexing into pmat and psrc |
| iposMin | int | Yes | Start index into pposSrc for the partial range |
| iposLim | int | Yes | End index (exclusive) into pposSrc for the partial range |
| pdst | float* | Yes | Destination vector of length crow (accumulated into) |
| crow | int | Yes | Number of rows (must be a multiple of 4) |
| ccol | int | Yes | Number of columns |
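The indexing described in this table can be sketched in scalar form. This is an illustration of the documented contract only (`MatMulP_ref` is a hypothetical name, and the exact offset handling in the native kernel may differ): each row accumulates matrix columns selected by pposSrc, with psrc indexed relative to posMin.

```cpp
// Hedged scalar sketch of the MatMulP contract as documented above:
// for each row r, accumulate pmat[r*ccol + pos] * psrc[pos - posMin]
// over the sparse positions pos = pposSrc[i], i in [iposMin, iposLim).
void MatMulP_ref(const float* pmat, const int* pposSrc, const float* psrc,
                 int posMin, int iposMin, int iposLim,
                 float* pdst, int crow, int ccol)
{
    for (int r = 0; r < crow; ++r)
    {
        for (int i = iposMin; i < iposLim; ++i)
        {
            int pos = pposSrc[i];
            // pdst is accumulated into, per the I/O contract table.
            pdst[r] += pmat[r * ccol + pos] * psrc[pos - posMin];
        }
    }
}
```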
Reduction Functions (Sum, SumSqU, DotU, etc.)
| Name | Type | Required | Description |
|---|---|---|---|
| ps / pValues / pa | const float* | Yes | Source vector pointer (may be unaligned) |
| pb | const float* | Conditional | Second source vector for dot product and distance operations |
| mean | float | Conditional | Subtracted from each element before squaring/absolute value (used by SumSqDiffU, SumAbsDiffU, MaxAbsDiffU) |
| c / length | int | Yes | Number of elements |
SdcaL1UpdateU
| Name | Type | Required | Description |
|---|---|---|---|
| primalUpdate | float | Yes | Primal variable update scaling factor |
| ps | const float* | Yes | Source gradient vector |
| threshold | float | Yes | L1 regularization threshold for soft-thresholding |
| pd1 | float* | Yes | Weight vector (updated in-place: pd1[i] += ps[i] * primalUpdate) |
| pd2 | float* | Yes | Proximal output: soft-threshold of pd1 with threshold |
| c | int | Yes | Number of elements |
Outputs
| Name | Type | Description |
|---|---|---|
| pdst (MatMul) | float* | Result vector of matrix-vector product, length crow |
| pd (Scale, AddScalarU, etc.) | float* | Modified in-place destination vector |
| return (Sum) | float | Scalar sum of all elements in the vector |
| return (SumSqU) | float | Sum of squares: sum(ps[i]^2) |
| return (SumSqDiffU) | float | Sum of squared differences: sum((ps[i] - mean)^2) |
| return (SumAbsU) | float | Sum of absolute values: sum(abs(ps[i])) |
| return (MaxAbsU) | float | Maximum absolute value: max(abs(ps[i])) |
| return (DotU) | float | Dot product: sum(pa[i] * pb[i]) |
| return (Dist2) | float | Squared Euclidean distance: sum((px[i] - py[i])^2) |
| pd2 (SdcaL1UpdateU) | float* | Soft-thresholded weight vector for L1 proximal update |
Usage Examples
Matrix-Vector Multiplication
// Multiply a 4x8 matrix by an 8-element vector, producing a 4-element result.
// Both matrix and source must be 16-byte aligned and padded to multiples of 4.
float mat[32] __attribute__((aligned(16))); // 4 rows x 8 cols
float src[8] __attribute__((aligned(16)));
float dst[4] __attribute__((aligned(16)));
// ... populate mat and src ...
MatMul(mat, src, dst, 4, 8);
// dst now contains the 4-element result vector.
Dot Product on Unaligned Data
// Compute dot product of two unaligned float arrays.
float a[] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
float b[] = {5.0f, 4.0f, 3.0f, 2.0f, 1.0f};
float result = DotU(a, b, 5);
// result = 1*5 + 2*4 + 3*3 + 4*2 + 5*1 = 35.0
SDCA L1 Proximal Update
// Perform SDCA L1 update: accumulate gradient into weights, then soft-threshold.
// pd1[i] += ps[i] * primalUpdate
// pd2[i] = sign(pd1[i]) * max(0, |pd1[i]| - threshold)
float gradients[8] = { /* ... */ };
float weights[8] = { /* ... */ };
float proximal[8] = { /* ... */ };
SdcaL1UpdateU(0.01f, gradients, 0.001f, weights, proximal, 8);
Sparse Dot Product
// Compute a sparse dot product: pa is gathered at the positions in pi,
// while the values in pb are read sequentially.
// result = sum(pa[pi[k]] * pb[k]) for k in [0, c)
float dense_a[1000] = { /* full feature vector */ };
float sparse_vals[3] = {0.5f, 1.2f, 0.8f};
int sparse_idx[3] = {10, 200, 999};
float result = DotSU(dense_a, sparse_vals, sparse_idx, 3);
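The gather semantics of DotSU can be written as a short scalar reference, useful for validating results against the SIMD path (`DotSU_ref` is a hypothetical name):

```cpp
// Scalar reference for the DotSU contract: gather pa at the positions in
// pi, multiply by the sequential values in pb, and sum.
float DotSU_ref(const float* pa, const float* pb, const int* pi, int c)
{
    float acc = 0.0f;
    for (int k = 0; k < c; ++k)
        acc += pa[pi[k]] * pb[k];
    return acc;
}
```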
Implementation Details
SSE Intrinsic Patterns
The file uses several recurring SSE patterns:
Horizontal reduction (used by Sum, DotU, SumSqU, etc.):
// Accumulate 4 partial sums in an __m128 register, then reduce to scalar.
res = _mm_hadd_ps(res, res); // [a+b, c+d, a+b, c+d]
res = _mm_hadd_ps(res, res); // [a+b+c+d, ...]
return _mm_cvtss_f32(res); // extract lowest float
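A scalar analogue of this pattern keeps four independent partial sums (one per SSE lane) and folds them with the same pairing as the two hadd steps; `Sum_ref` is a hypothetical reference name, shown here to make the lane-wise accumulation explicit:

```cpp
// Scalar analogue of the 4-lane accumulate-then-reduce pattern: four
// independent partial sums (one per lane), folded at the end.
float Sum_ref(const float* p, int n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i = 0;
    for (; i + 4 <= n; i += 4)
    {
        s0 += p[i + 0]; s1 += p[i + 1];
        s2 += p[i + 2]; s3 += p[i + 3];
    }
    // Same pairing as the two _mm_hadd_ps steps: (a+b) + (c+d).
    float total = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i)   // scalar tail
        total += p[i];
    return total;
}
```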
Absolute value via bit masking (used by SumAbsU, MaxAbsU):
// Clear the sign bit of all 4 floats simultaneously.
__m128 mask = _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF));
__m128 abs_val = _mm_and_ps(value, mask);
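The same bit trick can be demonstrated portably in scalar code: clearing bit 31 of the IEEE-754 representation yields the absolute value. `abs_via_mask` is a hypothetical helper for illustration:

```cpp
#include <cstdint>
#include <cstring>

// Portable scalar equivalent of the SSE sign-bit mask trick: clearing the
// top bit of the float's bit pattern produces |x| without a branch.
float abs_via_mask(float x)
{
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= 0x7FFFFFFFu;   // same mask the SSE path broadcasts via _mm_set1_epi32
    std::memcpy(&x, &bits, sizeof bits);
    return x;
}
```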
Alignment handling (used by Scale, Sum):
// Check 16-byte alignment and use masking for boundary elements.
uintptr_t misalignment = (uintptr_t)(pd) % 16;
// Use LeadingAlignmentMask / TrailingAlignmentMask to selectively process elements.
Sparse gather/scatter via macros:
// _load4: gather 4 elements from non-contiguous positions in a dense array.
#define _load4(ps, pi) _mm_setr_ps(ps[pi[0]], ps[pi[1]], ps[pi[2]], ps[pi[3]])
// _store4: scatter 4 elements back using rotate-and-store pattern.
#define _store4(x, pd, pi) \
_mm_store_ss(pd + pi[0], x); \
x = _rotate(x); _mm_store_ss(pd + pi[1], x); \
x = _rotate(x); _mm_store_ss(pd + pi[2], x); \
x = _rotate(x); _mm_store_ss(pd + pi[3], x)
SDCA L1 Soft-Thresholding
The SdcaL1UpdateU function implements the proximal operator for L1 regularization entirely in SIMD without branching. It uses bitwise operations to extract the sign, compute the absolute value, compare against the threshold, and conditionally zero out elements:
__m128 xSign = _mm_and_ps(xd1, signMask); // extract sign bit
__m128 xd1Abs = _mm_xor_ps(xd1, xSign); // absolute value
__m128 xCond = _mm_cmpgt_ps(xd1Abs, xThreshold); // |w| > threshold?
__m128 x2 = _mm_xor_ps(xSign, xThreshold); // signed threshold
__m128 xd2 = _mm_and_ps(_mm_sub_ps(xd1, x2), xCond); // conditional result
This is equivalent to the scalar formula: pd2[i] = sign(pd1[i]) * max(0, |pd1[i]| - threshold).
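That scalar formula, together with the gradient accumulation into pd1, can be written as a branch-based reference implementation for checking the SIMD path (`SdcaL1UpdateU_ref` is a hypothetical name):

```cpp
#include <cmath>

// Scalar reference for the branch-free SIMD sequence above. For each element:
// pd1 accumulates the scaled gradient, and pd2 receives the soft-thresholded
// value: sign(pd1[i]) * max(0, |pd1[i]| - threshold).
void SdcaL1UpdateU_ref(float primalUpdate, const float* ps, float threshold,
                       float* pd1, float* pd2, int c)
{
    for (int i = 0; i < c; ++i)
    {
        pd1[i] += ps[i] * primalUpdate;
        float w = pd1[i];
        float shrunk = std::fabs(w) - threshold;
        pd2[i] = (shrunk > 0.0f) ? std::copysign(shrunk, w) : 0.0f;
    }
}
```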