Implementation:Dotnet Machinelearning CpuMathNative Sse
| Knowledge Sources | |
|---|---|
| Domains | Linear Algebra, SIMD Optimization, Native Interop |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
SIMD-accelerated CPU math operations using SSE and AVX intrinsics for high-performance vector and matrix computations in the ML.NET native layer.
Description
Sse.cpp is the core native math kernel for ML.NET, providing C++ implementations of linear algebra primitives that are invoked from managed C# code via P/Invoke. Every exported function uses SSE (Streaming SIMD Extensions) intrinsics to process four single-precision floats simultaneously through 128-bit __m128 registers. The file implements the full set of vector and matrix operations required by ML.NET trainers, including matrix-vector products, element-wise arithmetic, reductions (sum, dot product, norms), and specialized routines for the SDCA (Stochastic Dual Coordinate Ascent) optimizer.
Functions follow a naming convention with suffixes that indicate memory layout assumptions:
- U suffix: unaligned and unpadded data (most general case)
- S suffix: sparse vector with an index array
- P suffix: partial sparse vector (a slice of a larger sparse vector)
- Tran suffix: the matrix is transposed (column-major interpretation)
- A suffix: aligned and padded for SSE (16-byte alignment)
- X suffix: aligned and padded for AVX (32-byte alignment)
Each function processes elements in blocks of 4 (the SSE lane width), with a scalar tail loop for any remaining elements that do not fill a full SIMD register. Several functions, such as Scale and Sum, include alignment-aware code paths that use leading and trailing masks to handle the misaligned elements at array boundaries.
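This block-of-4 structure with a scalar tail can be sketched in portable scalar C++. Here the 4-wide body stands in for a single __m128 operation, using the semantics of AddScalarU as an example; `AddScalarU_ref` is a hypothetical reference name, not an export of Sse.cpp:

```cpp
// Scalar sketch of the block-of-4 loop structure used by the exported kernels.
// In Sse.cpp the 4-wide body is one __m128 operation; here it is unrolled
// scalar code so the remainder handling is easy to follow.
// Semantics shown are those of AddScalarU: pd[i] += a for i in [0, c).
void AddScalarU_ref(float a, float* pd, int c)
{
    int i = 0;
    // Main loop: whole blocks of 4 (one SSE register of floats).
    for (; i + 4 <= c; i += 4)
    {
        pd[i + 0] += a;
        pd[i + 1] += a;
        pd[i + 2] += a;
        pd[i + 3] += a;
    }
    // Scalar tail: the 0-3 leftover elements.
    for (; i < c; ++i)
        pd[i] += a;
}
```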
Usage
These functions are called internally by ML.NET trainers whenever hardware-accelerated math is needed. On .NET Core, the managed C# implementations in SseIntrinsics.cs and AvxIntrinsics.cs, built on System.Runtime.Intrinsics, may be preferred; on .NET Framework and .NET Standard, the P/Invoke path through Thunk.cs into this native library is the primary execution path.
Code Reference
Source Location
- Repository: Dotnet_Machinelearning
- File: src/Native/CpuMathNative/Sse.cpp
- Lines: 1-887
Signature
// Matrix operations
EXPORT_API(void) MatMul(const float * pmat, const float * psrc, float * pdst, int crow, int ccol);
EXPORT_API(void) MatMulP(const float * pmat, const int * pposSrc, const float * psrc,
int posMin, int iposMin, int iposLim, float * pdst, int crow, int ccol);
EXPORT_API(void) MatMulTran(const float * pmat, const float * psrc, float * pdst, int crow, int ccol);
// Scalar and scaling operations
EXPORT_API(void) AddScalarU(float a, float * pd, int c);
EXPORT_API(void) Scale(float a, float * pd, int c);
EXPORT_API(void) ScaleSrcU(float a, const float * ps, float * pd, int c);
EXPORT_API(void) ScaleAddU(float a, float b, float * pd, int c);
EXPORT_API(void) AddScaleU(float a, const float * ps, float * pd, int c);
EXPORT_API(void) AddScaleCopyU(float a, const float * ps, const float * pd, float * pr, int c);
EXPORT_API(void) AddScaleSU(float a, const float * ps, const int * pi, float * pd, int c);
// Element-wise operations
EXPORT_API(void) AddU(const float * ps, float * pd, int c);
EXPORT_API(void) AddSU(const float * ps, const int * pi, float * pd, int c);
EXPORT_API(void) MulElementWiseU(const float * ps1, const float * ps2, float * pd, int c);
// Reduction operations
EXPORT_API(float) Sum(const float * pValues, int length);
EXPORT_API(float) SumSqU(const float * ps, int c);
EXPORT_API(float) SumSqDiffU(float mean, const float * ps, int c);
EXPORT_API(float) SumAbsU(const float * ps, int c);
EXPORT_API(float) SumAbsDiffU(float mean, const float * ps, int c);
EXPORT_API(float) MaxAbsU(const float * ps, int c);
EXPORT_API(float) MaxAbsDiffU(float mean, const float * ps, int c);
// Dot product and distance
EXPORT_API(float) DotU(const float * pa, const float * pb, int c);
EXPORT_API(float) DotSU(const float * pa, const float * pb, const int * pi, int c);
EXPORT_API(float) Dist2(const float * px, const float * py, int c);
// Zeroing operations
EXPORT_API(void) ZeroItemsU(float * pd, int c, const int * pindices, int cindices);
EXPORT_API(void) ZeroMatrixItemsCore(float * pd, int c, int ccol, int cfltRow,
const int * pindices, int cindices);
// SDCA L1 regularization
EXPORT_API(void) SdcaL1UpdateU(float primalUpdate, const float * ps, float threshold,
float * pd1, float * pd2, int c);
EXPORT_API(void) SdcaL1UpdateSU(float primalUpdate, const float * ps, const int * pi,
float threshold, float * pd1, float * pd2, int c);
Import
// P/Invoke declarations from src/Microsoft.ML.CpuMath/Thunk.cs
using System.Runtime.InteropServices;
using System.Security;
internal static unsafe class Thunk
{
internal const string NativePath = "CpuMathNative";
[DllImport(NativePath), SuppressUnmanagedCodeSecurity]
public static extern void MatMul(float* pmat, float* psrc, float* pdst, int crow, int ccol);
[DllImport(NativePath), SuppressUnmanagedCodeSecurity]
public static extern void MatMulP(float* pmat, int* pposSrc, float* psrc,
int posMin, int iposMin, int iposLim, float* pdst, int crow, int ccol);
[DllImport(NativePath), SuppressUnmanagedCodeSecurity]
public static extern void MatMulTran(float* pmat, float* psrc, float* pdst, int crow, int ccol);
[DllImport(NativePath), SuppressUnmanagedCodeSecurity]
public static extern void Scale(float a, float* pd, int c);
[DllImport(NativePath), SuppressUnmanagedCodeSecurity]
public static extern float DotU(float* pa, float* pb, int c);
[DllImport(NativePath), SuppressUnmanagedCodeSecurity]
public static extern void SdcaL1UpdateU(float primalUpdate, float* ps,
float threshold, float* pd1, float* pd2, int c);
// ... additional P/Invoke declarations for all exported functions
}
I/O Contract
Inputs
MatMul
| Name | Type | Required | Description |
|---|---|---|---|
| pmat | const float* | Yes | Pointer to row-major matrix of dimensions crow x ccol, aligned and padded to 16 bytes |
| psrc | const float* | Yes | Pointer to source vector of length ccol, aligned and padded to 16 bytes |
| pdst | float* | Yes | Pointer to destination vector of length crow (output buffer, overwritten) |
| crow | int | Yes | Number of rows in the matrix (must be a multiple of 4) |
| ccol | int | Yes | Number of columns in the matrix (must be a multiple of 4) |
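The contract in this table can be expressed as a scalar reference, ignoring the SIMD alignment and multiple-of-4 requirements. This is a sketch of the documented semantics, not the SSE kernel; `MatMul_ref` is a hypothetical name:

```cpp
// Scalar reference for the documented MatMul contract: pmat is row-major
// with dimensions crow x ccol, and pdst is overwritten with mat * src.
void MatMul_ref(const float* pmat, const float* psrc, float* pdst,
                int crow, int ccol)
{
    for (int r = 0; r < crow; ++r)
    {
        float acc = 0.0f;
        for (int k = 0; k < ccol; ++k)
            acc += pmat[r * ccol + k] * psrc[k];
        pdst[r] = acc;  // destination is overwritten, not accumulated
    }
}
```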
MatMulP (Partial Sparse)
| Name | Type | Required | Description |
|---|---|---|---|
| pmat | const float* | Yes | Pointer to row-major matrix |
| pposSrc | const int* | Yes | Array of column indices representing nonzero positions in the sparse vector |
| psrc | const float* | Yes | Sparse source values corresponding to pposSrc positions |
| posMin | int | Yes | Minimum position offset for indexing into pmat and psrc |
| iposMin | int | Yes | Start index into pposSrc for the partial range |
| iposLim | int | Yes | End index (exclusive) into pposSrc for the partial range |
| pdst | float* | Yes | Destination vector of length crow (accumulated into) |
| crow | int | Yes | Number of rows (must be a multiple of 4) |
| ccol | int | Yes | Number of columns |
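The indexing described in this table can be sketched in scalar form. This is an illustration of the documented contract only (`MatMulP_ref` is a hypothetical name, and the exact offset handling in the native kernel may differ): each row accumulates matrix columns selected by pposSrc, with psrc indexed relative to posMin.

```cpp
// Hedged scalar sketch of the MatMulP contract as documented above:
// for each row r, accumulate pmat[r*ccol + pos] * psrc[pos - posMin]
// over the sparse positions pos = pposSrc[i], i in [iposMin, iposLim).
void MatMulP_ref(const float* pmat, const int* pposSrc, const float* psrc,
                 int posMin, int iposMin, int iposLim,
                 float* pdst, int crow, int ccol)
{
    for (int r = 0; r < crow; ++r)
    {
        for (int i = iposMin; i < iposLim; ++i)
        {
            int pos = pposSrc[i];
            // pdst is accumulated into, per the I/O contract table.
            pdst[r] += pmat[r * ccol + pos] * psrc[pos - posMin];
        }
    }
}
```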
Reduction Functions (Sum, SumSqU, DotU, etc.)
| Name | Type | Required | Description |
|---|---|---|---|
| ps / pValues / pa | const float* | Yes | Source vector pointer (may be unaligned) |
| pb | const float* | Conditional | Second source vector for dot product and distance operations |
| mean | float | Conditional | Subtracted from each element before squaring/absolute value (used by SumSqDiffU, SumAbsDiffU, MaxAbsDiffU) |
| c / length | int | Yes | Number of elements |
SdcaL1UpdateU
| Name | Type | Required | Description |
|---|---|---|---|
| primalUpdate | float | Yes | Primal variable update scaling factor |
| ps | const float* | Yes | Source gradient vector |
| threshold | float | Yes | L1 regularization threshold for soft-thresholding |
| pd1 | float* | Yes | Weight vector (updated in-place: pd1[i] += ps[i] * primalUpdate) |
| pd2 | float* | Yes | Proximal output: soft-threshold of pd1 with threshold |
| c | int | Yes | Number of elements |
Outputs
| Name | Type | Description |
|---|---|---|
| pdst (MatMul) | float* | Result vector of matrix-vector product, length crow |
| pd (Scale, AddScalarU, etc.) | float* | Modified in-place destination vector |
| return (Sum) | float | Scalar sum of all elements in the vector |
| return (SumSqU) | float | Sum of squares: sum(ps[i]^2) |
| return (SumSqDiffU) | float | Sum of squared differences: sum((ps[i] - mean)^2) |
| return (SumAbsU) | float | Sum of absolute values: sum(abs(ps[i])) |
| return (MaxAbsU) | float | Maximum absolute value: max(abs(ps[i])) |
| return (DotU) | float | Dot product: sum(pa[i] * pb[i]) |
| return (Dist2) | float | Squared Euclidean distance: sum((px[i] - py[i])^2) |
| pd2 (SdcaL1UpdateU) | float* | Soft-thresholded weight vector for L1 proximal update |
Usage Examples
Matrix-Vector Multiplication
// Multiply a 4x8 matrix by an 8-element vector, producing a 4-element result.
// Both matrix and source must be 16-byte aligned and padded to multiples of 4.
float mat[32] __attribute__((aligned(16))); // 4 rows x 8 cols
float src[8] __attribute__((aligned(16)));
float dst[4] __attribute__((aligned(16)));
// ... populate mat and src ...
MatMul(mat, src, dst, 4, 8);
// dst now contains the 4-element result vector.
Dot Product on Unaligned Data
// Compute dot product of two unaligned float arrays.
float a[] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
float b[] = {5.0f, 4.0f, 3.0f, 2.0f, 1.0f};
float result = DotU(a, b, 5);
// result = 1*5 + 2*4 + 3*3 + 4*2 + 5*1 = 35.0
SDCA L1 Proximal Update
// Perform SDCA L1 update: accumulate gradient into weights, then soft-threshold.
// pd1[i] += ps[i] * primalUpdate
// pd2[i] = sign(pd1[i]) * max(0, |pd1[i]| - threshold)
float gradients[8] = { /* ... */ };
float weights[8] = { /* ... */ };
float proximal[8] = { /* ... */ };
SdcaL1UpdateU(0.01f, gradients, 0.001f, weights, proximal, 8);
Sparse Dot Product
// Compute a sparse dot product: pa is gathered at the positions in pi,
// while the values in pb are read sequentially.
// result = sum(pa[pi[k]] * pb[k]) for k in [0, c)
float dense_a[1000] = { /* full feature vector */ };
float sparse_vals[3] = {0.5f, 1.2f, 0.8f};
int sparse_idx[3] = {10, 200, 999};
float result = DotSU(dense_a, sparse_vals, sparse_idx, 3);
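The gather semantics of DotSU can be written as a short scalar reference, useful for validating results against the SIMD path (`DotSU_ref` is a hypothetical name):

```cpp
// Scalar reference for the DotSU contract: gather pa at the positions in
// pi, multiply by the sequential values in pb, and sum.
float DotSU_ref(const float* pa, const float* pb, const int* pi, int c)
{
    float acc = 0.0f;
    for (int k = 0; k < c; ++k)
        acc += pa[pi[k]] * pb[k];
    return acc;
}
```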
Implementation Details
SSE Intrinsic Patterns
The file uses several recurring SSE patterns:
Horizontal reduction (used by Sum, DotU, SumSqU, etc.):
// Accumulate 4 partial sums in an __m128 register, then reduce to scalar.
res = _mm_hadd_ps(res, res); // [a+b, c+d, a+b, c+d]
res = _mm_hadd_ps(res, res); // [a+b+c+d, ...]
return _mm_cvtss_f32(res); // extract lowest float
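A scalar analogue of this pattern keeps four independent partial sums (one per SSE lane) and folds them with the same pairing as the two hadd steps; `Sum_ref` is a hypothetical reference name, shown here to make the lane-wise accumulation explicit:

```cpp
// Scalar analogue of the 4-lane accumulate-then-reduce pattern: four
// independent partial sums (one per lane), folded at the end.
float Sum_ref(const float* p, int n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i = 0;
    for (; i + 4 <= n; i += 4)
    {
        s0 += p[i + 0]; s1 += p[i + 1];
        s2 += p[i + 2]; s3 += p[i + 3];
    }
    // Same pairing as the two _mm_hadd_ps steps: (a+b) + (c+d).
    float total = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i)   // scalar tail
        total += p[i];
    return total;
}
```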
Absolute value via bit masking (used by SumAbsU, MaxAbsU):
// Clear the sign bit of all 4 floats simultaneously.
__m128 mask = _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF));
__m128 abs_val = _mm_and_ps(value, mask);
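The same bit trick can be demonstrated portably in scalar code: clearing bit 31 of the IEEE-754 representation yields the absolute value. `abs_via_mask` is a hypothetical helper for illustration:

```cpp
#include <cstdint>
#include <cstring>

// Portable scalar equivalent of the SSE sign-bit mask trick: clearing the
// top bit of the float's bit pattern produces |x| without a branch.
float abs_via_mask(float x)
{
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= 0x7FFFFFFFu;   // same mask the SSE path broadcasts via _mm_set1_epi32
    std::memcpy(&x, &bits, sizeof bits);
    return x;
}
```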
Alignment handling (used by Scale, Sum):
// Check 16-byte alignment and use masking for boundary elements.
uintptr_t misalignment = (uintptr_t)(pd) % 16;
// Use LeadingAlignmentMask / TrailingAlignmentMask to selectively process elements.
Sparse gather/scatter via macros:
// _load4: gather 4 elements from non-contiguous positions in a dense array.
#define _load4(ps, pi) _mm_setr_ps(ps[pi[0]], ps[pi[1]], ps[pi[2]], ps[pi[3]])
// _store4: scatter 4 elements back using rotate-and-store pattern.
#define _store4(x, pd, pi) \
_mm_store_ss(pd + pi[0], x); \
x = _rotate(x); _mm_store_ss(pd + pi[1], x); \
x = _rotate(x); _mm_store_ss(pd + pi[2], x); \
x = _rotate(x); _mm_store_ss(pd + pi[3], x)
SDCA L1 Soft-Thresholding
The SdcaL1UpdateU function implements the proximal operator for L1 regularization entirely in SIMD without branching. It uses bitwise operations to extract the sign, compute the absolute value, compare against the threshold, and conditionally zero out elements:
__m128 xSign = _mm_and_ps(xd1, signMask); // extract sign bit
__m128 xd1Abs = _mm_xor_ps(xd1, xSign); // absolute value
__m128 xCond = _mm_cmpgt_ps(xd1Abs, xThreshold); // |w| > threshold?
__m128 x2 = _mm_xor_ps(xSign, xThreshold); // signed threshold
__m128 xd2 = _mm_and_ps(_mm_sub_ps(xd1, x2), xCond); // conditional result
This is equivalent to the scalar formula: pd2[i] = sign(pd1[i]) * max(0, |pd1[i]| - threshold).
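That scalar formula, together with the gradient accumulation into pd1, can be written as a branch-based reference implementation for checking the SIMD path (`SdcaL1UpdateU_ref` is a hypothetical name):

```cpp
#include <cmath>

// Scalar reference for the branch-free SIMD sequence above. For each element:
// pd1 accumulates the scaled gradient, and pd2 receives the soft-thresholded
// value: sign(pd1[i]) * max(0, |pd1[i]| - threshold).
void SdcaL1UpdateU_ref(float primalUpdate, const float* ps, float threshold,
                       float* pd1, float* pd2, int c)
{
    for (int i = 0; i < c; ++i)
    {
        pd1[i] += ps[i] * primalUpdate;
        float w = pd1[i];
        float shrunk = std::fabs(w) - threshold;
        pd2[i] = (shrunk > 0.0f) ? std::copysign(shrunk, w) : 0.0f;
    }
}
```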