Implementation:Dotnet Machinelearning CpuMathNative Sse

From Leeroopedia


Knowledge Sources
Domains Linear Algebra, SIMD Optimization, Native Interop
Last Updated 2026-02-09 12:00 GMT

Overview

SIMD-accelerated CPU math operations using SSE and AVX intrinsics for high-performance vector and matrix computations in the ML.NET native layer.

Description

Sse.cpp is the core native math kernel for ML.NET, providing C++ implementations of linear algebra primitives that are invoked from managed C# code via P/Invoke. Every exported function uses SSE (Streaming SIMD Extensions) intrinsics to process four single-precision floats simultaneously through 128-bit __m128 registers. The file implements the full set of vector and matrix operations required by ML.NET trainers, including matrix-vector products, element-wise arithmetic, reductions (sum, dot product, norms), and specialized routines for the SDCA (Stochastic Dual Coordinate Ascent) optimizer.

Functions follow a naming convention with suffixes that indicate memory layout assumptions:

  • U suffix: unaligned and unpadded data (most general case)
  • S suffix: sparse vector with an index array
  • P suffix: partial sparse vector (a slice of a larger sparse vector)
  • Tran suffix: the matrix is transposed (column-major interpretation)
  • A suffix: aligned and padded for SSE (16-byte alignment)
  • X suffix: aligned and padded for AVX (32-byte alignment)

Each function processes elements in blocks of 4 (the SSE lane width), with a scalar tail loop for any remaining elements that do not fill a full SIMD register. Several functions, such as Scale and Sum, add alignment-aware code paths that use leading and trailing masks to process the misaligned elements at the start and end of an array, so the bulk of the data can be read with aligned loads.
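The block-of-4-plus-tail structure can be sketched in plain C++ as a simplified Scale (the name Scale_sketch is illustrative; the real kernel adds the alignment-masked paths, which are omitted here):

```cpp
#include <xmmintrin.h>  // SSE

// Sketch of the block-of-4-plus-tail pattern: in-place scaling of a
// float array, 4 lanes per SIMD iteration, scalar loop for the rest.
void Scale_sketch(float a, float* pd, int c)
{
    __m128 va = _mm_set1_ps(a);                   // broadcast a to 4 lanes
    int i = 0;
    for (; i + 4 <= c; i += 4)                    // SIMD body: 4 floats/iter
        _mm_storeu_ps(pd + i, _mm_mul_ps(_mm_loadu_ps(pd + i), va));
    for (; i < c; i++)                            // scalar tail: 0-3 floats
        pd[i] *= a;
}
```

With c = 6, the first four elements go through one SIMD iteration and the last two through the scalar tail.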

Usage

These functions are called internally by ML.NET trainers whenever hardware-accelerated math is needed. On .NET Core, the managed implementations in SseIntrinsics.cs and AvxIntrinsics.cs, built on System.Runtime.Intrinsics, are generally preferred. On .NET Framework and .NET Standard, the P/Invoke path through Thunk.cs into this native library is the primary execution path.

Code Reference

Source Location

Signature

// Matrix operations
EXPORT_API(void) MatMul(const float * pmat, const float * psrc, float * pdst, int crow, int ccol);
EXPORT_API(void) MatMulP(const float * pmat, const int * pposSrc, const float * psrc,
    int posMin, int iposMin, int iposLim, float * pdst, int crow, int ccol);
EXPORT_API(void) MatMulTran(const float * pmat, const float * psrc, float * pdst, int crow, int ccol);

// Scalar and scaling operations
EXPORT_API(void) AddScalarU(float a, float * pd, int c);
EXPORT_API(void) Scale(float a, float * pd, int c);
EXPORT_API(void) ScaleSrcU(float a, const float * ps, float * pd, int c);
EXPORT_API(void) ScaleAddU(float a, float b, float * pd, int c);
EXPORT_API(void) AddScaleU(float a, const float * ps, float * pd, int c);
EXPORT_API(void) AddScaleCopyU(float a, const float * ps, const float * pd, float * pr, int c);
EXPORT_API(void) AddScaleSU(float a, const float * ps, const int * pi, float * pd, int c);

// Element-wise operations
EXPORT_API(void) AddU(const float * ps, float * pd, int c);
EXPORT_API(void) AddSU(const float * ps, const int * pi, float * pd, int c);
EXPORT_API(void) MulElementWiseU(const float * ps1, const float * ps2, float * pd, int c);

// Reduction operations
EXPORT_API(float) Sum(const float * pValues, int length);
EXPORT_API(float) SumSqU(const float * ps, int c);
EXPORT_API(float) SumSqDiffU(float mean, const float * ps, int c);
EXPORT_API(float) SumAbsU(const float * ps, int c);
EXPORT_API(float) SumAbsDiffU(float mean, const float * ps, int c);
EXPORT_API(float) MaxAbsU(const float * ps, int c);
EXPORT_API(float) MaxAbsDiffU(float mean, const float * ps, int c);

// Dot product and distance
EXPORT_API(float) DotU(const float * pa, const float * pb, int c);
EXPORT_API(float) DotSU(const float * pa, const float * pb, const int * pi, int c);
EXPORT_API(float) Dist2(const float * px, const float * py, int c);

// Zeroing operations
EXPORT_API(void) ZeroItemsU(float * pd, int c, const int * pindices, int cindices);
EXPORT_API(void) ZeroMatrixItemsCore(float * pd, int c, int ccol, int cfltRow,
    const int * pindices, int cindices);

// SDCA L1 regularization
EXPORT_API(void) SdcaL1UpdateU(float primalUpdate, const float * ps, float threshold,
    float * pd1, float * pd2, int c);
EXPORT_API(void) SdcaL1UpdateSU(float primalUpdate, const float * ps, const int * pi,
    float threshold, float * pd1, float * pd2, int c);

Import

// P/Invoke declarations from src/Microsoft.ML.CpuMath/Thunk.cs
using System.Runtime.InteropServices;
using System.Security;

internal static unsafe class Thunk
{
    internal const string NativePath = "CpuMathNative";

    [DllImport(NativePath), SuppressUnmanagedCodeSecurity]
    public static extern void MatMul(float* pmat, float* psrc, float* pdst, int crow, int ccol);

    [DllImport(NativePath), SuppressUnmanagedCodeSecurity]
    public static extern void MatMulP(float* pmat, int* pposSrc, float* psrc,
        int posMin, int iposMin, int iposLim, float* pdst, int crow, int ccol);

    [DllImport(NativePath), SuppressUnmanagedCodeSecurity]
    public static extern void MatMulTran(float* pmat, float* psrc, float* pdst, int crow, int ccol);

    [DllImport(NativePath), SuppressUnmanagedCodeSecurity]
    public static extern void Scale(float a, float* pd, int c);

    [DllImport(NativePath), SuppressUnmanagedCodeSecurity]
    public static extern float DotU(float* pa, float* pb, int c);

    [DllImport(NativePath), SuppressUnmanagedCodeSecurity]
    public static extern void SdcaL1UpdateU(float primalUpdate, float* ps,
        float threshold, float* pd1, float* pd2, int c);
    // ... additional P/Invoke declarations for all exported functions
}

I/O Contract

Inputs

MatMul

Name Type Required Description
pmat const float* Yes Pointer to row-major matrix of dimensions crow x ccol, aligned and padded to 16 bytes
psrc const float* Yes Pointer to source vector of length ccol, aligned and padded to 16 bytes
pdst float* Yes Pointer to destination vector of length crow (output buffer, overwritten)
crow int Yes Number of rows in the matrix (must be a multiple of 4)
ccol int Yes Number of columns in the matrix (must be a multiple of 4)

MatMulP (Partial Sparse)

Name Type Required Description
pmat const float* Yes Pointer to row-major matrix
pposSrc const int* Yes Array of column indices representing nonzero positions in the sparse vector
psrc const float* Yes Sparse source values corresponding to pposSrc positions
posMin int Yes Minimum position offset for indexing into pmat and psrc
iposMin int Yes Start index into pposSrc for the partial range
iposLim int Yes End index (exclusive) into pposSrc for the partial range
pdst float* Yes Destination vector of length crow (accumulated into)
crow int Yes Number of rows (must be a multiple of 4)
ccol int Yes Number of columns

Reduction Functions (Sum, SumSqU, DotU, etc.)

Name Type Required Description
ps / pValues / pa const float* Yes Source vector pointer (may be unaligned)
pb const float* Conditional Second source vector for dot product and distance operations
mean float Conditional Subtracted from each element before squaring/absolute value (used by SumSqDiffU, SumAbsDiffU, MaxAbsDiffU)
c / length int Yes Number of elements

SdcaL1UpdateU

Name Type Required Description
primalUpdate float Yes Primal variable update scaling factor
ps const float* Yes Source gradient vector
threshold float Yes L1 regularization threshold for soft-thresholding
pd1 float* Yes Weight vector (updated in-place: pd1[i] += ps[i] * primalUpdate)
pd2 float* Yes Proximal output: soft-threshold of pd1 with threshold
c int Yes Number of elements

Outputs

Name Type Description
pdst (MatMul) float* Result vector of matrix-vector product, length crow
pd (Scale, AddScalarU, etc.) float* Modified in-place destination vector
return (Sum) float Scalar sum of all elements in the vector
return (SumSqU) float Sum of squares: sum(ps[i]^2)
return (SumSqDiffU) float Sum of squared differences: sum((ps[i] - mean)^2)
return (SumAbsU) float Sum of absolute values: sum(|ps[i]|)
return (MaxAbsU) float Maximum absolute value: max(|ps[i]|)
return (DotU) float Dot product: sum(pa[i] * pb[i])
return (Dist2) float Squared Euclidean distance: sum((px[i] - py[i])^2)
pd2 (SdcaL1UpdateU) float* Soft-thresholded weight vector for L1 proximal update

Usage Examples

Matrix-Vector Multiplication

// Multiply a 4x8 matrix by an 8-element vector, producing a 4-element result.
// Both matrix and source must be 16-byte aligned and padded to multiples of 4.
float mat[32] __attribute__((aligned(16)));  // 4 rows x 8 cols
float src[8]  __attribute__((aligned(16)));
float dst[4]  __attribute__((aligned(16)));
// ... populate mat and src ...
MatMul(mat, src, dst, 4, 8);
// dst now contains the 4-element result vector.

Dot Product on Unaligned Data

// Compute dot product of two unaligned float arrays.
float a[] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
float b[] = {5.0f, 4.0f, 3.0f, 2.0f, 1.0f};
float result = DotU(a, b, 5);
// result = 1*5 + 2*4 + 3*3 + 4*2 + 5*1 = 35.0

SDCA L1 Proximal Update

// Perform SDCA L1 update: accumulate gradient into weights, then soft-threshold.
// pd1[i] += ps[i] * primalUpdate
// pd2[i] = sign(pd1[i]) * max(0, |pd1[i]| - threshold)
float gradients[8] = { /* ... */ };
float weights[8]   = { /* ... */ };
float proximal[8]  = { /* ... */ };
SdcaL1UpdateU(0.01f, gradients, 0.001f, weights, proximal, 8);

Sparse Dot Product

// Compute dot product where pb is dense and pa is accessed via index array pi.
// result = sum(pa[pi[k]] * pb[k]) for k in [0, c)
float dense_a[1000] = { /* full feature vector */ };
float sparse_vals[3] = {0.5f, 1.2f, 0.8f};
int   sparse_idx[3]  = {10, 200, 999};
float result = DotSU(dense_a, sparse_vals, sparse_idx, 3);

Implementation Details

SSE Intrinsic Patterns

The file uses several recurring SSE patterns:

Horizontal reduction (used by Sum, DotU, SumSqU, etc.):

// Accumulate 4 partial sums in an __m128 register, then reduce to scalar.
res = _mm_hadd_ps(res, res);   // [a+b, c+d, a+b, c+d]
res = _mm_hadd_ps(res, res);   // [a+b+c+d, ...]
return _mm_cvtss_f32(res);     // extract lowest float

Absolute value via bit masking (used by SumAbsU, MaxAbsU):

// Clear the sign bit of all 4 floats simultaneously.
__m128 mask = _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF));
__m128 abs_val = _mm_and_ps(value, mask);

Alignment handling (used by Scale, Sum):

// Check 16-byte alignment and use masking for boundary elements.
uintptr_t misalignment = (uintptr_t)(pd) % 16;
// Use LeadingAlignmentMask / TrailingAlignmentMask to selectively process elements.
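The leading-element count that drives the mask selection can be sketched as follows (an assumed helper for illustration; the actual kernel indexes precomputed LeadingAlignmentMask/TrailingAlignmentMask tables with a value derived this way):

```cpp
#include <cstdint>

// Sketch: how many floats precede the first 16-byte boundary at or
// after pd. Returns 0 when pd is already 16-byte aligned.
int LeadingCount(const float* pd)
{
    uintptr_t misalignment = (uintptr_t)pd % 16;
    return misalignment == 0 ? 0 : (int)((16 - misalignment) / sizeof(float));
}
```

For a float* that is 4 bytes past a 16-byte boundary, this yields 3: three elements must be handled via the leading mask before aligned 16-byte loads can begin.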

Sparse gather/scatter via macros:

// _load4: gather 4 elements from non-contiguous positions in a dense array.
#define _load4(ps, pi) _mm_setr_ps(ps[pi[0]], ps[pi[1]], ps[pi[2]], ps[pi[3]])
// _store4: scatter 4 elements back using rotate-and-store pattern.
#define _store4(x, pd, pi) \
    _mm_store_ss(pd + pi[0], x); \
    x = _rotate(x); _mm_store_ss(pd + pi[1], x); \
    x = _rotate(x); _mm_store_ss(pd + pi[2], x); \
    x = _rotate(x); _mm_store_ss(pd + pi[3], x)

SDCA L1 Soft-Thresholding

The SdcaL1UpdateU function implements the proximal operator for L1 regularization entirely in SIMD without branching. It uses bitwise operations to extract the sign, compute the absolute value, compare against the threshold, and conditionally zero out elements:

__m128 xSign = _mm_and_ps(xd1, signMask);        // extract sign bit
__m128 xd1Abs = _mm_xor_ps(xd1, xSign);          // absolute value
__m128 xCond = _mm_cmpgt_ps(xd1Abs, xThreshold); // |w| > threshold?
__m128 x2 = _mm_xor_ps(xSign, xThreshold);        // signed threshold
__m128 xd2 = _mm_and_ps(_mm_sub_ps(xd1, x2), xCond); // conditional result

This is equivalent to the scalar formula: pd2[i] = sign(pd1[i]) * max(0, |pd1[i]| - threshold).
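A scalar reference for that formula, useful for checking the branch-free SIMD version against one element at a time (an illustrative helper, not part of the kernel):

```cpp
#include <cmath>

// Scalar soft-thresholding: pd2[i] = sign(w) * max(0, |w| - threshold).
// Matches the branch-free SIMD sequence above element-wise.
float SoftThreshold(float w, float threshold)
{
    float a = std::fabs(w) - threshold;
    return a > 0.0f ? std::copysign(a, w) : 0.0f;
}
```

Weights whose magnitude falls at or below the threshold are zeroed, which is how SDCA's L1 proximal step produces sparse models.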
