Implementation: Sgl_project_Sglang CPU BMM
| Knowledge Sources | |
|---|---|
| Domains | Kernel, Linear Algebra, CPU Optimization |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
CPU-optimized batched matrix multiplication (BMM) kernel using Intel AMX via the brgemm interface for efficient tiled GEMM computation.
Description
The bmm_cpu function implements a high-performance batched matrix multiplication for CPU inference. It takes batched inputs mat1 with shape [B, M, K] and mat2 with shape [B, N, K] (with mat2 in VNNI-packed format) and produces output [B, M, N]. The problem is decomposed into tiles using BLOCK_M and BLOCK_N (derived from AMX tile sizes), then parallelized across the 3D index space [B, MB, NB] using at::parallel_for. Each tile is computed via tinygemm_kernel, which leverages Intel AMX brgemm when the M dimension is large enough (controlled by can_use_brgemm). The weight matrix is automatically converted to VNNI packed format if not already packed. The kernel uses float32 accumulation buffers (Ctmp) for numerical accuracy, and dispatches for reduced floating-point types (BFloat16/Half).
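The tile decomposition described above can be illustrated in plain C++. This is a minimal serial sketch, not the actual kernel: the tile sizes here are illustrative placeholders, whereas the real implementation derives BLOCK_M and BLOCK_N from AMX tile dimensions, splits the flattened [B, MB, NB] index space across threads with at::parallel_for, and computes each tile with tinygemm_kernel/brgemm rather than the scalar loop shown.

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative tile sizes; the real kernel derives these from AMX tiles.
constexpr int64_t BLOCK_M = 2;
constexpr int64_t BLOCK_N = 2;

// mat1: [B, M, K] row-major; mat2: [B, N, K] (K innermost, i.e. the weight
// is pre-transposed relative to a standard [B, K, N] matmul); out: [B, M, N].
void bmm_tiled(float* out, const float* mat1, const float* mat2,
               int64_t B, int64_t M, int64_t N, int64_t K) {
  const int64_t MB = (M + BLOCK_M - 1) / BLOCK_M;
  const int64_t NB = (N + BLOCK_N - 1) / BLOCK_N;
  // Flattened 3D index space [B, MB, NB]; at::parallel_for would
  // distribute this range across threads.
  for (int64_t i = 0; i < B * MB * NB; ++i) {
    const int64_t b  = i / (MB * NB);
    const int64_t mb = (i / NB) % MB;
    const int64_t nb = i % NB;
    const int64_t m0 = mb * BLOCK_M, m1 = std::min(m0 + BLOCK_M, M);
    const int64_t n0 = nb * BLOCK_N, n1 = std::min(n0 + BLOCK_N, N);
    for (int64_t m = m0; m < m1; ++m) {
      for (int64_t n = n0; n < n1; ++n) {
        float acc = 0.f;  // fp32 accumulation, as in the kernel's Ctmp buffer
        for (int64_t k = 0; k < K; ++k) {
          acc += mat1[(b * M + m) * K + k] * mat2[(b * N + n) * K + k];
        }
        out[(b * M + m) * N + n] = acc;
      }
    }
  }
}
```

Note that because mat2 stores K in its last dimension, the inner loop reads both operands contiguously, which is what makes the VNNI/brgemm tiling efficient.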
Usage
Use this kernel for CPU-based LLM serving where batched matrix multiplications are needed, particularly in multi-head attention score computation on Intel Xeon processors with AMX support.
Code Reference
Source Location
- Repository: Sgl_project_Sglang
- File: sgl-kernel/csrc/cpu/bmm.cpp
- Lines: 1-123
Signature
// Internal kernel template
template <typename scalar_t>
void bmm_kernel_impl(
    scalar_t* __restrict__ out,
    const scalar_t* __restrict__ mat1,
    const scalar_t* __restrict__ mat2,
    int64_t B, int64_t M, int64_t N, int64_t K,
    int64_t mat1_strideB, int64_t mat1_strideM,
    int64_t out_strideB, int64_t out_strideM,
    float scale = 0.f);

// Public entry point
void bmm_cpu(
    at::Tensor& out, at::Tensor& mat1, at::Tensor& mat2,
    bool is_vnni, const std::optional<at::Tensor>& scale);
Import
#include "common.h"
#include "gemm.h"
#include "vec.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| out | at::Tensor& | Yes | Output tensor of shape [B, M, N], pre-allocated |
| mat1 | at::Tensor& | Yes | Input tensor of shape [B, M, K], last dim contiguous |
| mat2 | at::Tensor& | Yes | Weight tensor of shape [B, N, K], contiguous |
| is_vnni | bool | Yes | Whether mat2 is already in VNNI-packed format |
| scale | std::optional<at::Tensor> | No | Per-tensor quantization scale (currently unsupported) |
Outputs
| Name | Type | Description |
|---|---|---|
| out | at::Tensor& | Result of batched matmul written in-place, shape [B, M, N] |
Usage Examples
// mat1: [B, M, K], mat2: [B, N, K], out: [B, M, N]
at::Tensor mat1 = /* ... */;
at::Tensor mat2 = /* ... */;
at::Tensor out = at::empty({B, M, N}, mat1.options());
bmm_cpu(out, mat1, mat2, /*is_vnni=*/false, std::nullopt);