Implementation: Sgl_project_Sglang CPU BMM
| Knowledge Sources | |
|---|---|
| Domains | Kernel, Linear Algebra, CPU Optimization |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
CPU-optimized batched matrix multiplication (BMM) kernel using Intel AMX via the brgemm interface for efficient tiled GEMM computation.
Description
The bmm_cpu function implements a high-performance batched matrix multiplication for CPU inference. It takes batched inputs mat1 with shape [B, M, K] and mat2 with shape [B, N, K] (with mat2 in VNNI-packed format) and produces output [B, M, N]. The problem is decomposed into tiles using BLOCK_M and BLOCK_N (derived from AMX tile sizes), then parallelized across the 3D index space [B, MB, NB] using at::parallel_for. Each tile is computed via tinygemm_kernel, which leverages Intel AMX brgemm when the M dimension is large enough (controlled by can_use_brgemm). The weight matrix is automatically converted to VNNI packed format if not already packed. The kernel uses float32 accumulation buffers (Ctmp) for numerical accuracy, and dispatches for reduced floating-point types (BFloat16/Half).
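The tile decomposition described above can be illustrated in plain C++. This is a minimal serial sketch, not the actual kernel: the tile sizes here are illustrative placeholders, whereas the real implementation derives BLOCK_M and BLOCK_N from AMX tile dimensions, splits the flattened [B, MB, NB] index space across threads with at::parallel_for, and computes each tile with tinygemm_kernel/brgemm rather than the scalar loop shown.

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative tile sizes; the real kernel derives these from AMX tiles.
constexpr int64_t BLOCK_M = 2;
constexpr int64_t BLOCK_N = 2;

// mat1: [B, M, K] row-major; mat2: [B, N, K] (K innermost, i.e. the weight
// is pre-transposed relative to a standard [B, K, N] matmul); out: [B, M, N].
void bmm_tiled(float* out, const float* mat1, const float* mat2,
               int64_t B, int64_t M, int64_t N, int64_t K) {
  const int64_t MB = (M + BLOCK_M - 1) / BLOCK_M;
  const int64_t NB = (N + BLOCK_N - 1) / BLOCK_N;
  // Flattened 3D index space [B, MB, NB]; at::parallel_for would
  // distribute this range across threads.
  for (int64_t i = 0; i < B * MB * NB; ++i) {
    const int64_t b  = i / (MB * NB);
    const int64_t mb = (i / NB) % MB;
    const int64_t nb = i % NB;
    const int64_t m0 = mb * BLOCK_M, m1 = std::min(m0 + BLOCK_M, M);
    const int64_t n0 = nb * BLOCK_N, n1 = std::min(n0 + BLOCK_N, N);
    for (int64_t m = m0; m < m1; ++m) {
      for (int64_t n = n0; n < n1; ++n) {
        float acc = 0.f;  // fp32 accumulation, as in the kernel's Ctmp buffer
        for (int64_t k = 0; k < K; ++k) {
          acc += mat1[(b * M + m) * K + k] * mat2[(b * N + n) * K + k];
        }
        out[(b * M + m) * N + n] = acc;
      }
    }
  }
}
```

Note that because mat2 stores K in its last dimension, the inner loop reads both operands contiguously, which is what makes the VNNI/brgemm tiling efficient.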
Usage
Use this kernel for CPU-based LLM serving where batched matrix multiplications are needed, particularly in multi-head attention score computation on Intel Xeon processors with AMX support.
Code Reference
Source Location
- Repository: Sgl_project_Sglang
- File: sgl-kernel/csrc/cpu/bmm.cpp
- Lines: 1-123
Signature
// Internal kernel template
template <typename scalar_t>
void bmm_kernel_impl(
    scalar_t* __restrict__ out,
    const scalar_t* __restrict__ mat1,
    const scalar_t* __restrict__ mat2,
    int64_t B, int64_t M, int64_t N, int64_t K,
    int64_t mat1_strideB, int64_t mat1_strideM,
    int64_t out_strideB, int64_t out_strideM,
    float scale = 0.f);

// Public entry point
void bmm_cpu(
    at::Tensor& out, at::Tensor& mat1, at::Tensor& mat2,
    bool is_vnni, const std::optional<at::Tensor>& scale);
Import
#include "common.h"
#include "gemm.h"
#include "vec.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| out | at::Tensor& | Yes | Output tensor of shape [B, M, N], pre-allocated |
| mat1 | at::Tensor& | Yes | Input tensor of shape [B, M, K], last dim contiguous |
| mat2 | at::Tensor& | Yes | Weight tensor of shape [B, N, K], contiguous |
| is_vnni | bool | Yes | Whether mat2 is already in VNNI-packed format |
| scale | std::optional<at::Tensor> | No | Per-tensor quantization scale (currently unsupported) |
Outputs
| Name | Type | Description |
|---|---|---|
| out | at::Tensor& | Result of batched matmul written in-place, shape [B, M, N] |
Usage Examples
// mat1: [B, M, K], mat2: [B, N, K], out: [B, M, N]
at::Tensor mat1 = /* ... */;
at::Tensor mat2 = /* ... */;
at::Tensor out = at::empty({B, M, N}, mat1.options());
bmm_cpu(out, mat1, mat2, /*is_vnni=*/false, std::nullopt);