Implementation:Sgl project Sglang CPU BMM

From Leeroopedia


Knowledge Sources
Domains Kernel, Linear Algebra, CPU Optimization
Last Updated 2026-02-10 00:00 GMT

Overview

CPU-optimized batched matrix multiplication (BMM) kernel using Intel AMX via the brgemm interface for efficient tiled GEMM computation.

Description

The bmm_cpu function implements a high-performance batched matrix multiplication for CPU inference. It takes batched inputs mat1 with shape [B, M, K] and mat2 with shape [B, N, K] (with mat2 in VNNI-packed format) and produces output [B, M, N]. The problem is decomposed into tiles using BLOCK_M and BLOCK_N (derived from AMX tile sizes) and parallelized across the flattened 3D index space [B, MB, NB] using at::parallel_for. Each tile is computed by tinygemm_kernel, which uses the Intel AMX brgemm path when the M dimension is large enough (gated by can_use_brgemm). If mat2 is not already VNNI-packed, the kernel packs it automatically. Accumulation is done in float32 buffers (Ctmp) for numerical accuracy, and the kernel is dispatched for reduced-precision floating-point types (BFloat16/Half).
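The tile decomposition and flat [B, MB, NB] index space described above can be sketched as a serial, scalar reference in plain C++. This is an illustrative sketch under assumed block sizes, not the source kernel: the real implementation hands the flat tile range to at::parallel_for and computes each tile with AMX-backed tinygemm_kernel rather than the naive inner loop shown here.

```cpp
#include <algorithm>
#include <cstdint>

// Placeholder tile sizes; the real kernel derives these from AMX tile shapes.
constexpr int64_t BLOCK_M = 32;
constexpr int64_t BLOCK_N = 32;

// Scalar reference: out[B, M, N] = mat1[B, M, K] x mat2[B, N, K]^T per batch.
void bmm_ref(float* out, const float* mat1, const float* mat2,
             int64_t B, int64_t M, int64_t N, int64_t K) {
  const int64_t MB = (M + BLOCK_M - 1) / BLOCK_M;  // tiles along M
  const int64_t NB = (N + BLOCK_N - 1) / BLOCK_N;  // tiles along N
  // Walk the flattened [B, MB, NB] tile space (parallelized in the real kernel).
  for (int64_t i = 0; i < B * MB * NB; ++i) {
    const int64_t b  = i / (MB * NB);
    const int64_t mb = (i / NB) % MB;
    const int64_t nb = i % NB;
    const int64_t m0 = mb * BLOCK_M, m1 = std::min(m0 + BLOCK_M, M);
    const int64_t n0 = nb * BLOCK_N, n1 = std::min(n0 + BLOCK_N, N);
    for (int64_t m = m0; m < m1; ++m) {
      for (int64_t n = n0; n < n1; ++n) {
        float acc = 0.f;  // fp32 accumulation, standing in for Ctmp
        for (int64_t k = 0; k < K; ++k) {
          // mat2 is laid out [B, N, K]: row n of batch b is contiguous over K.
          acc += mat1[(b * M + m) * K + k] * mat2[(b * N + n) * K + k];
        }
        out[(b * M + m) * N + n] = acc;
      }
    }
  }
}
```

Note how the [B, N, K] layout of mat2 makes each inner dot product read both operands contiguously over K, which is also what the VNNI packing exploits.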

Usage

Use this kernel for CPU-based LLM serving where batched matrix multiplications are needed, particularly in multi-head attention score computation on Intel Xeon processors with AMX support.

Code Reference

Source Location

Signature

// Internal kernel template
template <typename scalar_t>
void bmm_kernel_impl(
    scalar_t* __restrict__ out,
    const scalar_t* __restrict__ mat1,
    const scalar_t* __restrict__ mat2,
    int64_t B, int64_t M, int64_t N, int64_t K,
    int64_t mat1_strideB, int64_t mat1_strideM,
    int64_t out_strideB, int64_t out_strideM,
    float scale = 0.f);

// Public entry point
void bmm_cpu(
    at::Tensor& out, at::Tensor& mat1, at::Tensor& mat2,
    bool is_vnni, const std::optional<at::Tensor>& scale);

Import

#include "common.h"
#include "gemm.h"
#include "vec.h"

I/O Contract

Inputs

Name Type Required Description
out at::Tensor& Yes Output tensor of shape [B, M, N], pre-allocated
mat1 at::Tensor& Yes Input tensor of shape [B, M, K], last dim contiguous
mat2 at::Tensor& Yes Weight tensor of shape [B, N, K], contiguous
is_vnni bool Yes Whether mat2 is already in VNNI-packed format
scale std::optional<at::Tensor> No Per-tensor quantization scale (currently unsupported)

Outputs

Name Type Description
out at::Tensor& Result of batched matmul written in-place, shape [B, M, N]

Usage Examples

// mat1: [B, M, K], mat2: [B, N, K], out: [B, M, N]
at::Tensor mat1 = /* ... */;
at::Tensor mat2 = /* ... */;
at::Tensor out = at::empty({B, M, N}, mat1.options());
bmm_cpu(out, mat1, mat2, /*is_vnni=*/false, std::nullopt);
