Implementation: vLLM SGL-Kernels CPU GEMM
| Knowledge Sources | |
|---|---|
| Domains | CPU_Inference, GEMM, Quantization |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Implements VNNI-format weight packing and BF16/FP16 GEMM kernels using AVX-512 BF16 and Intel AMX intrinsics, adapted from SGLang for vLLM CPU inference.
Description
This file provides the core BF16/FP16 matrix multiplication infrastructure for CPU-accelerated inference. The pack_vnni function converts weight matrices from standard [N, K] layout to VNNI-friendly [K/vnni_blk, N, vnni_blk] format optimized for Intel AMX tile operations, with an INT8 specialization that also computes the s8s8 compensation term. The tinygemm_kernel_nn template implements the inner GEMM kernel using AVX-512 BF16 dot-product instructions (_mm512_dpbf16_ps), with configurable tile sizes and optional bias fusion.
The convert_weight_packed function provides the public API for weight prepacking (supporting 2D and 3D MoE weight tensors), while weight_packed_linear performs the complete linear operation with optional brgemm acceleration via ATen's CPUBLAS backend.
Usage
This code is compiled as part of the vLLM SGL-kernels CPU extension. It is invoked for BF16/FP16 linear layers during CPU inference, including as a building block for the fused MoE kernel.
Code Reference
Source Location
- Repository: vllm
- File: csrc/cpu/sgl-kernels/gemm.cpp
- Lines: 1-464
Signature
```cpp
template <typename packed_t>
inline void pack_vnni(
    packed_t* packed, const packed_t* weight, int N, int K);

template <int BLOCK_N>
inline void s8s8_compensation(int8_t* packed, int K);

at::Tensor convert_weight_packed(at::Tensor& weight);

at::Tensor weight_packed_linear(
    at::Tensor& mat1, at::Tensor& mat2,
    const std::optional<at::Tensor>& bias, bool is_vnni);

template <typename scalar_t>
void tinygemm_kernel(
    const scalar_t* A, const scalar_t* B, scalar_t* C,
    float* Ctmp, int64_t M, int64_t N, int64_t K,
    int64_t lda, int64_t ldb, int64_t ldc, bool brg);
```
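The inner-loop structure of `tinygemm_kernel_nn` can be sketched in scalar float code (an assumption-laden stand-in for the real kernel, which uses `_mm512_dpbf16_ps` on BF16 data, register tiling, and a float scratch buffer): each accumulation step consumes one VNNI pair from A and the matching pair from packed B.

```cpp
// Scalar sketch of the tinygemm inner loop over VNNI-packed B (assumed
// layout [K/2, N, 2]), with optional bias fusion. The real kernel performs
// the pairwise multiply-accumulate with _mm512_dpbf16_ps on BF16 inputs;
// plain float stands in here to keep the sketch self-contained.
void tinygemm_sketch(const float* A, const float* B, float* C,
                     const float* bias, int M, int N, int K) {
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n) {
      float acc = bias ? bias[n] : 0.0f;  // optional fused bias
      for (int kb = 0; kb < K / 2; ++kb)  // one VNNI pair per step
        acc += A[m * K + 2 * kb]     * B[(kb * N + n) * 2 + 0] +
               A[m * K + 2 * kb + 1] * B[(kb * N + n) * 2 + 1];
      C[m * N + n] = acc;
    }
}
```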
Import
#include "common.h"
#include "vec.h"
#include "gemm.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| mat1 | at::Tensor [M, K] | Yes | Left-hand activation matrix (BFloat16 or Half) |
| mat2 / weight | at::Tensor [N, K] or [E, N, K] | Yes | Weight matrix in standard layout; will be packed to VNNI format if is_vnni is false |
| bias | at::Tensor [N] (float) | No | Optional bias vector added to the output |
| is_vnni | bool | Yes | If true, mat2 is already in VNNI-packed format; if false, packing is performed automatically |
Outputs
| Name | Type | Description |
|---|---|---|
| out | at::Tensor [M, N] | Result of the matrix multiplication in the same dtype as mat1 |
| packed_weight | at::Tensor | VNNI-format packed weight tensor from convert_weight_packed (layout: [K/2, N, 2] for BF16/FP16) |
Usage Examples
```cpp
// Pack weights to VNNI format once, for reuse across calls
at::Tensor packed_w = convert_weight_packed(weight);  // [N, K] -> VNNI layout

// Perform the packed linear operation
at::Tensor output = weight_packed_linear(
    activations,  // [M, K] BFloat16
    packed_w,     // VNNI-packed weights
    bias,         // optional [N] float bias
    /*is_vnni=*/true);
```