
Implementation:Vllm project Vllm SGL GEMM

From Leeroopedia


Knowledge Sources
Domains CPU_Inference, GEMM, Quantization
Last Updated 2026-02-08 00:00 GMT

Overview

Implements VNNI-format weight packing and BF16/FP16 GEMM kernels using AVX-512 BF16 and Intel AMX intrinsics, adapted from SGLang for vLLM CPU inference.

Description

This file provides the core BF16/FP16 matrix multiplication infrastructure for CPU-accelerated inference. The pack_vnni function converts weight matrices from standard [N, K] layout to VNNI-friendly [K/vnni_blk, N, vnni_blk] format optimized for Intel AMX tile operations, with an INT8 specialization that also computes the s8s8 compensation term. The tinygemm_kernel_nn template implements the inner GEMM kernel using AVX-512 BF16 dot-product instructions (_mm512_dpbf16_ps), with configurable tile sizes and optional bias fusion.

The convert_weight_packed function provides the public API for weight prepacking (supporting 2D and 3D MoE weight tensors), while weight_packed_linear performs the complete linear operation with optional brgemm acceleration via ATen's CPUBLAS backend.

Usage

This code is compiled as part of the vLLM SGL-kernels CPU extension. It is invoked for BF16/FP16 linear layers during CPU inference, including as a building block for the fused MoE kernel.

Code Reference

Source Location

Signature

template <typename packed_t>
inline void pack_vnni(
    packed_t* packed, const packed_t* weight, int N, int K);

template <int BLOCK_N>
inline void s8s8_compensation(int8_t* packed, int K);

at::Tensor convert_weight_packed(at::Tensor& weight);

at::Tensor weight_packed_linear(
    at::Tensor& mat1, at::Tensor& mat2,
    const std::optional<at::Tensor>& bias, bool is_vnni);

template <typename scalar_t>
void tinygemm_kernel(
    const scalar_t* A, const scalar_t* B, scalar_t* C,
    float* Ctmp, int64_t M, int64_t N, int64_t K,
    int64_t lda, int64_t ldb, int64_t ldc, bool brg);

Import

#include "common.h"
#include "vec.h"
#include "gemm.h"

I/O Contract

Inputs

Name | Type | Required | Description
mat1 | at::Tensor [M, K] | Yes | Left-hand activation matrix (BFloat16 or Half)
mat2 / weight | at::Tensor [N, K] or [E, N, K] | Yes | Weight matrix in standard layout; packed to VNNI format if is_vnni is false
bias | at::Tensor [N] (float) | No | Optional bias vector added to the output
is_vnni | bool | Yes | If true, mat2 is already in VNNI-packed format; if false, packing is performed automatically

Outputs

Name | Type | Description
out | at::Tensor [M, N] | Result of the matrix multiplication, in the same dtype as mat1
packed_weight | at::Tensor | VNNI-format packed weight tensor from convert_weight_packed (layout: [K/2, N, 2] for BF16/FP16)

Usage Examples

// Pack weights to VNNI format for reuse
at::Tensor packed_w = convert_weight_packed(weight);  // [N, K] -> VNNI layout

// Perform packed linear operation
at::Tensor output = weight_packed_linear(
    activations,   // [M, K] BFloat16
    packed_w,      // VNNI-packed weights
    bias,          // optional [N] float bias
    /*is_vnni=*/true);
