Implementation: Sgl project Sglang CPU GEMM
| Knowledge Sources | |
|---|---|
| Domains | GEMM, CPU Compute |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Implements the core GEMM (General Matrix Multiply) infrastructure for CPU kernels, including VNNI format weight packing, data type conversion stubs, and the s8s8 compensation computation.
Description
gemm.cpp is the foundation of all CPU GEMM operations in the sgl-kernel library. It provides two main categories of functionality:
1. VNNI Weight Packing
The pack_vnni function converts weights to VNNI (Vector Neural Network Instructions) format for efficient AMX/AVX-512 computation:
- BFloat16/Half: Converts from [N, K] to [K/2, N, 2] with a VNNI block size of 2
- INT8: Converts from [N, K] to [K/4, N, 4] with a VNNI block size of 4, followed by s8s8_compensation
The s8s8_compensation function computes column-wise sums of packed int8 weights using AVX-512 _mm512_dpbusd_epi32 instructions. The compensation values are stored at offset BLOCK_N * K in the packed buffer, enabling efficient s8s8 GEMM correction at runtime (needed because Intel VNNI uses unsigned-times-signed multiplication).
2. Data Type Conversion Stubs
Two copy_stub template overloads handle bidirectional conversion between scalar types (BFloat16, Half) and float using SIMD vectorization:
- copy_stub(scalar_t* out, const float* input, int64_t size) - converts float to scalar_t
- copy_stub(float* out, const scalar_t* input, int64_t size) - converts scalar_t to float
Both use at::vec::Vectorized for SIMD acceleration with GCC unroll pragmas for instruction-level parallelism.
Usage
This file is included by all CPU kernel files that perform matrix multiplication (attention, MoE, QKV projection). The pack_vnni function is called during weight preprocessing, and copy_stub is used for accumulator type conversion in GEMM epilogues.
Code Reference
Source Location
- Repository: Sgl_project_Sglang
- File: sgl-kernel/csrc/cpu/gemm.cpp
- Lines: 1-776
Signature
// S8S8 compensation for VNNI int8 GEMM
template <int BLOCK_N>
inline void s8s8_compensation(int8_t* __restrict__ packed, int K);
// VNNI format packing for BFloat16/Half
template <typename packed_t>
inline void pack_vnni(
packed_t* __restrict__ packed,
const packed_t* __restrict__ weight,
int N, int K);
// VNNI format packing specialization for int8
template <>
inline void pack_vnni<int8_t>(
int8_t* __restrict__ packed,
const int8_t* __restrict__ weight,
int N, int K);
// Float-to-scalar copy with SIMD vectorization
template <typename scalar_t>
inline void copy_stub(
scalar_t* __restrict__ out,
const float* __restrict__ input,
int64_t size);
// Scalar-to-float copy with SIMD vectorization
template <typename scalar_t>
inline void copy_stub(
float* __restrict__ out,
const scalar_t* __restrict__ input,
int64_t size);
Import
#include "gemm.h"
#include "common.h"
#include "vec.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| weight | packed_t* | Yes | Source weight tensor in row-major [N, K] layout |
| N | int | Yes | Number of output channels (rows in weight matrix) |
| K | int | Yes | Number of input channels (columns in weight matrix) |
| input | float* / scalar_t* | Yes (for copy_stub) | Source data for type conversion |
| size | int64_t | Yes (for copy_stub) | Number of elements to convert |
Outputs
| Name | Type | Description |
|---|---|---|
| packed | packed_t* | VNNI-packed weight buffer: [K/2, N, 2] for BF16/FP16 or [K/4, N, 4] + compensation for INT8 |
| out | scalar_t* / float* | Type-converted output for copy_stub |
Usage Examples
Pack BFloat16 Weights to VNNI Format
// Convert weight from [N, K] to VNNI format [K/2, N, 2]
at::BFloat16* packed_weight = allocate_packed_buffer(N, K);
pack_vnni<at::BFloat16>(packed_weight, weight_ptr, N, K);
Pack INT8 Weights with S8S8 Compensation
// Convert int8 weight to VNNI format [K/4, N, 4] with compensation
// Buffer must have extra space: BLOCK_N * K + BLOCK_N * sizeof(int32_t)
int8_t* packed_weight = allocate_packed_buffer_int8(BLOCK_N, K);
pack_vnni<int8_t>(packed_weight, weight_ptr, BLOCK_N, K);
// Compensation data is automatically appended at offset BLOCK_N * K
Type Conversion with copy_stub
// Convert float accumulator to BFloat16 output
float accum[1024];
at::BFloat16 output[1024];
copy_stub<at::BFloat16>(output, accum, 1024);