Implementation: Sgl project Sglang CPU GEMM
| Knowledge Sources | |
|---|---|
| Domains | GEMM, CPU Compute |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Implements the core GEMM (General Matrix Multiply) infrastructure for CPU kernels, including VNNI format weight packing, data type conversion stubs, and the s8s8 compensation computation.
Description
gemm.cpp is the foundation of all CPU GEMM operations in the sgl-kernel library. It provides two main categories of functionality:
1. VNNI Weight Packing
The pack_vnni function converts weights to VNNI (Vector Neural Network Instructions) format for efficient AMX/AVX-512 computation:
- BFloat16/Half: Converts from [N, K] to [K/2, N, 2] with a VNNI block size of 2
- INT8: Converts from [N, K] to [K/4, N, 4] with a VNNI block size of 4, followed by s8s8_compensation
The s8s8_compensation function computes column-wise sums of packed int8 weights using AVX-512 _mm512_dpbusd_epi32 instructions. The compensation values are stored at offset BLOCK_N * K in the packed buffer, enabling efficient s8s8 GEMM correction at runtime (needed because Intel VNNI uses unsigned-times-signed multiplication).
2. Data Type Conversion Stubs
Two copy_stub template overloads handle bidirectional conversion between scalar types (BFloat16, Half) and float using SIMD vectorization:
- copy_stub(scalar_t* out, const float* input, int64_t size) - converts float to scalar_t
- copy_stub(float* out, const scalar_t* input, int64_t size) - converts scalar_t to float
Both use at::vec::Vectorized for SIMD acceleration with GCC unroll pragmas for instruction-level parallelism.
Usage
This file is included by all CPU kernel files that perform matrix multiplication (attention, MoE, QKV projection). The pack_vnni function is called during weight preprocessing, and copy_stub is used for accumulator type conversion in GEMM epilogues.
Code Reference
Source Location
- Repository: Sgl_project_Sglang
- File: sgl-kernel/csrc/cpu/gemm.cpp
- Lines: 1-776
Signature
// S8S8 compensation for VNNI int8 GEMM
template <int BLOCK_N>
inline void s8s8_compensation(int8_t* __restrict__ packed, int K);
// VNNI format packing for BFloat16/Half
template <typename packed_t>
inline void pack_vnni(
packed_t* __restrict__ packed,
const packed_t* __restrict__ weight,
int N, int K);
// VNNI format packing specialization for int8
template <>
inline void pack_vnni<int8_t>(
int8_t* __restrict__ packed,
const int8_t* __restrict__ weight,
int N, int K);
// Float-to-scalar copy with SIMD vectorization
template <typename scalar_t>
inline void copy_stub(
scalar_t* __restrict__ out,
const float* __restrict__ input,
int64_t size);
// Scalar-to-float copy with SIMD vectorization
template <typename scalar_t>
inline void copy_stub(
float* __restrict__ out,
const scalar_t* __restrict__ input,
int64_t size);
Import
#include "gemm.h"
#include "common.h"
#include "vec.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| weight | packed_t* | Yes | Source weight tensor in row-major [N, K] layout |
| N | int | Yes | Number of output channels (rows in weight matrix) |
| K | int | Yes | Number of input channels (columns in weight matrix) |
| input | float* / scalar_t* | Yes (for copy_stub) | Source data for type conversion |
| size | int64_t | Yes (for copy_stub) | Number of elements to convert |
Outputs
| Name | Type | Description |
|---|---|---|
| packed | packed_t* | VNNI-packed weight buffer: [K/2, N, 2] for BF16/FP16 or [K/4, N, 4] + compensation for INT8 |
| out | scalar_t* / float* | Type-converted output for copy_stub |
Usage Examples
Pack BFloat16 Weights to VNNI Format
// Convert weight from [N, K] to VNNI format [K/2, N, 2]
at::BFloat16* packed_weight = allocate_packed_buffer(N, K);
pack_vnni<at::BFloat16>(packed_weight, weight_ptr, N, K);
Pack INT8 Weights with S8S8 Compensation
// Convert int8 weight to VNNI format [K/4, N, 4] with compensation
// Buffer must have extra space: BLOCK_N * K + BLOCK_N * sizeof(int32_t)
int8_t* packed_weight = allocate_packed_buffer_int8(BLOCK_N, K);
pack_vnni<int8_t>(packed_weight, weight_ptr, BLOCK_N, K);
// Compensation data is automatically appended at offset BLOCK_N * K
Type Conversion with copy_stub
// Convert float accumulator to BFloat16 output
float accum[1024];
at::BFloat16 output[1024];
copy_stub<at::BFloat16>(output, accum, 1024);