
Implementation:Sgl project Sglang CPU GEMM

From Leeroopedia


Knowledge Sources
Domains GEMM, CPU Compute
Last Updated 2026-02-10 00:00 GMT

Overview

Implements the core GEMM (General Matrix Multiply) infrastructure for CPU kernels, including VNNI format weight packing, data type conversion stubs, and the s8s8 compensation computation.

Description

gemm.cpp is the foundation of all CPU GEMM operations in the sgl-kernel library. It provides two main categories of functionality:

1. VNNI Weight Packing

The pack_vnni function converts weights to VNNI (Vector Neural Network Instructions) format for efficient AMX/AVX-512 computation:

  • BFloat16/Half: Converts from [N, K] to [K/2, N, 2] with a VNNI block size of 2
  • INT8: Converts from [N, K] to [K/4, N, 4] with a VNNI block size of 4, followed by s8s8_compensation
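The layout transform behind these bullets can be written as a plain scalar loop. The sketch below is illustrative only (the real pack_vnni in gemm.cpp is vectorized); it uses uint16_t as a stand-in for the 16-bit BFloat16/Half storage so it is self-contained, and shows how two consecutive K-elements of each weight row become adjacent in the packed [K/2, N, 2] buffer:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Scalar reference of the VNNI-2 layout transform for 16-bit types.
// weight is row-major [N, K]; packed is [K/2, N, 2], so each pair of
// K-elements from one row lands contiguously -- the shape AMX/AVX-512
// dot-product instructions consume. Assumes K is even.
template <typename T>
void pack_vnni2_ref(T* packed, const T* weight, int N, int K) {
  for (int n = 0; n < N; ++n) {
    for (int k = 0; k < K; ++k) {
      // destination index inside [K/2, N, 2]
      packed[(k / 2) * N * 2 + n * 2 + (k % 2)] = weight[n * K + k];
    }
  }
}
```

For N = 2, K = 4 with rows {0,1,2,3} and {10,11,12,13}, the packed buffer becomes {0,1, 10,11, 2,3, 12,13}: the first K-pair of every row, then the second K-pair of every row.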

The s8s8_compensation function computes column-wise sums of packed int8 weights using AVX-512 _mm512_dpbusd_epi32 instructions. The compensation values are stored at offset BLOCK_N * K in the packed buffer, enabling efficient s8s8 GEMM correction at runtime (needed because Intel VNNI uses unsigned-times-signed multiplication).
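A scalar equivalent of that computation is sketched below (the real function uses _mm512_dpbusd_epi32 on 64-byte vectors; this reference loop is only meant to pin down the data layout and the math). Because dpbusd multiplies unsigned activations by signed weights, s8 activations are typically shifted into u8 by adding 128, and the kernel later subtracts 128 times each column's weight sum, which is exactly what is precomputed here:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Scalar reference of s8s8_compensation: after VNNI-4 packing, the
// [K/4, BLOCK_N, 4] int8 buffer is followed by one int32 per output
// column holding the sum of that column's weights, written at byte
// offset BLOCK_N * K. Assumes K is a multiple of 4.
template <int BLOCK_N>
void s8s8_compensation_ref(int8_t* packed, int K) {
  int32_t comp[BLOCK_N] = {0};
  for (int k4 = 0; k4 < K / 4; ++k4)
    for (int n = 0; n < BLOCK_N; ++n)
      for (int i = 0; i < 4; ++i)
        comp[n] += packed[k4 * BLOCK_N * 4 + n * 4 + i];
  // compensation region starts right after the packed weights
  std::memcpy(packed + BLOCK_N * K, comp, sizeof(comp));
}
```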

2. Data Type Conversion Stubs

Two copy_stub template overloads handle bidirectional conversion between scalar types (BFloat16, Half) and float using SIMD vectorization:

  • copy_stub(scalar_t* out, const float* input, int64_t size) - converts float to scalar_t
  • copy_stub(float* out, const scalar_t* input, int64_t size) - converts scalar_t to float

Both use at::vec::Vectorized for SIMD acceleration with GCC unroll pragmas for instruction-level parallelism.
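Stripped of the SIMD machinery, the two overloads reduce to the conversion loops below. This is a self-contained sketch, not the at::vec-based code: double stands in for at::BFloat16/at::Half so it compiles without ATen, and each loop body corresponds to one vectorized-and-unrolled iteration in the real file:

```cpp
#include <cassert>
#include <cstdint>

// Shape of the two copy_stub overloads as plain scalar loops. The real
// implementation converts one at::vec::Vectorized register per step,
// with #pragma GCC unroll for instruction-level parallelism.
template <typename scalar_t>
void copy_stub_ref(scalar_t* out, const float* input, int64_t size) {
  for (int64_t i = 0; i < size; ++i)
    out[i] = static_cast<scalar_t>(input[i]);  // float -> narrow type
}

template <typename scalar_t>
void copy_stub_ref(float* out, const scalar_t* input, int64_t size) {
  for (int64_t i = 0; i < size; ++i)
    out[i] = static_cast<float>(input[i]);     // narrow type -> float
}
```

Overload resolution picks the direction from the pointer types, so call sites read identically whether they widen or narrow.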

Usage

This file is included by all CPU kernel files that perform matrix multiplication (attention, MoE, QKV projection). The pack_vnni function is called during weight preprocessing, and copy_stub is used for accumulator type conversion in GEMM epilogues.

Code Reference

Source Location

gemm.cpp in the sgl-kernel CPU kernel sources

Signature

// S8S8 compensation for VNNI int8 GEMM
template <int BLOCK_N>
inline void s8s8_compensation(int8_t* __restrict__ packed, int K);

// VNNI format packing for BFloat16/Half
template <typename packed_t>
inline void pack_vnni(
    packed_t* __restrict__ packed,
    const packed_t* __restrict__ weight,
    int N, int K);

// VNNI format packing specialization for int8
template <>
inline void pack_vnni<int8_t>(
    int8_t* __restrict__ packed,
    const int8_t* __restrict__ weight,
    int N, int K);

// Float-to-scalar copy with SIMD vectorization
template <typename scalar_t>
inline void copy_stub(
    scalar_t* __restrict__ out,
    const float* __restrict__ input,
    int64_t size);

// Scalar-to-float copy with SIMD vectorization
template <typename scalar_t>
inline void copy_stub(
    float* __restrict__ out,
    const scalar_t* __restrict__ input,
    int64_t size);

Import

#include "gemm.h"
#include "common.h"
#include "vec.h"

I/O Contract

Inputs

Name Type Required Description
weight packed_t* Yes Source weight tensor in row-major [N, K] layout
N int Yes Number of output channels (rows in weight matrix)
K int Yes Number of input channels (columns in weight matrix)
input float* / scalar_t* Yes (for copy_stub) Source data for type conversion
size int64_t Yes (for copy_stub) Number of elements to convert

Outputs

Name Type Description
packed packed_t* VNNI-packed weight buffer: [K/2, N, 2] for BF16/FP16 or [K/4, N, 4] + compensation for INT8
out scalar_t* / float* Type-converted output for copy_stub
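The buffer shapes in the table imply the allocation sizes below. These helpers are hypothetical (gemm.cpp does not export such functions); they only spell out the arithmetic: BF16/FP16 packing is a pure permutation of N*K two-byte elements, while INT8 packing needs N*K bytes plus one int32 compensation slot per output column:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sizing helpers, in bytes, for the packed buffers above.
inline size_t packed_bytes_bf16(int N, int K) {
  return size_t(N) * K * sizeof(uint16_t);          // [K/2, N, 2] permutation
}
inline size_t packed_bytes_int8(int N, int K) {
  return size_t(N) * K                              // [K/4, N, 4] weights
       + size_t(N) * sizeof(int32_t);               // per-column compensation
}
```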

Usage Examples

Pack BFloat16 Weights to VNNI Format

// Convert weight from [N, K] to VNNI format [K/2, N, 2]
at::BFloat16* packed_weight = allocate_packed_buffer(N, K);
pack_vnni<at::BFloat16>(packed_weight, weight_ptr, N, K);

Pack INT8 Weights with S8S8 Compensation

// Convert int8 weight to VNNI format [K/4, N, 4] with compensation
// Buffer must have extra space: BLOCK_N * K + BLOCK_N * sizeof(int32_t)
int8_t* packed_weight = allocate_packed_buffer_int8(BLOCK_N, K);
pack_vnni<int8_t>(packed_weight, weight_ptr, BLOCK_N, K);
// Compensation data is automatically appended at offset BLOCK_N * K

Type Conversion with copy_stub

// Convert float accumulator to BFloat16 output
float accum[1024];
at::BFloat16 output[1024];
copy_stub<at::BFloat16>(output, accum, 1024);
