Implementation: Sgl_project_Sglang CPU GEMM INT8
| Knowledge Sources | |
|---|---|
| Domains | GEMM, Quantization, CPU Compute |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Implements INT8 weight-only and W8A8 (8-bit weight, 8-bit activation) quantized GEMM for CPU inference using Intel AVX-512 VNNI int8 instructions (e.g. _mm512_dpbusd_epi32).
Description
gemm_int8.cpp provides the CPU implementation of INT8 quantized GEMM, with roughly 2x memory savings relative to BFloat16 weights and improved throughput from integer arithmetic. INT8 quantization offers one of the best accuracy-performance tradeoffs among quantization schemes.
The implementation consists of two main components:
1. Output Dequantization (scale_C)
The scale_C struct template handles post-GEMM output rescaling:
- Subtracts s8s8 compensation (Bcomp) from int32 accumulation results to correct for the signed-unsigned mismatch in Intel VNNI instructions
- Converts int32 to float via _mm512_cvtepi32_ps
- Applies per-token activation scale (As) and per-channel weight scales (Bs) via _mm512_mul_ps
- Optionally fuses bias addition when has_bias is true
- Converts float back to BFloat16 using _mm512_cvtne2ps_pbh for 512-bit packed store
The AVX-512 specialization for BFloat16 output uses Unroll<COLS>{} for compile-time loop unrolling across the column dimension (COLS = BLOCK_N / 16).
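The per-element arithmetic that scale_C vectorizes can be written as a minimal scalar sketch (names follow the signature in Code Reference; the real kernel performs these steps with _mm512 intrinsics and a BFloat16 store, and this reference emits plain float for clarity):

```cpp
#include <cstdint>

// Scalar reference for the rescaling scale_C::apply performs per output
// element: subtract s8s8 compensation, convert to float, apply the
// per-token activation scale and per-channel weight scale, add bias.
void scale_c_reference(
    float* C,                // dequantized output row, length block_n
    const int32_t* Ctmp,     // raw int32 GEMM accumulators
    const int32_t* Bcomp,    // s8s8 compensation, per output channel
    const float* bias,       // optional bias, may be nullptr
    float As,                // per-token activation scale
    const float* Bs,         // per-channel weight scales
    int block_n) {
  for (int n = 0; n < block_n; ++n) {
    float x = static_cast<float>(Ctmp[n] - Bcomp[n]);  // compensate
    x = x * As * Bs[n];                                // rescale
    if (bias != nullptr) x += bias[n];                 // optional bias
    C[n] = x;  // the AVX-512 path converts to BFloat16 here
  }
}
```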
2. Core INT8 GEMM Micro-kernel (tinygemm_kernel_nn)
The tinygemm_kernel_nn struct implements the core uint8-times-int8 dot product GEMM:
- Uses _mm512_dpbusd_epi32 (VNNI instruction) for fused multiply-accumulate: uint8 * int8 -> int32
- Processes BLOCK_M x BLOCK_N tiles with configurable block sizes
- Supports prefetching via PREFETCH_SIZE_K constant
- After accumulation, calls scale_C::apply for output rescaling
The s8s8 compensation mechanism corrects for the fact that Intel VNNI multiplies unsigned by signed operands: when both activations and weights are originally signed int8, the activations are shifted into the uint8 range, and the spurious term this shift introduces is subtracted per output channel via Bcomp.
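The standard form of this trick can be sketched as follows (a reference illustration of the technique; the exact pack-time computation in gemm_int8.cpp may differ in detail). Shifting a signed activation by +128 adds 128 * sum_k(B[k]) to each output channel, which is exactly what Bcomp removes:

```cpp
#include <cstdint>

// s8s8 compensation sketch: emulate the u8*s8 dot product VNNI computes
// after shifting signed activations by +128, then subtract the
// precomputed compensation to recover the true s8*s8 result.
int32_t s8s8_dot(const int8_t* a_s8, const int8_t* b_s8, int k) {
  int32_t acc = 0;    // what the VNNI accumulator holds (u8 * s8)
  int32_t bcomp = 0;  // compensation, precomputed at weight-pack time
  for (int i = 0; i < k; ++i) {
    uint8_t a_u8 = static_cast<uint8_t>(a_s8[i] + 128);  // shift to uint8
    acc += static_cast<int32_t>(a_u8) * b_s8[i];
    bcomp += 128 * b_s8[i];  // 128 * sum_k(B[k]) per output channel
  }
  return acc - bcomp;  // equals the true signed s8*s8 dot product
}
```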
Usage
Use this GEMM variant to serve models with INT8-quantized weights on CPU. It supports both weight-only quantization (INT8 weights with dynamic per-token activation quantization) and full W8A8 quantization.
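The dynamic activation-quantization step can be sketched as below. This is a hypothetical helper, not the routine sgl-kernel actually uses: it quantizes one activation row to symmetric int8, then shifts into the uint8 range that VNNI expects (the shift being undone by the s8s8 compensation described above), and returns the per-token scale As:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical per-token dynamic quantization sketch: symmetric int8
// quantization of one activation row, shifted by +128 into uint8.
// Returns the per-token scale (As) used later for dequantization.
float quantize_row_u8(const float* x, uint8_t* q, int k) {
  float amax = 0.f;
  for (int i = 0; i < k; ++i) amax = std::max(amax, std::fabs(x[i]));
  float scale = amax / 127.f;                    // per-token scale As
  float inv = scale > 0.f ? 1.f / scale : 0.f;
  for (int i = 0; i < k; ++i) {
    int v = static_cast<int>(std::nearbyint(x[i] * inv));  // round
    v = std::min(127, std::max(-127, v));                  // clamp to int8
    q[i] = static_cast<uint8_t>(v + 128);                  // shift to uint8
  }
  return scale;
}
```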
Code Reference
Source Location
- Repository: Sgl_project_Sglang
- File: sgl-kernel/csrc/cpu/gemm_int8.cpp
- Lines: 1-547
Signature
// Output dequantization and rescaling after int8 GEMM
template <typename scalar_t, bool has_bias, int BLOCK_N>
struct scale_C {
  static inline void apply(
      scalar_t* __restrict__ C,
      const int32_t* __restrict__ Ctmp,
      const int32_t* __restrict__ Bcomp,
      const float* __restrict__ bias,
      float As,
      const float* __restrict__ Bs);
};
// AVX-512 specialization for BFloat16 output
template <bool has_bias, int BLOCK_N>
struct scale_C<at::BFloat16, has_bias, BLOCK_N> {
  static inline void apply(
      at::BFloat16* __restrict__ C,
      const int32_t* __restrict__ Ctmp,
      const int32_t* __restrict__ Bcomp,
      const float* __restrict__ bias,
      float As,
      const float* __restrict__ Bs);
};
// Core INT8 GEMM micro-kernel
template <typename scalar_t, bool has_bias, int BLOCK_M, int BLOCK_N>
struct tinygemm_kernel_nn {
  static inline void apply(
      const uint8_t* __restrict__ A,
      const int8_t* __restrict__ B,
      scalar_t* __restrict__ C,
      const float* __restrict__ As,
      const float* __restrict__ Bs,
      const int32_t* __restrict__ Bcomp,
      const float* __restrict__ bias,
      int64_t K, int64_t lda, int64_t ldb, int64_t ldc);
};
Import
#include "common.h"
#include "gemm.h"
#include "vec.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| A | uint8_t* | Yes | Quantized activation matrix in uint8 format |
| B | int8_t* | Yes | VNNI-packed int8 weight matrix [K/4, N, 4] with compensation |
| As | float / float* | Yes | Per-token activation quantization scale (scalar in scale_C; pointer to per-row scales in tinygemm_kernel_nn) |
| Bs | float* | Yes | Per-channel weight quantization scales, length N |
| Bcomp | int32_t* | Yes | S8S8 compensation values for signed-unsigned correction, length N |
| bias | float* | No | Optional bias vector, length N |
| K | int64_t | Yes | Shared dimension (input channels) |
| lda | int64_t | Yes | Leading dimension of activation matrix |
| ldb | int64_t | Yes | Leading dimension of weight matrix |
| ldc | int64_t | Yes | Leading dimension of output matrix |
Outputs
| Name | Type | Description |
|---|---|---|
| C | scalar_t* | Output matrix, shape [BLOCK_M, BLOCK_N], rescaled and optionally bias-added; BFloat16 in the AVX-512 specialization |
Usage Examples
INT8 GEMM with Scale and Compensation
// Perform W8A8 GEMM with s8s8 compensation
tinygemm_kernel_nn<at::BFloat16, /*has_bias=*/true, /*BLOCK_M=*/4, /*BLOCK_N=*/32>::apply(
    quant_activations,  // A: uint8 [M, K]
    packed_weights,     // B: int8 VNNI [K/4, N, 4]
    output,             // C: BF16 [M, N]
    &act_scale,         // As: per-token activation scale
    weight_scales,      // Bs: per-channel weight scales
    compensation,       // Bcomp: s8s8 correction values
    bias_ptr,           // bias: optional bias vector
    K, lda, ldb, ldc);
Output Rescaling Only
// Apply scale and compensation to raw int32 GEMM output
scale_C<at::BFloat16, /*has_bias=*/false, /*BLOCK_N=*/32>::apply(
    bf16_output,     // C: BF16 output
    int32_accum,     // Ctmp: raw int32 accumulation
    compensation,    // Bcomp: s8s8 compensation
    nullptr,         // bias: not used
    act_scale,       // As: activation scale
    weight_scales);  // Bs: per-channel weight scales