Implementation: vLLM SGL-kernels INT8 GEMM
| Knowledge Sources | |
|---|---|
| Domains | CPU_Inference, GEMM, Quantization, INT8 |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Implements INT8 weight-and-activation (w8a8) GEMM using AVX-512 VNNI instructions with s8s8-to-u8s8 compensation for CPU inference.
Description
This file provides the highest-performance INT8 matmul path on CPU using the _mm512_dpbusd_epi32 VNNI instruction for u8*s8 dot-product accumulation. Since AVX-512 VNNI lacks native s8*s8 support, the kernel applies a compensation technique: activations are shifted by +128 to convert from signed to unsigned, and a precomputed term 128 * sum(B_column) is subtracted from the result. The tinygemm_kernel_nn template implements tile-level computation with per-row activation scales (As) and per-column weight scales (Bs).
The public APIs are per_token_quant_int8_cpu for dynamic per-token activation quantization, int8_scaled_mm_cpu for the scaled matmul on pre-quantized activations, and int8_scaled_mm_with_quant, which fuses quantization and matmul to reduce memory traffic.
Usage
This code is compiled as part of the vLLM SGL-kernels CPU extension. It is used for INT8 quantized model serving where both weights and activations are quantized to 8-bit, providing maximum throughput on CPUs with VNNI or AMX-INT8 support.
Code Reference
Source Location
- Repository: vllm
- File: csrc/cpu/sgl-kernels/gemm_int8.cpp
- Lines: 1-440
Signature
std::tuple<at::Tensor, at::Tensor> per_token_quant_int8_cpu(
at::Tensor& A);
at::Tensor int8_scaled_mm_cpu(
at::Tensor& mat1,
at::Tensor& mat2,
at::Tensor& scales1,
at::Tensor& scales2,
std::optional<at::Tensor>& bias,
at::ScalarType out_dtype,
bool is_vnni);
at::Tensor int8_scaled_mm_with_quant(
at::Tensor& mat1,
at::Tensor& mat2,
at::Tensor& scales2,
const std::optional<at::Tensor>& bias,
at::ScalarType out_dtype,
bool is_vnni);
template <typename scalar_t>
void tinygemm_kernel(
const uint8_t* A, const int8_t* B, scalar_t* C,
int32_t* Ctmp, const float* As, const float* Bs,
int64_t M, int64_t N, int64_t K,
int64_t lda, int64_t ldb, int64_t ldc, bool brg);
Import
#include "common.h"
#include "vec.h"
#include "gemm.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| mat1 | at::Tensor [M, K] | Yes | Activation matrix; uint8 for int8_scaled_mm_cpu, BF16/Half for int8_scaled_mm_with_quant |
| mat2 | at::Tensor [N, K] (int8) | Yes | Weight matrix in INT8 (signed), optionally in VNNI-packed format with compensation |
| scales1 | at::Tensor [M] (float) | Yes (for int8_scaled_mm_cpu) | Per-token activation quantization scales |
| scales2 | at::Tensor [N] (float) | Yes | Per-channel weight quantization scales |
| bias | at::Tensor [N] (float) | No | Optional bias vector added after the dequantized matmul result |
| out_dtype | at::ScalarType | Yes | Output data type (BFloat16 or Half) |
| is_vnni | bool | Yes | Whether mat2 is already in VNNI-packed format with s8s8 compensation |
Outputs
| Name | Type | Description |
|---|---|---|
| out | at::Tensor [M, N] | Dequantized matmul result in out_dtype |
| Aq | at::Tensor [M, K] (uint8) | Quantized activations from per_token_quant_int8_cpu (shifted by +128) |
| As | at::Tensor [M] (float) | Per-token scales from per_token_quant_int8_cpu |
Usage Examples
// Fused quantize + INT8 matmul (most common path)
at::Tensor output = int8_scaled_mm_with_quant(
activations, // [M, K] BFloat16
packed_weights, // [N, K+4] int8 VNNI-packed with compensation
weight_scales, // [N] float32
bias, // optional [N] float32
at::kBFloat16, // output dtype
/*is_vnni=*/true);
// Or separate quantize + matmul
auto [Aq, As] = per_token_quant_int8_cpu(activations);
at::Tensor output = int8_scaled_mm_cpu(
Aq, packed_weights, As, weight_scales,
bias, at::kBFloat16, /*is_vnni=*/true);