Implementation: vLLM SGL-kernels INT8 GEMM
| Knowledge Sources | |
|---|---|
| Domains | CPU_Inference, GEMM, Quantization, INT8 |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Implements INT8 weight-and-activation (w8a8) GEMM using AVX-512 VNNI instructions with s8s8-to-u8s8 compensation for CPU inference.
Description
This file provides the highest-performance INT8 matmul path on CPU using the _mm512_dpbusd_epi32 VNNI instruction for u8*s8 dot-product accumulation. Since AVX-512 VNNI lacks native s8*s8 support, the kernel applies a compensation technique: activations are shifted by +128 to convert from signed to unsigned, and a precomputed term 128 * sum(B_column) is subtracted from the result. The tinygemm_kernel_nn template implements tile-level computation with per-row activation scales (As) and per-column weight scales (Bs).
The public APIs are per_token_quant_int8_cpu for dynamic per-token activation quantization, int8_scaled_mm_cpu for the scaled matmul on pre-quantized activations, and int8_scaled_mm_with_quant, which fuses quantization and matmul to reduce memory traffic.
Usage
This code is compiled as part of the vLLM SGL-kernels CPU extension. It is used for INT8 quantized model serving where both weights and activations are quantized to 8-bit, providing maximum throughput on CPUs with VNNI or AMX-INT8 support.
Code Reference
Source Location
- Repository: vllm
- File: csrc/cpu/sgl-kernels/gemm_int8.cpp
- Lines: 1-440
Signature
std::tuple<at::Tensor, at::Tensor> per_token_quant_int8_cpu(
at::Tensor& A);
at::Tensor int8_scaled_mm_cpu(
at::Tensor& mat1,
at::Tensor& mat2,
at::Tensor& scales1,
at::Tensor& scales2,
std::optional<at::Tensor>& bias,
at::ScalarType out_dtype,
bool is_vnni);
at::Tensor int8_scaled_mm_with_quant(
at::Tensor& mat1,
at::Tensor& mat2,
at::Tensor& scales2,
const std::optional<at::Tensor>& bias,
at::ScalarType out_dtype,
bool is_vnni);
template <typename scalar_t>
void tinygemm_kernel(
const uint8_t* A, const int8_t* B, scalar_t* C,
int32_t* Ctmp, const float* As, const float* Bs,
int64_t M, int64_t N, int64_t K,
int64_t lda, int64_t ldb, int64_t ldc, bool brg);
Import
#include "common.h"
#include "vec.h"
#include "gemm.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| mat1 | at::Tensor [M, K] | Yes | Activation matrix; uint8 for int8_scaled_mm_cpu, BF16/Half for int8_scaled_mm_with_quant |
| mat2 | at::Tensor [N, K] (int8) | Yes | Weight matrix in INT8 (signed), optionally in VNNI-packed format with compensation |
| scales1 | at::Tensor [M] (float) | Yes (for int8_scaled_mm_cpu) | Per-token activation quantization scales |
| scales2 | at::Tensor [N] (float) | Yes | Per-channel weight quantization scales |
| bias | at::Tensor [N] (float) | No | Optional bias vector added after the dequantized matmul result |
| out_dtype | at::ScalarType | Yes | Output data type (BFloat16 or Half) |
| is_vnni | bool | Yes | Whether mat2 is already in VNNI-packed format with s8s8 compensation |
Outputs
| Name | Type | Description |
|---|---|---|
| out | at::Tensor [M, N] | Dequantized matmul result in out_dtype |
| Aq | at::Tensor [M, K] (uint8) | Quantized activations from per_token_quant_int8_cpu (shifted by +128) |
| As | at::Tensor [M] (float) | Per-token scales from per_token_quant_int8_cpu |
Usage Examples
// Fused quantize + INT8 matmul (most common path)
at::Tensor output = int8_scaled_mm_with_quant(
activations, // [M, K] BFloat16
packed_weights, // [N, K+4] int8 VNNI-packed with compensation
weight_scales, // [N] float32
bias, // optional [N] float32
at::kBFloat16, // output dtype
/*is_vnni=*/true);
// Or separate quantize + matmul
auto [Aq, As] = per_token_quant_int8_cpu(activations);
at::Tensor output = int8_scaled_mm_cpu(
Aq, packed_weights, As, weight_scales,
bias, at::kBFloat16, /*is_vnni=*/true);