
Implementation:Vllm project Vllm SGL GEMM INT8

From Leeroopedia


Knowledge Sources
Domains CPU_Inference, GEMM, Quantization, INT8
Last Updated 2026-02-08 00:00 GMT

Overview

Implements INT8 weight-and-activation (w8a8) GEMM using AVX-512 VNNI instructions with s8s8-to-u8s8 compensation for CPU inference.

Description

This file provides the highest-performance INT8 matmul path on CPU using the _mm512_dpbusd_epi32 VNNI instruction for u8*s8 dot-product accumulation. Since AVX-512 VNNI lacks native s8*s8 support, the kernel applies a compensation technique: activations are shifted by +128 to convert from signed to unsigned, and a precomputed term 128 * sum(B_column) is subtracted from the result. The tinygemm_kernel_nn template implements tile-level computation with per-row activation scales (As) and per-column weight scales (Bs).

The public APIs include per_token_quant_int8_cpu for dynamic per-token activation quantization, int8_scaled_mm_cpu for separate quantization and matmul, and int8_scaled_mm_with_quant which fuses both steps for reduced memory traffic.

Usage

This code is compiled as part of the vLLM SGL-kernels CPU extension. It serves INT8-quantized models in which both weights and activations are quantized to 8 bits, providing maximum throughput on CPUs with VNNI or AMX-INT8 support.

Code Reference

Source Location

Signature

std::tuple<at::Tensor, at::Tensor> per_token_quant_int8_cpu(
    at::Tensor& A);

at::Tensor int8_scaled_mm_cpu(
    at::Tensor& mat1,
    at::Tensor& mat2,
    at::Tensor& scales1,
    at::Tensor& scales2,
    std::optional<at::Tensor>& bias,
    at::ScalarType out_dtype,
    bool is_vnni);

at::Tensor int8_scaled_mm_with_quant(
    at::Tensor& mat1,
    at::Tensor& mat2,
    at::Tensor& scales2,
    const std::optional<at::Tensor>& bias,
    at::ScalarType out_dtype,
    bool is_vnni);

template <typename scalar_t>
void tinygemm_kernel(
    const uint8_t* A, const int8_t* B, scalar_t* C,
    int32_t* Ctmp, const float* As, const float* Bs,
    int64_t M, int64_t N, int64_t K,
    int64_t lda, int64_t ldb, int64_t ldc, bool brg);

Import

#include "common.h"
#include "vec.h"
#include "gemm.h"

I/O Contract

Inputs

Name Type Required Description
mat1 at::Tensor [M, K] Yes Activation matrix; uint8 for int8_scaled_mm_cpu, BF16/Half for int8_scaled_mm_with_quant
mat2 at::Tensor [N, K] (int8) Yes Weight matrix in INT8 (signed), optionally in VNNI-packed format with compensation
scales1 at::Tensor [M] (float) Yes (for int8_scaled_mm_cpu) Per-token activation quantization scales
scales2 at::Tensor [N] (float) Yes Per-channel weight quantization scales
bias at::Tensor [N] (float) No Optional bias vector added after the dequantized matmul result
out_dtype at::ScalarType Yes Output data type (BFloat16 or Half)
is_vnni bool Yes Whether mat2 is already in VNNI-packed format with s8s8 compensation

Outputs

Name Type Description
out at::Tensor [M, N] Dequantized matmul result in out_dtype
Aq at::Tensor [M, K] (uint8) Quantized activations from per_token_quant_int8_cpu (shifted by +128)
As at::Tensor [M] (float) Per-token scales from per_token_quant_int8_cpu

Usage Examples

// Fused quantize + INT8 matmul (most common path)
at::Tensor output = int8_scaled_mm_with_quant(
    activations,     // [M, K] BFloat16
    packed_weights,  // [N, K+4] int8 VNNI-packed with compensation
    weight_scales,   // [N] float32
    bias,            // optional [N] float32
    at::kBFloat16,   // output dtype
    /*is_vnni=*/true);

// Or separate quantize + matmul
auto [Aq, As] = per_token_quant_int8_cpu(activations);
at::Tensor output = int8_scaled_mm_cpu(
    Aq, packed_weights, As, weight_scales,
    bias, at::kBFloat16, /*is_vnni=*/true);
