Implementation:Sgl project Sglang CPU GEMM INT8

From Leeroopedia


Knowledge Sources
Domains GEMM, Quantization, CPU Compute
Last Updated 2026-02-10 00:00 GMT

Overview

Implements INT8 weight-only and W8A8 (8-bit weight, 8-bit activation) quantized GEMM for CPU inference using Intel AMX int8 instructions.

Description

gemm_int8.cpp provides the CPU implementation of INT8 quantized GEMM, cutting weight storage roughly in half relative to BFloat16 and improving throughput through integer arithmetic. INT8 quantization offers one of the best accuracy-performance tradeoffs among quantization schemes.
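For orientation, here is a minimal sketch of symmetric per-channel INT8 weight quantization (hypothetical helper names; this is not the file's actual packing routine): each BFloat16 weight becomes one int8 byte plus a shared per-channel scale, which is where the roughly 2x memory saving comes from.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric per-channel INT8 quantization (illustrative sketch): each
// output channel n gets its own scale Bs[n] = max|W[:,n]| / 127, and the
// weights are stored as int8 — 1 byte each instead of 2 for BFloat16.
struct QuantizedWeights {
  std::vector<int8_t> q;     // [K, N] quantized weights
  std::vector<float> scale;  // [N] per-channel scales
};

inline QuantizedWeights quantize_per_channel(const std::vector<float>& w,
                                             int K, int N) {
  QuantizedWeights out{std::vector<int8_t>(K * N), std::vector<float>(N)};
  for (int n = 0; n < N; ++n) {
    float amax = 0.f;
    for (int k = 0; k < K; ++k)
      amax = std::max(amax, std::fabs(w[k * N + n]));
    float s = amax > 0.f ? amax / 127.f : 1.f;
    out.scale[n] = s;
    for (int k = 0; k < K; ++k)
      out.q[k * N + n] = static_cast<int8_t>(std::lround(w[k * N + n] / s));
  }
  return out;
}

// Dequantize one element back to float: w ≈ q * scale.
inline float dequant(const QuantizedWeights& qw, int k, int n, int N) {
  return qw.q[k * N + n] * qw.scale[n];
}
```

The roundtrip error per weight is bounded by half a quantization step (scale / 2), which is the accuracy side of the tradeoff mentioned above.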

The implementation consists of two main components:

1. Output Dequantization (scale_C)

The scale_C struct template handles post-GEMM output rescaling:

  • Subtracts s8s8 compensation (Bcomp) from int32 accumulation results to correct for the signed-unsigned mismatch in Intel VNNI instructions
  • Converts int32 to float via _mm512_cvtepi32_ps
  • Applies per-token activation scale (As) and per-channel weight scales (Bs) via _mm512_mul_ps
  • Optionally fuses bias addition when has_bias is true
  • Converts float back to BFloat16 using _mm512_cvtne2ps_pbh for 512-bit packed store

The AVX-512 specialization for BFloat16 output uses Unroll<COLS>{} for compile-time loop unrolling across the column dimension (COLS = BLOCK_N / 16).
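Per output element, the steps above reduce to the following scalar reference (a sketch with illustrative names; the real code processes 16 lanes at a time with AVX-512 and converts the final float to BFloat16, which is omitted here):

```cpp
#include <cassert>
#include <cstdint>

// Scalar reference for the per-element math that scale_C vectorizes.
inline void scale_c_ref(float* C, const int32_t* Ctmp, const int32_t* Bcomp,
                        const float* bias, float As, const float* Bs,
                        int block_n, bool has_bias) {
  for (int n = 0; n < block_n; ++n) {
    // 1. subtract the s8s8 compensation from the raw int32 accumulator
    int32_t acc = Ctmp[n] - Bcomp[n];
    // 2. convert to float, apply per-token (As) and per-channel (Bs) scales
    float val = static_cast<float>(acc) * As * Bs[n];
    // 3. optionally fuse the bias addition
    if (has_bias) val += bias[n];
    C[n] = val;  // the real kernel stores BFloat16 via _mm512_cvtne2ps_pbh
  }
}
```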

2. Core INT8 GEMM Micro-kernel (tinygemm_kernel_nn)

The tinygemm_kernel_nn struct implements the core uint8-times-int8 dot product GEMM:

  • Uses _mm512_dpbusd_epi32 (VNNI instruction) for fused multiply-accumulate: uint8 * int8 -> int32
  • Processes BLOCK_M x BLOCK_N tiles with configurable block sizes
  • Supports prefetching via PREFETCH_SIZE_K constant
  • After accumulation, calls scale_C::apply for output rescaling

The s8s8 compensation mechanism corrects for the fact that Intel VNNI multiplies an unsigned operand by a signed one: when both activations and weights are logically signed int8, the activations are stored shifted into uint8, and the constant offset this introduces is removed by subtracting the precomputed Bcomp term.
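The arithmetic behind the compensation can be checked with a scalar model (illustrative helper names, not the kernel's API). Storing a signed activation a as a_u8 = a + 128 gives sum_k (a_k + 128) * b_k = sum_k a_k * b_k + 128 * sum_k b_k, so subtracting Bcomp = 128 * sum_k b_k recovers the signed dot product:

```cpp
#include <cassert>
#include <cstdint>

// What the VNNI instruction computes per group: unsigned a times signed b.
inline int32_t dot_u8s8(const uint8_t* a, const int8_t* b, int K) {
  int32_t acc = 0;
  for (int k = 0; k < K; ++k) acc += int32_t(a[k]) * int32_t(b[k]);
  return acc;
}

// Per-channel compensation: 128 times the column sum of the int8 weights.
inline int32_t s8s8_compensation(const int8_t* b, int K) {
  int32_t sum = 0;
  for (int k = 0; k < K; ++k) sum += b[k];
  return 128 * sum;
}

// Reference: the signed int8 x int8 dot product we actually want.
inline int32_t dot_s8s8(const int8_t* a, const int8_t* b, int K) {
  int32_t acc = 0;
  for (int k = 0; k < K; ++k) acc += int32_t(a[k]) * int32_t(b[k]);
  return acc;
}
```

In the real kernel Bcomp is precomputed once per weight column at packing time, so the correction costs only one int32 subtraction per output element.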

Usage

Use this GEMM variant to serve models with INT8-quantized weights on CPU. It supports both weight-only quantization (W8A16, with activations quantized dynamically at runtime) and full W8A8 quantization.
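Either way, activations must reach the kernel as uint8 with a per-token scale. A minimal sketch of dynamic per-token quantization consistent with that contract (symmetric signed quantization shifted into uint8; a hypothetical helper, not the file's actual quantization routine):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Dynamically quantize one activation row ("token") to uint8.
// Symmetric signed scale s = max|x| / 127, then a +128 shift so values
// fit the unsigned operand of the VNNI dot product; the shift is later
// cancelled by the s8s8 compensation term. Returns the per-token scale As.
inline float quantize_token(const float* x, uint8_t* q, int K) {
  float amax = 0.f;
  for (int k = 0; k < K; ++k) amax = std::max(amax, std::fabs(x[k]));
  float s = amax > 0.f ? amax / 127.f : 1.f;
  for (int k = 0; k < K; ++k) {
    long v = std::lround(x[k] / s);          // signed int8 value
    v = std::min(127L, std::max(-128L, v));  // clamp to int8 range
    q[k] = static_cast<uint8_t>(v + 128);    // shift into uint8
  }
  return s;
}
```

The returned scale is what the kernel consumes as the As argument, alongside the per-channel weight scales Bs.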

Code Reference

Source Location

Signature

// Output dequantization and rescaling after int8 GEMM
template <typename scalar_t, bool has_bias, int BLOCK_N>
struct scale_C {
  static inline void apply(
      scalar_t* __restrict__ C,
      const int32_t* __restrict__ Ctmp,
      const int32_t* __restrict__ Bcomp,
      const float* __restrict__ bias,
      float As,
      const float* __restrict__ Bs);
};

// AVX-512 specialization for BFloat16 output
template <bool has_bias, int BLOCK_N>
struct scale_C<at::BFloat16, has_bias, BLOCK_N> {
  static inline void apply(
      at::BFloat16* __restrict__ C,
      const int32_t* __restrict__ Ctmp,
      const int32_t* __restrict__ Bcomp,
      const float* __restrict__ bias,
      float As,
      const float* __restrict__ Bs);
};

// Core INT8 GEMM micro-kernel
template <typename scalar_t, bool has_bias, int BLOCK_M, int BLOCK_N>
struct tinygemm_kernel_nn {
  static inline void apply(
      const uint8_t* __restrict__ A,
      const int8_t* __restrict__ B,
      scalar_t* __restrict__ C,
      const float* __restrict__ As,
      const float* __restrict__ Bs,
      const int32_t* __restrict__ Bcomp,
      const float* __restrict__ bias,
      int64_t K, int64_t lda, int64_t ldb, int64_t ldc);
};

Import

#include "common.h"
#include "gemm.h"
#include "vec.h"

I/O Contract

Inputs

Name Type Required Description
A uint8_t* Yes Quantized activation matrix in uint8 format
B int8_t* Yes VNNI-packed int8 weight matrix [K/4, N, 4] with compensation
As float Yes Per-token activation quantization scale
Bs float* Yes Per-channel weight quantization scales, length N
Bcomp int32_t* Yes S8S8 compensation values for signed-unsigned correction, length N
bias float* No Optional bias vector, length N
K int64_t Yes Shared dimension (input channels)
lda int64_t Yes Leading dimension of activation matrix
ldb int64_t Yes Leading dimension of weight matrix
ldc int64_t Yes Leading dimension of output matrix

Outputs

Name Type Description
C scalar_t* Output matrix, shape [BLOCK_M, BLOCK_N], in the scalar_t output type (BFloat16 in the AVX-512 specialization), rescaled with s8s8 compensation and optional bias applied

Usage Examples

INT8 GEMM with Scale and Compensation

// Perform W8A8 GEMM with s8s8 compensation
tinygemm_kernel_nn<at::BFloat16, /*has_bias=*/true, /*BLOCK_M=*/4, /*BLOCK_N=*/32>::apply(
    quant_activations,   // A: uint8 [M, K]
    packed_weights,      // B: int8 VNNI [K/4, N, 4]
    output,              // C: BF16 [M, N]
    &act_scale,          // As: per-token activation scale
    weight_scales,       // Bs: per-channel weight scales
    compensation,        // Bcomp: s8s8 correction values
    bias_ptr,            // bias: optional bias vector
    K, lda, ldb, ldc);

Output Rescaling Only

// Apply scale and compensation to raw int32 GEMM output
scale_C<at::BFloat16, /*has_bias=*/false, /*BLOCK_N=*/32>::apply(
    bf16_output,       // C: BF16 output
    int32_accum,       // Ctmp: raw int32 accumulation
    compensation,      // Bcomp: s8s8 compensation
    nullptr,           // bias: not used
    act_scale,         // As: activation scale
    weight_scales);    // Bs: per-channel weight scales
