
Implementation:Vllm project Vllm SGL MoE INT8

From Leeroopedia


Knowledge Sources
Domains: Quantization, Mixture of Experts
Last Updated: 2026-02-08 00:00 GMT

Overview

Implements INT8 weight-and-activation (w8a8) quantized Mixture-of-Experts kernels for CPU inference, using AVX512-VNNI (Vector Neural Network Instructions) dot-product instructions for high INT8 GEMM throughput.

Description

This file provides the highest-performance MoE variant for CPU, combining dynamic activation quantization with INT8 VNNI GEMM and zero-point compensation. The tinygemm_kernel_vnni struct implements the core VNNI dot-product accumulation using _mm512_dpbusd_epi32 intrinsics for fused w13 (gate+up) projections, while tinygemm_kernel_vnni2 handles the w2 (down) projection. The main entry point fused_experts_int8_kernel_impl orchestrates the two-stage GEMM pipeline with SiLU activation fusion, gating weight application, and top-k expert accumulation.
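The arithmetic behind the VNNI path can be illustrated with a scalar reference. `_mm512_dpbusd_epi32` multiplies unsigned bytes by signed bytes, so (under the usual u8·s8 scheme; the exact packing in the real kernel may differ) activations are dynamically quantized and shifted by +128 into uint8, and a precomputed per-channel compensation term `Bcomp[n] = 128 * sum_k B[k][n]` undoes the shift after accumulation. The sketch below uses hypothetical names and computes one output element the way a single VNNI lane would, minus the vectorization:

```cpp
#include <cstdint>

// Scalar reference for one output element of the u8·s8 GEMM with
// zero-point compensation (hypothetical helper; the real kernel runs
// _mm512_dpbusd_epi32 over VNNI-packed BLOCK_M x BLOCK_N tiles).
float int8_dot_with_compensation(const uint8_t* A,  // quantized row, length K
                                 const int8_t* B,   // quantized column, length K
                                 int32_t Bcomp,     // 128 * sum_k B[k]
                                 float As,          // per-row activation scale
                                 float Bs,          // per-channel weight scale
                                 int64_t K) {
  int32_t acc = 0;
  for (int64_t k = 0; k < K; ++k)
    acc += static_cast<int32_t>(A[k]) * static_cast<int32_t>(B[k]);
  // Undo the +128 shift: sum (a+128)*b == sum a*b + 128*sum(b).
  return static_cast<float>(acc - Bcomp) * As * Bs;
}
```

With true int8 activations {1, -2} stored as uint8 {129, 126} and weights {3, 4}, the compensated result recovers 1·3 + (-2)·4 = -5 exactly before scaling.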

Usage

This code is compiled as part of the vLLM CPU SGL-kernels extension. It is invoked when running Mixture-of-Experts models with INT8 quantization (use_int8_w8a8=True) on CPU backends with AVX512-VNNI support.

Code Reference

Source Location

Signature

template <typename scalar_t, int BLOCK_M, int BLOCK_N>
struct tinygemm_kernel_vnni {
  static inline void apply(
      const uint8_t* __restrict__ A,
      const int8_t* __restrict__ B0,
      const int8_t* __restrict__ B1,
      scalar_t* __restrict__ C,
      const float* __restrict__ As,
      const float* __restrict__ Bs0,
      const float* __restrict__ Bs1,
      const int32_t* __restrict__ Bcomp0,
      const int32_t* __restrict__ Bcomp1,
      int64_t K, int64_t lda, int64_t ldb, int64_t ldc);
};

template <typename scalar_t>
void fused_experts_int8_kernel_impl(
    scalar_t* __restrict__ output,
    scalar_t* __restrict__ ic1,
    scalar_t* __restrict__ ic2,
    uint8_t* __restrict__ A_tmp,
    float* __restrict__ C_tmp,
    uint8_t* __restrict__ Aq_tmp,
    float* __restrict__ As_tmp,
    const scalar_t* __restrict__ input,
    const int8_t* __restrict__ packed_w1,
    const int8_t* __restrict__ packed_w2,
    const float* __restrict__ w1s,
    const float* __restrict__ w2s,
    const float* __restrict__ topk_weights,
    const int32_t* __restrict__ sorted_ids,
    const int32_t* __restrict__ expert_ids,
    const int32_t* __restrict__ offsets,
    int64_t M, int64_t N, int64_t K,
    int64_t E, int64_t topk,
    int64_t num_tokens_post_pad);

template <typename scalar_t>
void shared_expert_int8_kernel_impl(
    scalar_t* __restrict__ output,
    scalar_t* __restrict__ ic1,
    float* __restrict__ C_tmp,
    uint8_t* __restrict__ Aq_tmp,
    float* __restrict__ As_tmp,
    const scalar_t* __restrict__ input,
    const int8_t* __restrict__ packed_w1,
    const int8_t* __restrict__ packed_w2,
    const float* __restrict__ w1s,
    const float* __restrict__ w2s,
    const scalar_t* __restrict__ fused_experts_out,
    float routed_scaling_factor,
    int64_t M, int64_t N, int64_t K);

Import

#include "common.h"
#include "vec.h"
#include "gemm.h"

I/O Contract

Inputs

Name Type Required Description
output scalar_t* Yes Pre-allocated output buffer for expert results
input const scalar_t* Yes Hidden states input tensor of shape [M, K]
packed_w1 const int8_t* Yes INT8-quantized gate/up weights in VNNI format for all experts
packed_w2 const int8_t* Yes INT8-quantized down projection weights in VNNI format for all experts
w1s const float* Yes Per-channel dequantization scales for w1
w2s const float* Yes Per-channel dequantization scales for w2
topk_weights const float* Yes Gating weights for top-k expert selection
sorted_ids const int32_t* Yes Token-to-expert mapping sorted by expert
expert_ids const int32_t* Yes Expert IDs for each sorted token group
offsets const int32_t* Yes Offsets into sorted_ids for each expert group
M int64_t Yes Number of tokens
N int64_t Yes Intermediate dimension (half of gate+up)
K int64_t Yes Hidden dimension
E int64_t Yes Number of experts
topk int64_t Yes Number of experts selected per token
num_tokens_post_pad int64_t Yes Padded token count aligned to block size
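The routing inputs above fit together as one metadata structure: tokens are grouped by the expert they route to, each group is padded up to the GEMM block size, and `offsets` delimits each expert's slice of `sorted_ids`. The sketch below builds that layout from per-token top-k assignments; the names mirror the signature, but the padding and packing details are an assumption, not vLLM's exact implementation:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical reconstruction of the routing metadata consumed by
// fused_experts_int8_kernel_impl: sorted_ids holds token*topk slot
// indices grouped by expert (padded with pad_id), expert_ids gives the
// expert for each BLOCK_M-sized group, and offsets[e]..offsets[e+1]
// is expert e's padded slice.
struct Routing {
  std::vector<int32_t> sorted_ids;
  std::vector<int32_t> expert_ids;
  std::vector<int32_t> offsets;  // length E + 1
  int64_t num_tokens_post_pad;
};

Routing build_routing(const std::vector<int32_t>& topk_ids,  // [M * topk]
                      int64_t E, int64_t BLOCK_M, int32_t pad_id) {
  Routing r;
  r.offsets.assign(E + 1, 0);
  std::vector<int64_t> count(E, 0);
  for (int32_t e : topk_ids) count[e]++;
  int64_t total = 0;
  for (int64_t e = 0; e < E; ++e) {
    // Round each expert's token count up to a multiple of BLOCK_M.
    int64_t padded = (count[e] + BLOCK_M - 1) / BLOCK_M * BLOCK_M;
    total += padded;
    r.offsets[e + 1] = static_cast<int32_t>(total);
  }
  r.num_tokens_post_pad = total;
  r.sorted_ids.assign(total, pad_id);
  std::vector<int64_t> cursor(r.offsets.begin(), r.offsets.end() - 1);
  for (int64_t i = 0; i < static_cast<int64_t>(topk_ids.size()); ++i)
    r.sorted_ids[cursor[topk_ids[i]]++] = static_cast<int32_t>(i);
  for (int64_t g = 0; g < total / BLOCK_M; ++g) {
    // Find the expert owning this BLOCK_M group via the offsets.
    int64_t start = g * BLOCK_M, e = 0;
    while (r.offsets[e + 1] <= start) ++e;
    r.expert_ids.push_back(static_cast<int32_t>(e));
  }
  return r;
}
```

This grouping is what lets the kernel run one GEMM per (expert, block) pair over contiguous token slices instead of scattering per-token work across experts.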

Outputs

Name Type Description
output scalar_t* Weighted sum of expert outputs, shape [M, K]
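The epilogue that produces this output can be sketched in scalar float form (a reference sketch, not the fused INT8 path): stage 1 applies SiLU to the gate half of the fused w13 result and multiplies by the up half, and after the w2 projection, each token's top-k expert rows are summed weighted by `topk_weights`:

```cpp
#include <cmath>
#include <cstdint>

// SiLU(x) = x * sigmoid(x).
inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Stage-1 epilogue: gateup holds the fused [gate | up] w13 output of
// width 2N for one row; ic1[n] = silu(gate[n]) * up[n].
void silu_and_mul(const float* gateup, float* ic1, int64_t N) {
  for (int64_t n = 0; n < N; ++n)
    ic1[n] = silu(gateup[n]) * gateup[N + n];
}

// Stage-2 epilogue for one token: weighted sum of the topk per-expert
// down-projection rows, output[k] = sum_j topk_weights[j] * expert_out[j][k].
void accumulate_topk(const float* expert_out,  // [topk, K]
                     const float* topk_weights, float* output,
                     int64_t topk, int64_t K) {
  for (int64_t k = 0; k < K; ++k) output[k] = 0.0f;
  for (int64_t j = 0; j < topk; ++j)
    for (int64_t k = 0; k < K; ++k)
      output[k] += topk_weights[j] * expert_out[j * K + k];
}
```

In the actual kernel these steps are fused into the GEMM tile loops rather than run as separate passes, but the numerics are the same.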

Usage Examples

// Invoked internally through the fused_experts_cpu dispatch path
// with use_int8_w8a8=true.
// Instantiated for BFloat16 and Half:
fused_experts_int8_kernel_impl<at::BFloat16>(
    output_ptr, ic1_ptr, ic2_ptr,
    A_tmp_ptr, C_tmp_ptr, Aq_tmp_ptr, As_tmp_ptr,
    input_ptr, packed_w1_ptr, packed_w2_ptr,
    w1_scales_ptr, w2_scales_ptr,
    topk_weights_ptr, sorted_ids_ptr,
    expert_ids_ptr, offsets_ptr,
    M, N, K, E, topk, num_tokens_post_pad);
