
Implementation:Vllm project Vllm SGL MoE INT8

From Leeroopedia


Knowledge Sources
Domains: Quantization, Mixture of Experts
Last Updated: 2026-02-08 00:00 GMT

Overview

Implements INT8 weight-and-activation (w8a8) quantized Mixture-of-Experts kernels for CPU inference, using AVX512-VNNI (Vector Neural Network Instructions) dot-product instructions for high INT8 GEMM throughput.

Description

This file provides the highest-performance MoE variant for CPU, combining dynamic activation quantization with INT8 VNNI GEMM and zero-point compensation. The tinygemm_kernel_vnni struct implements the core VNNI dot-product accumulation using _mm512_dpbusd_epi32 intrinsics for fused w13 (gate+up) projections, while tinygemm_kernel_vnni2 handles the w2 (down) projection. The main entry point fused_experts_int8_kernel_impl orchestrates the two-stage GEMM pipeline with SiLU activation fusion, gating weight application, and top-k expert accumulation.
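The arithmetic behind the VNNI path can be illustrated with a scalar reference. `_mm512_dpbusd_epi32` multiplies unsigned bytes by signed bytes, so (under the usual u8·s8 scheme; the exact packing in the real kernel may differ) activations are dynamically quantized and shifted by +128 into uint8, and a precomputed per-channel compensation term `Bcomp[n] = 128 * sum_k B[k][n]` undoes the shift after accumulation. The sketch below uses hypothetical names and computes one output element the way a single VNNI lane would, minus the vectorization:

```cpp
#include <cstdint>

// Scalar reference for one output element of the u8·s8 GEMM with
// zero-point compensation (hypothetical helper; the real kernel runs
// _mm512_dpbusd_epi32 over VNNI-packed BLOCK_M x BLOCK_N tiles).
float int8_dot_with_compensation(const uint8_t* A,  // quantized row, length K
                                 const int8_t* B,   // quantized column, length K
                                 int32_t Bcomp,     // 128 * sum_k B[k]
                                 float As,          // per-row activation scale
                                 float Bs,          // per-channel weight scale
                                 int64_t K) {
  int32_t acc = 0;
  for (int64_t k = 0; k < K; ++k)
    acc += static_cast<int32_t>(A[k]) * static_cast<int32_t>(B[k]);
  // Undo the +128 shift: sum (a+128)*b == sum a*b + 128*sum(b).
  return static_cast<float>(acc - Bcomp) * As * Bs;
}
```

With true int8 activations {1, -2} stored as uint8 {129, 126} and weights {3, 4}, the compensated result recovers 1·3 + (-2)·4 = -5 exactly before scaling.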

Usage

This code is compiled as part of the vLLM CPU SGL-kernels extension. It is invoked when running Mixture-of-Experts models with INT8 quantization (use_int8_w8a8=True) on CPU backends with AVX512-VNNI support.

Code Reference

Source Location

Signature

template <typename scalar_t, int BLOCK_M, int BLOCK_N>
struct tinygemm_kernel_vnni {
  static inline void apply(
      const uint8_t* __restrict__ A,
      const int8_t* __restrict__ B0,
      const int8_t* __restrict__ B1,
      scalar_t* __restrict__ C,
      const float* __restrict__ As,
      const float* __restrict__ Bs0,
      const float* __restrict__ Bs1,
      const int32_t* __restrict__ Bcomp0,
      const int32_t* __restrict__ Bcomp1,
      int64_t K, int64_t lda, int64_t ldb, int64_t ldc);
};

template <typename scalar_t>
void fused_experts_int8_kernel_impl(
    scalar_t* __restrict__ output,
    scalar_t* __restrict__ ic1,
    scalar_t* __restrict__ ic2,
    uint8_t* __restrict__ A_tmp,
    float* __restrict__ C_tmp,
    uint8_t* __restrict__ Aq_tmp,
    float* __restrict__ As_tmp,
    const scalar_t* __restrict__ input,
    const int8_t* __restrict__ packed_w1,
    const int8_t* __restrict__ packed_w2,
    const float* __restrict__ w1s,
    const float* __restrict__ w2s,
    const float* __restrict__ topk_weights,
    const int32_t* __restrict__ sorted_ids,
    const int32_t* __restrict__ expert_ids,
    const int32_t* __restrict__ offsets,
    int64_t M, int64_t N, int64_t K,
    int64_t E, int64_t topk,
    int64_t num_tokens_post_pad);

template <typename scalar_t>
void shared_expert_int8_kernel_impl(
    scalar_t* __restrict__ output,
    scalar_t* __restrict__ ic1,
    float* __restrict__ C_tmp,
    uint8_t* __restrict__ Aq_tmp,
    float* __restrict__ As_tmp,
    const scalar_t* __restrict__ input,
    const int8_t* __restrict__ packed_w1,
    const int8_t* __restrict__ packed_w2,
    const float* __restrict__ w1s,
    const float* __restrict__ w2s,
    const scalar_t* __restrict__ fused_experts_out,
    float routed_scaling_factor,
    int64_t M, int64_t N, int64_t K);

Import

#include "common.h"
#include "vec.h"
#include "gemm.h"

I/O Contract

Inputs

Name Type Required Description
output scalar_t* Yes Pre-allocated output buffer for expert results
input const scalar_t* Yes Hidden states input tensor of shape [M, K]
packed_w1 const int8_t* Yes INT8-quantized gate/up weights in VNNI format for all experts
packed_w2 const int8_t* Yes INT8-quantized down projection weights in VNNI format for all experts
w1s const float* Yes Per-channel dequantization scales for w1
w2s const float* Yes Per-channel dequantization scales for w2
topk_weights const float* Yes Gating weights for top-k expert selection
sorted_ids const int32_t* Yes Token-to-expert mapping sorted by expert
expert_ids const int32_t* Yes Expert IDs for each sorted token group
offsets const int32_t* Yes Offsets into sorted_ids for each expert group
M int64_t Yes Number of tokens
N int64_t Yes Intermediate dimension (half of gate+up)
K int64_t Yes Hidden dimension
E int64_t Yes Number of experts
topk int64_t Yes Number of experts selected per token
num_tokens_post_pad int64_t Yes Padded token count aligned to block size
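The routing inputs above fit together as one metadata structure: tokens are grouped by the expert they route to, each group is padded up to the GEMM block size, and `offsets` delimits each expert's slice of `sorted_ids`. The sketch below builds that layout from per-token top-k assignments; the names mirror the signature, but the padding and packing details are an assumption, not vLLM's exact implementation:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical reconstruction of the routing metadata consumed by
// fused_experts_int8_kernel_impl: sorted_ids holds token*topk slot
// indices grouped by expert (padded with pad_id), expert_ids gives the
// expert for each BLOCK_M-sized group, and offsets[e]..offsets[e+1]
// is expert e's padded slice.
struct Routing {
  std::vector<int32_t> sorted_ids;
  std::vector<int32_t> expert_ids;
  std::vector<int32_t> offsets;  // length E + 1
  int64_t num_tokens_post_pad;
};

Routing build_routing(const std::vector<int32_t>& topk_ids,  // [M * topk]
                      int64_t E, int64_t BLOCK_M, int32_t pad_id) {
  Routing r;
  r.offsets.assign(E + 1, 0);
  std::vector<int64_t> count(E, 0);
  for (int32_t e : topk_ids) count[e]++;
  int64_t total = 0;
  for (int64_t e = 0; e < E; ++e) {
    // Round each expert's token count up to a multiple of BLOCK_M.
    int64_t padded = (count[e] + BLOCK_M - 1) / BLOCK_M * BLOCK_M;
    total += padded;
    r.offsets[e + 1] = static_cast<int32_t>(total);
  }
  r.num_tokens_post_pad = total;
  r.sorted_ids.assign(total, pad_id);
  std::vector<int64_t> cursor(r.offsets.begin(), r.offsets.end() - 1);
  for (int64_t i = 0; i < static_cast<int64_t>(topk_ids.size()); ++i)
    r.sorted_ids[cursor[topk_ids[i]]++] = static_cast<int32_t>(i);
  for (int64_t g = 0; g < total / BLOCK_M; ++g) {
    // Find the expert owning this BLOCK_M group via the offsets.
    int64_t start = g * BLOCK_M, e = 0;
    while (r.offsets[e + 1] <= start) ++e;
    r.expert_ids.push_back(static_cast<int32_t>(e));
  }
  return r;
}
```

This grouping is what lets the kernel run one GEMM per (expert, block) pair over contiguous token slices instead of scattering per-token work across experts.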

Outputs

Name Type Description
output scalar_t* Weighted sum of expert outputs, shape [M, K]
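The epilogue that produces this output can be sketched in scalar float form (a reference sketch, not the fused INT8 path): stage 1 applies SiLU to the gate half of the fused w13 result and multiplies by the up half, and after the w2 projection, each token's top-k expert rows are summed weighted by `topk_weights`:

```cpp
#include <cmath>
#include <cstdint>

// SiLU(x) = x * sigmoid(x).
inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Stage-1 epilogue: gateup holds the fused [gate | up] w13 output of
// width 2N for one row; ic1[n] = silu(gate[n]) * up[n].
void silu_and_mul(const float* gateup, float* ic1, int64_t N) {
  for (int64_t n = 0; n < N; ++n)
    ic1[n] = silu(gateup[n]) * gateup[N + n];
}

// Stage-2 epilogue for one token: weighted sum of the topk per-expert
// down-projection rows, output[k] = sum_j topk_weights[j] * expert_out[j][k].
void accumulate_topk(const float* expert_out,  // [topk, K]
                     const float* topk_weights, float* output,
                     int64_t topk, int64_t K) {
  for (int64_t k = 0; k < K; ++k) output[k] = 0.0f;
  for (int64_t j = 0; j < topk; ++j)
    for (int64_t k = 0; k < K; ++k)
      output[k] += topk_weights[j] * expert_out[j * K + k];
}
```

In the actual kernel these steps are fused into the GEMM tile loops rather than run as separate passes, but the numerics are the same.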

Usage Examples

// Invoked internally through the fused_experts_cpu dispatch path
// with use_int8_w8a8=true.
// Instantiated for BFloat16 and Half:
fused_experts_int8_kernel_impl<at::BFloat16>(
    output_ptr, ic1_ptr, ic2_ptr,
    A_tmp_ptr, C_tmp_ptr, Aq_tmp_ptr, As_tmp_ptr,
    input_ptr, packed_w1_ptr, packed_w2_ptr,
    w1_scales_ptr, w2_scales_ptr,
    topk_weights_ptr, sorted_ids_ptr,
    expert_ids_ptr, offsets_ptr,
    M, N, K, E, topk, num_tokens_post_pad);
