
Implementation:Sgl project Sglang CPU MoE INT8

From Leeroopedia


Knowledge Sources
Domains Machine Learning, CPU Kernels, Quantization
Last Updated 2026-02-10 00:00 GMT

Overview

Implements the INT8 W8A8 variant of the fused Mixture-of-Experts kernel for CPU inference: signed int8 weights are paired with dynamically quantized unsigned int8 activations, and the integer GEMMs use AVX-512 VNNI instructions.

Description

This kernel uses tinygemm_kernel_vnni and tinygemm_kernel_vnni2 struct templates for the two GEMM phases of MoE computation, leveraging the AVX-512 VNNI _mm512_dpbusd_epi32 instruction for uint8-int8 dot products with int32 accumulation.
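The semantics of that instruction can be illustrated with a scalar sketch. The helper below (a hypothetical free function, not part of the kernel) computes what `_mm512_dpbusd_epi32` does for a single 32-bit lane; the real instruction performs this for 16 lanes in parallel.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of one lane of _mm512_dpbusd_epi32: four unsigned-8 x
// signed-8 products are summed and accumulated into an int32 lane.
int32_t dpbusd_lane(int32_t acc, const uint8_t a[4], const int8_t b[4]) {
  for (int i = 0; i < 4; ++i)
    acc += static_cast<int32_t>(a[i]) * static_cast<int32_t>(b[i]);
  return acc;
}
```

Note the asymmetry: the first operand is treated as unsigned and the second as signed, which is exactly why the kernel quantizes activations to uint8 and needs s8s8 compensation for the weights (described below).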

The kernel follows the same fused two-GEMM-with-SiLU pattern as the BFloat16 variant:

  • Stage 1: Align and sort tokens by expert (via moe_align_block_size).
  • Stage 2: For each expert block, compute GEMM1 (gate+up projection) with fused SiLU activation, then GEMM2 (down projection) with scaled output accumulation.
  • Stage 3: Sum across top-k expert outputs per token.
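The fused SiLU in Stage 2 can be sketched as follows. The 2N-wide GEMM1 result holds the gate half in columns [0, N) and the up half in columns [N, 2N); the epilogue collapses it to an N-wide intermediate. The function name is hypothetical, chosen for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of the fused SiLU epilogue after GEMM1: for each row, the
// gate+up result of width 2N collapses to width N via silu(gate) * up.
std::vector<float> silu_and_mul(const std::vector<float>& gate_up, size_t N) {
  std::vector<float> out(N);
  for (size_t i = 0; i < N; ++i) {
    float g = gate_up[i];      // gate half: columns [0, N)
    float u = gate_up[N + i];  // up half:   columns [N, 2N)
    out[i] = (g / (1.0f + std::exp(-g))) * u;  // SiLU(g) * u
  }
  return out;
}
```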

Key implementation details:

  • Dynamic activation quantization converts BFloat16 activations to uint8 with per-token scales on-the-fly before each GEMM.
  • s8s8 compensation from packed weights handles the signed-unsigned arithmetic mismatch inherent in VNNI's unsigned-signed multiply.
  • The copy_stub<uint8_t> specialization uses std::memcpy, since quantized buffer sizes may not be a full vector multiple (e.g. 64x + 32) and the vectorized copy path cannot assume that alignment.
  • Per-channel weight scales and per-token activation scales are applied during dequantization after the integer GEMM.
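The first two bullets can be sketched together. Below is a hedged model of per-token dynamic quantization (the function name and the asymmetric scheme with a fixed zero point of 128 are assumptions for illustration; the kernel's exact scheme may differ in detail), followed by a comment explaining where the s8s8 compensation term comes from.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantRow {
  std::vector<uint8_t> q;  // quantized activations
  float scale;             // per-token dequantization scale
};

// Sketch: map one row of float activations to uint8 with a single
// per-token scale, shifting by 128 to reach the unsigned range.
QuantRow quantize_row_u8(const std::vector<float>& x) {
  float amax = 0.f;
  for (float v : x) amax = std::max(amax, std::fabs(v));
  float scale = amax > 0.f ? amax / 127.f : 1.f;
  QuantRow out{std::vector<uint8_t>(x.size()), scale};
  for (size_t i = 0; i < x.size(); ++i) {
    int q = static_cast<int>(std::lround(x[i] / scale)) + 128;
    out.q[i] = static_cast<uint8_t>(std::clamp(q, 0, 255));
  }
  return out;
}

// The +128 shift is why s8s8 compensation exists: for int8 weights w,
//   sum(u8 * w) = sum((s8 + 128) * w) = sum(s8 * w) + 128 * sum(w),
// so packing can precompute comp[n] = 128 * sum_k(w[n][k]) per output
// channel and the kernel subtracts it from the int32 accumulator.
```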

Template instantiations are provided for at::BFloat16 and at::Half.

Usage

Use this kernel for production MoE deployments on CPU where INT8 W8A8 quantization provides the best accuracy-performance tradeoff. Compared to BFloat16, it offers roughly 2x memory savings and higher compute throughput through integer arithmetic, while generally retaining better accuracy than INT4 quantization.

Code Reference

Source Location

Signature

template <typename scalar_t>
void fused_experts_int8_kernel_impl(
    scalar_t* __restrict__ output,
    scalar_t* __restrict__ ic1,
    float* __restrict__ C_tmp,
    uint8_t* __restrict__ Aq_tmp,
    float* __restrict__ As_tmp,
    const scalar_t* __restrict__ input,
    const int8_t* __restrict__ packed_w1,
    const int8_t* __restrict__ packed_w2,
    const float* __restrict__ w1s,
    const float* __restrict__ w2s,
    const float* __restrict__ topk_weights,
    const int32_t* __restrict__ sorted_ids,
    const int32_t* __restrict__ expert_ids,
    const int32_t* __restrict__ offsets,
    int64_t M, int64_t N, int64_t K,
    int64_t E, int64_t topk,
    int64_t num_tokens_post_pad);

template <typename scalar_t>
void shared_expert_int8_kernel_impl(
    scalar_t* __restrict__ output,
    scalar_t* __restrict__ ic1,
    float* __restrict__ C_tmp,
    uint8_t* __restrict__ Aq_tmp,
    float* __restrict__ As_tmp,
    const scalar_t* __restrict__ input,
    const int8_t* __restrict__ packed_w1,
    const int8_t* __restrict__ packed_w2,
    const float* __restrict__ w1s,
    const float* __restrict__ w2s,
    const scalar_t* __restrict__ fused_experts_out,
    float routed_scaling_factor,
    int64_t M, int64_t N, int64_t K);

Import

#include "common.h"
#include "gemm.h"
#include "vec.h"

I/O Contract

Inputs

Name Type Required Description
input scalar_t* [M, K] Yes Input hidden states (BFloat16 or Half)
packed_w1 int8_t* [E, 2N, K_packed] Yes Gate+up projection weights in signed int8 VNNI format with s8s8 compensation
packed_w2 int8_t* [E, K, N_packed] Yes Down projection weights in signed int8 VNNI format
w1s float* [E, 2N] Yes Per-channel dequantization scales for w1
w2s float* [E, K] Yes Per-channel dequantization scales for w2
topk_weights float* [M, topk] Yes Routing weights for selected experts per token
sorted_ids int32_t* Yes Token indices sorted by expert assignment
expert_ids int32_t* Yes Expert assignment per sorted block
offsets int32_t* Yes Starting offsets for each M block
M, N, K int64_t Yes Matrix dimensions: tokens, intermediate size, hidden size
E int64_t Yes Number of experts
topk int64_t Yes Number of selected experts per token

Outputs

Name Type Description
output scalar_t* [M, K] MoE output after INT8-quantized expert computation and weighted accumulation
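Putting the contract together, the per-element epilogue after each integer GEMM can be sketched as below (the function name is hypothetical): subtract the precomputed s8s8 compensation from the int32 accumulator, then apply the per-channel weight scale, the per-token activation scale, and, for the final accumulation, the routing weight.

```cpp
#include <cstdint>

// Sketch of the dequantization epilogue applied to one accumulator lane.
float dequant_epilogue(int32_t acc, int32_t comp, float w_scale,
                       float a_scale, float topk_weight) {
  return static_cast<float>(acc - comp) * w_scale * a_scale * topk_weight;
}
```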

Usage Examples

// INT8 MoE is dispatched via the main fused_experts_cpu() entry point
at::Tensor output = fused_experts_cpu(
    hidden_states,              // [M, K] BFloat16
    w1_int8,                    // [E, 2N, K_packed] int8 VNNI
    w2_int8,                    // [E, K, N_packed] int8 VNNI
    topk_weights,               // [M, topk] float32
    topk_ids,                   // [M, topk] int32
    /*inplace=*/false,
    /*moe_comp_method=*/CPUQuantMethod::INT8_W8A8,
    w1_scale,                   // per-channel scales
    w2_scale,
    /*w1_zero=*/std::nullopt,
    /*w2_zero=*/std::nullopt,
    /*block_size=*/std::nullopt,
    /*is_vnni=*/true);
