Implementation: Sgl project Sglang CPU MoE INT8
| Knowledge Sources | Details |
|---|---|
| Domains | Machine Learning, CPU Kernels, Quantization |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Implements the INT8 weight-quantized (W8A8) variant of the fused Mixture-of-Experts kernel for CPU inference, using signed int8 weights with dynamically quantized unsigned int8 activations and AVX-512 VNNI instructions.
Description
This kernel uses the tinygemm_kernel_vnni and tinygemm_kernel_vnni2 struct templates for the two GEMM phases of the MoE computation, leveraging the AVX-512 VNNI _mm512_dpbusd_epi32 instruction for uint8 × int8 dot products with int32 accumulation.
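The per-lane arithmetic of _mm512_dpbusd_epi32 can be modeled in scalar C++. This is a reference sketch only, not the intrinsic itself: each 32-bit lane accumulates four unsigned-byte × signed-byte products.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of one int32 lane of _mm512_dpbusd_epi32:
// acc += a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3],
// where a holds unsigned bytes and b holds signed bytes.
// The real instruction does this for 16 lanes at once.
inline int32_t dpbusd_lane(int32_t acc, const uint8_t a[4], const int8_t b[4]) {
  int32_t sum = 0;
  for (int i = 0; i < 4; ++i) {
    sum += static_cast<int32_t>(a[i]) * static_cast<int32_t>(b[i]);
  }
  return acc + sum;
}
```

The unsigned × signed operand asymmetry of this instruction is what forces the activation side to be quantized to uint8 and motivates the s8s8 compensation described below.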
The kernel follows the same fused two-GEMM-with-SiLU pattern as the BFloat16 variant:
- Stage 1: Align and sort tokens by expert (via moe_align_block_size).
- Stage 2: For each expert block, compute GEMM1 (gate+up projection) with fused SiLU activation, then GEMM2 (down projection) with scaled output accumulation.
- Stage 3: Sum across top-k expert outputs per token.
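The Stage 2 computation per expert can be sketched as a float reference. Quantization, blocking, and vectorization are omitted, and all names here are illustrative, not the kernel's API:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Float reference for the fused two-GEMM-with-SiLU pattern for one token
// of one expert (illustrative sketch, not the kernel's code).
// x: [K], w1: [2N, K] (rows 0..N-1 = gate, rows N..2N-1 = up), w2: [K, N].
std::vector<float> expert_forward_ref(const std::vector<float>& x,
                                      const std::vector<float>& w1,
                                      const std::vector<float>& w2,
                                      int N, int K) {
  // GEMM1 (gate + up projection) with fused SiLU: ic1 = silu(x*Wg) * (x*Wu)
  std::vector<float> ic1(N);
  for (int n = 0; n < N; ++n) {
    float gate = 0.f, up = 0.f;
    for (int k = 0; k < K; ++k) {
      gate += x[k] * w1[n * K + k];
      up   += x[k] * w1[(N + n) * K + k];
    }
    ic1[n] = gate / (1.f + std::exp(-gate)) * up;  // silu(gate) * up
  }
  // GEMM2: down projection back to the hidden size
  std::vector<float> out(K, 0.f);
  for (int k = 0; k < K; ++k)
    for (int n = 0; n < N; ++n)
      out[k] += ic1[n] * w2[k * N + n];
  return out;
}
```

In the actual kernel both GEMMs run on int8 data via the VNNI tinygemm templates, with the SiLU applied after dequantizing the GEMM1 accumulators.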
Key implementation details:
- Dynamic activation quantization converts BFloat16 activations to uint8 with per-token scales on-the-fly before each GEMM.
- s8s8 compensation from packed weights handles the signed-unsigned arithmetic mismatch inherent in VNNI's unsigned-signed multiply.
- The copy_stub<uint8_t> specialization falls back to std::memcpy, since quantized buffers may have non-standard sizes (of the form 64x + 32) rather than full vector-width multiples.
- Per-channel weight scales and per-token activation scales are applied during dequantization after the integer GEMM.
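The dynamic quantization, s8s8 compensation, and scale-based dequantization steps can be modeled together in a scalar sketch. This is illustrative only: the real kernel vectorizes these loops and precomputes the compensation term at weight-packing time.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric per-token quantization of one activation row to int8,
// then shifted by +128 into the uint8 domain VNNI requires.
// Returns the per-token activation scale. (Illustrative sketch.)
float quantize_row(const std::vector<float>& x, std::vector<uint8_t>& q) {
  float amax = 0.f;
  for (float v : x)
    if (std::fabs(v) > amax) amax = std::fabs(v);
  float scale = amax > 0.f ? amax / 127.f : 1.f;
  q.resize(x.size());
  for (size_t i = 0; i < x.size(); ++i) {
    int s8 = (int)std::lround(x[i] / scale);  // int8 value in [-127, 127]
    q[i] = (uint8_t)(s8 + 128);               // shift into uint8 for VNNI
  }
  return scale;
}

// Integer dot product with compensation and dequantization. Because
// a_u8 = a_s8 + 128, sum(a_u8 * w) = sum(a_s8 * w) + 128 * sum(w),
// so we subtract 128 * comp, where comp = sum of the weight row
// (precomputed from the packed weights in the real kernel).
float dequant_dot(const std::vector<uint8_t>& a, const std::vector<int8_t>& w,
                  float a_scale, float w_scale) {
  int32_t acc = 0, comp = 0;
  for (size_t i = 0; i < a.size(); ++i) {
    acc += (int32_t)a[i] * (int32_t)w[i];
    comp += (int32_t)w[i];
  }
  // Dequantize: per-token activation scale x per-channel weight scale
  return (float)(acc - 128 * comp) * a_scale * w_scale;
}
```

The compensated result matches the signed-signed dot product exactly in integer arithmetic; only the quantization of the activations introduces error.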
Template instantiations are provided for at::BFloat16 and at::Half.
Usage
Use this kernel for production MoE deployments on CPU where INT8 W8A8 quantization provides the best accuracy-performance tradeoff. It offers approximately 2x memory savings and improved compute throughput through integer arithmetic compared to BFloat16, and generally better accuracy than INT4 quantization.
Code Reference
Source Location
- Repository: Sgl_project_Sglang
- File: sgl-kernel/csrc/cpu/moe_int8.cpp
- Lines: 1-1068
Signature
template <typename scalar_t>
void fused_experts_int8_kernel_impl(
    scalar_t* __restrict__ output,
    scalar_t* __restrict__ ic1,
    float* __restrict__ C_tmp,
    uint8_t* __restrict__ Aq_tmp,
    float* __restrict__ As_tmp,
    const scalar_t* __restrict__ input,
    const int8_t* __restrict__ packed_w1,
    const int8_t* __restrict__ packed_w2,
    const float* __restrict__ w1s,
    const float* __restrict__ w2s,
    const float* __restrict__ topk_weights,
    const int32_t* __restrict__ sorted_ids,
    const int32_t* __restrict__ expert_ids,
    const int32_t* __restrict__ offsets,
    int64_t M, int64_t N, int64_t K,
    int64_t E, int64_t topk,
    int64_t num_tokens_post_pad);

template <typename scalar_t>
void shared_expert_int8_kernel_impl(
    scalar_t* __restrict__ output,
    scalar_t* __restrict__ ic1,
    float* __restrict__ C_tmp,
    uint8_t* __restrict__ Aq_tmp,
    float* __restrict__ As_tmp,
    const scalar_t* __restrict__ input,
    const int8_t* __restrict__ packed_w1,
    const int8_t* __restrict__ packed_w2,
    const float* __restrict__ w1s,
    const float* __restrict__ w2s,
    const scalar_t* __restrict__ fused_experts_out,
    float routed_scaling_factor,
    int64_t M, int64_t N, int64_t K);
Import
#include "common.h"
#include "gemm.h"
#include "vec.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input | scalar_t* [M, K] | Yes | Input hidden states (BFloat16 or Half) |
| packed_w1 | int8_t* [E, 2N, K_packed] | Yes | Gate+up projection weights in signed int8 VNNI format with s8s8 compensation |
| packed_w2 | int8_t* [E, K, N_packed] | Yes | Down projection weights in signed int8 VNNI format |
| w1s | float* [E, 2N] | Yes | Per-channel dequantization scales for w1 |
| w2s | float* [E, K] | Yes | Per-channel dequantization scales for w2 |
| topk_weights | float* [M, topk] | Yes | Routing weights for selected experts per token |
| sorted_ids | int32_t* | Yes | Token indices sorted by expert assignment |
| expert_ids | int32_t* | Yes | Expert assignment per sorted block |
| offsets | int32_t* | Yes | Starting offsets for each M block |
| M, N, K | int64_t | Yes | Matrix dimensions: tokens, intermediate size, hidden size |
| E | int64_t | Yes | Number of experts |
| topk | int64_t | Yes | Number of selected experts per token |
| num_tokens_post_pad | int64_t | Yes | Total token slots after padding each expert's token group to the block size |
Outputs
| Name | Type | Description |
|---|---|---|
| output | scalar_t* [M, K] | MoE output after INT8-quantized expert computation and weighted accumulation |
Usage Examples
// INT8 MoE is dispatched via the main fused_experts_cpu() entry point
at::Tensor output = fused_experts_cpu(
hidden_states, // [M, K] BFloat16
w1_int8, // [E, 2N, K_packed] int8 VNNI
w2_int8, // [E, K, N_packed] int8 VNNI
topk_weights, // [M, topk] float32
topk_ids, // [M, topk] int32
/*inplace=*/false,
/*moe_comp_method=*/CPUQuantMethod::INT8_W8A8,
w1_scale, // per-channel scales
w2_scale,
/*w1_zero=*/std::nullopt,
/*w2_zero=*/std::nullopt,
/*block_size=*/std::nullopt,
/*is_vnni=*/true);