
Implementation:Sgl project Sglang CPU MoE INT8

From Leeroopedia


Knowledge Sources
Domains Machine Learning, CPU Kernels, Quantization
Last Updated 2026-02-10 00:00 GMT

Overview

Implements the INT8 W8A8 variant of the fused Mixture-of-Experts kernel for CPU inference: signed int8 weights are paired with dynamically quantized unsigned int8 activations, and the integer GEMMs use AVX-512 VNNI instructions.

Description

This kernel uses tinygemm_kernel_vnni and tinygemm_kernel_vnni2 struct templates for the two GEMM phases of MoE computation, leveraging the AVX-512 VNNI _mm512_dpbusd_epi32 instruction for uint8-int8 dot products with int32 accumulation.
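The semantics of that instruction can be illustrated with a scalar sketch. The helper below (a hypothetical free function, not part of the kernel) computes what `_mm512_dpbusd_epi32` does for a single 32-bit lane; the real instruction performs this for 16 lanes in parallel.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of one lane of _mm512_dpbusd_epi32: four unsigned-8 x
// signed-8 products are summed and accumulated into an int32 lane.
int32_t dpbusd_lane(int32_t acc, const uint8_t a[4], const int8_t b[4]) {
  for (int i = 0; i < 4; ++i)
    acc += static_cast<int32_t>(a[i]) * static_cast<int32_t>(b[i]);
  return acc;
}
```

Note the asymmetry: the first operand is treated as unsigned and the second as signed, which is exactly why the kernel quantizes activations to uint8 and needs s8s8 compensation for the weights (described below).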

The kernel follows the same fused two-GEMM-with-SiLU pattern as the BFloat16 variant:

  • Stage 1: Align and sort tokens by expert (via moe_align_block_size).
  • Stage 2: For each expert block, compute GEMM1 (gate+up projection) with fused SiLU activation, then GEMM2 (down projection) with scaled output accumulation.
  • Stage 3: Sum across top-k expert outputs per token.
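The fused SiLU in Stage 2 can be sketched as follows. The 2N-wide GEMM1 result holds the gate half in columns [0, N) and the up half in columns [N, 2N); the epilogue collapses it to an N-wide intermediate. The function name is hypothetical, chosen for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of the fused SiLU epilogue after GEMM1: for each row, the
// gate+up result of width 2N collapses to width N via silu(gate) * up.
std::vector<float> silu_and_mul(const std::vector<float>& gate_up, size_t N) {
  std::vector<float> out(N);
  for (size_t i = 0; i < N; ++i) {
    float g = gate_up[i];      // gate half: columns [0, N)
    float u = gate_up[N + i];  // up half:   columns [N, 2N)
    out[i] = (g / (1.0f + std::exp(-g))) * u;  // SiLU(g) * u
  }
  return out;
}
```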

Key implementation details:

  • Dynamic activation quantization converts BFloat16 activations to uint8 with per-token scales on-the-fly before each GEMM.
  • s8s8 compensation from packed weights handles the signed-unsigned arithmetic mismatch inherent in VNNI's unsigned-signed multiply.
  • The copy_stub<uint8_t> specialization uses std::memcpy, since quantized buffer sizes may not be a full vector multiple (e.g. 64x + 32) and the vectorized copy path cannot assume that alignment.
  • Per-channel weight scales and per-token activation scales are applied during dequantization after the integer GEMM.
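The first two bullets can be sketched together. Below is a hedged model of per-token dynamic quantization (the function name and the asymmetric scheme with a fixed zero point of 128 are assumptions for illustration; the kernel's exact scheme may differ in detail), followed by a comment explaining where the s8s8 compensation term comes from.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantRow {
  std::vector<uint8_t> q;  // quantized activations
  float scale;             // per-token dequantization scale
};

// Sketch: map one row of float activations to uint8 with a single
// per-token scale, shifting by 128 to reach the unsigned range.
QuantRow quantize_row_u8(const std::vector<float>& x) {
  float amax = 0.f;
  for (float v : x) amax = std::max(amax, std::fabs(v));
  float scale = amax > 0.f ? amax / 127.f : 1.f;
  QuantRow out{std::vector<uint8_t>(x.size()), scale};
  for (size_t i = 0; i < x.size(); ++i) {
    int q = static_cast<int>(std::lround(x[i] / scale)) + 128;
    out.q[i] = static_cast<uint8_t>(std::clamp(q, 0, 255));
  }
  return out;
}

// The +128 shift is why s8s8 compensation exists: for int8 weights w,
//   sum(u8 * w) = sum((s8 + 128) * w) = sum(s8 * w) + 128 * sum(w),
// so packing can precompute comp[n] = 128 * sum_k(w[n][k]) per output
// channel and the kernel subtracts it from the int32 accumulator.
```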

Template instantiations are provided for at::BFloat16 and at::Half.

Usage

Use this kernel for production MoE deployments on CPU where INT8 W8A8 quantization provides the best accuracy-performance tradeoff. Compared to BFloat16, it offers roughly 2x memory savings and higher compute throughput through integer arithmetic, while generally retaining better accuracy than INT4 quantization.

Code Reference

Source Location

Signature

template <typename scalar_t>
void fused_experts_int8_kernel_impl(
    scalar_t* __restrict__ output,
    scalar_t* __restrict__ ic1,
    float* __restrict__ C_tmp,
    uint8_t* __restrict__ Aq_tmp,
    float* __restrict__ As_tmp,
    const scalar_t* __restrict__ input,
    const int8_t* __restrict__ packed_w1,
    const int8_t* __restrict__ packed_w2,
    const float* __restrict__ w1s,
    const float* __restrict__ w2s,
    const float* __restrict__ topk_weights,
    const int32_t* __restrict__ sorted_ids,
    const int32_t* __restrict__ expert_ids,
    const int32_t* __restrict__ offsets,
    int64_t M, int64_t N, int64_t K,
    int64_t E, int64_t topk,
    int64_t num_tokens_post_pad);

template <typename scalar_t>
void shared_expert_int8_kernel_impl(
    scalar_t* __restrict__ output,
    scalar_t* __restrict__ ic1,
    float* __restrict__ C_tmp,
    uint8_t* __restrict__ Aq_tmp,
    float* __restrict__ As_tmp,
    const scalar_t* __restrict__ input,
    const int8_t* __restrict__ packed_w1,
    const int8_t* __restrict__ packed_w2,
    const float* __restrict__ w1s,
    const float* __restrict__ w2s,
    const scalar_t* __restrict__ fused_experts_out,
    float routed_scaling_factor,
    int64_t M, int64_t N, int64_t K);

Import

#include "common.h"
#include "gemm.h"
#include "vec.h"

I/O Contract

Inputs

Name Type Required Description
input scalar_t* [M, K] Yes Input hidden states (BFloat16 or Half)
packed_w1 int8_t* [E, 2N, K_packed] Yes Gate+up projection weights in signed int8 VNNI format with s8s8 compensation
packed_w2 int8_t* [E, K, N_packed] Yes Down projection weights in signed int8 VNNI format
w1s float* [E, 2N] Yes Per-channel dequantization scales for w1
w2s float* [E, K] Yes Per-channel dequantization scales for w2
topk_weights float* [M, topk] Yes Routing weights for selected experts per token
sorted_ids int32_t* Yes Token indices sorted by expert assignment
expert_ids int32_t* Yes Expert assignment per sorted block
offsets int32_t* Yes Starting offsets for each M block
M, N, K int64_t Yes Matrix dimensions: tokens, intermediate size, hidden size
E int64_t Yes Number of experts
topk int64_t Yes Number of selected experts per token

Outputs

Name Type Description
output scalar_t* [M, K] MoE output after INT8-quantized expert computation and weighted accumulation
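Putting the contract together, the per-element epilogue after each integer GEMM can be sketched as below (the function name is hypothetical): subtract the precomputed s8s8 compensation from the int32 accumulator, then apply the per-channel weight scale, the per-token activation scale, and, for the final accumulation, the routing weight.

```cpp
#include <cstdint>

// Sketch of the dequantization epilogue applied to one accumulator lane.
float dequant_epilogue(int32_t acc, int32_t comp, float w_scale,
                       float a_scale, float topk_weight) {
  return static_cast<float>(acc - comp) * w_scale * a_scale * topk_weight;
}
```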

Usage Examples

// INT8 MoE is dispatched via the main fused_experts_cpu() entry point
at::Tensor output = fused_experts_cpu(
    hidden_states,              // [M, K] BFloat16
    w1_int8,                    // [E, 2N, K_packed] int8 VNNI
    w2_int8,                    // [E, K, N_packed] int8 VNNI
    topk_weights,               // [M, topk] float32
    topk_ids,                   // [M, topk] int32
    /*inplace=*/false,
    /*moe_comp_method=*/CPUQuantMethod::INT8_W8A8,
    w1_scale,                   // per-channel scales
    w2_scale,
    /*w1_zero=*/std::nullopt,
    /*w2_zero=*/std::nullopt,
    /*block_size=*/std::nullopt,
    /*is_vnni=*/true);
