Implementation: vLLM SGL-Kernels CPU MoE
| Knowledge Sources | Details |
|---|---|
| Domains | CPU_Inference, MoE, GEMM, Quantization |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Implements a fused Mixture-of-Experts (MoE) kernel with AMX acceleration for CPU, combining expert gating, GEMM, SiLU activation, and output accumulation in a single pass.
Description
This file provides the complete CPU-optimized MoE inference pipeline adapted from SGLang. The moe_align_block_size function sorts tokens by expert assignment and pads to block boundaries, producing sorted_ids, expert_ids, and offset arrays for efficient parallel dispatch. The fused_experts_cpu function then executes a two-stage GEMM pipeline: the first GEMM (hidden_states x w1) is fused with SiLU-and-mul activation via silu_and_mul, and the second GEMM (activated x w2) produces the final expert outputs, which are accumulated with top-k weights back into the output tensor.
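The sort-and-pad step can be illustrated with a single-threaded sketch. This is not the kernel's exact layout (the real `moe_align_block_size` is templated on `BLOCK_M` and parallelized across `num_threads` with per-thread counts and cumulative sums); the names `align_block_size`, `AlignResult`, and the use of `numel` as the padding sentinel are assumptions for illustration only.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical single-threaded sketch: group flattened (token, k) routing
// slots by expert, pad each expert's slice up to a BLOCK_M boundary, and
// record which expert owns each block of BLOCK_M rows.
constexpr int BLOCK_M = 4;

struct AlignResult {
  std::vector<int32_t> sorted_ids;  // slot indices, padded with `numel`
  std::vector<int32_t> expert_ids;  // one expert id per BLOCK_M block
};

AlignResult align_block_size(const std::vector<int32_t>& topk_ids,
                             int num_experts) {
  const int numel = static_cast<int>(topk_ids.size());
  // Count slots routed to each expert.
  std::vector<int> cnt(num_experts, 0);
  for (int32_t e : topk_ids) cnt[e]++;
  // Offsets with each expert's slice rounded up to a BLOCK_M multiple.
  std::vector<int> offset(num_experts + 1, 0);
  for (int e = 0; e < num_experts; ++e)
    offset[e + 1] = offset[e] + (cnt[e] + BLOCK_M - 1) / BLOCK_M * BLOCK_M;
  AlignResult r;
  r.sorted_ids.assign(offset[num_experts], numel);  // `numel` marks padding
  r.expert_ids.resize(offset[num_experts] / BLOCK_M);
  for (int e = 0; e < num_experts; ++e)
    for (int b = offset[e] / BLOCK_M; b < offset[e + 1] / BLOCK_M; ++b)
      r.expert_ids[b] = e;
  // Scatter each slot index into its expert's padded slice.
  std::vector<int> cursor(offset.begin(), offset.end() - 1);
  for (int i = 0; i < numel; ++i)
    r.sorted_ids[cursor[topk_ids[i]]++] = i;
  return r;
}
```

With this layout, each block of `BLOCK_M` rows in `sorted_ids` belongs to exactly one expert, so the GEMM dispatch can process blocks independently without per-row expert checks.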
The implementation supports three precision modes: native BF16/FP16, INT8 w8a8 (using VNNI with s8s8 compensation), and FP8 w8a16 (using online dequantization). A separate shared_expert_cpu function handles shared expert computation with an additive routing scaling factor. Helper stubs (copy_stub, copy_mul_stub, sum_stub, add_mul_stub, fill_stub) provide vectorized utility operations for data movement and accumulation.
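The SiLU-and-mul fusion reduces to simple elementwise math: the first GEMM produces a `[gate | up]` pair of N-wide halves (hence the 2N dimension of w1), and the activation multiplies SiLU of the gate half by the up half. The real `silu_and_mul` is templated and vectorized over `BLOCK_N` columns; the scalar reference below (`silu_and_mul_ref` is a hypothetical name) shows only the math.

```cpp
#include <cmath>
#include <cstdint>

// SiLU (a.k.a. swish): x * sigmoid(x).
inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Scalar reference for the fused activation: out[i] = SiLU(gate[i]) * up[i].
// `gate` and `up` are the two N-wide halves of the first GEMM's 2N output.
void silu_and_mul_ref(float* out, const float* gate, const float* up,
                      int64_t n) {
  for (int64_t i = 0; i < n; ++i)
    out[i] = silu(gate[i]) * up[i];
}
```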
Usage
This code is compiled as part of the vLLM SGL-kernels CPU extension. It is the primary MoE execution kernel for CPU inference, invoked for models like Mixtral and DeepSeek that use sparse MoE layers.
Code Reference
Source Location
- Repository: vllm
- File: csrc/cpu/sgl-kernels/moe.cpp
- Lines: 1-1330
Signature
at::Tensor fused_experts_cpu(
at::Tensor& hidden_states,
at::Tensor& w1,
at::Tensor& w2,
at::Tensor& topk_weights,
at::Tensor& topk_ids,
bool inplace,
bool use_int8_w8a8,
bool use_fp8_w8a16,
const std::optional<at::Tensor>& w1_scale,
const std::optional<at::Tensor>& w2_scale,
const std::optional<std::vector<int64_t>> block_size,
const std::optional<at::Tensor>& a1_scale,
const std::optional<at::Tensor>& a2_scale,
bool is_vnni);
at::Tensor shared_expert_cpu(
at::Tensor& hidden_states,
at::Tensor& w1,
at::Tensor& w2,
at::Tensor& fused_experts_out,
double routed_scaling_factor,
bool inplace,
bool use_int8_w8a8,
bool use_fp8_w8a16,
std::optional<at::Tensor>& w1_scale,
std::optional<at::Tensor>& w2_scale,
std::optional<std::vector<int64_t>> block_size,
std::optional<at::Tensor>& a1_scale,
std::optional<at::Tensor>& a2_scale,
bool is_vnni);
template <int BLOCK_M>
int moe_align_block_size(
int32_t* sorted_ids,
int32_t* expert_ids,
int32_t* topk_ids,
int32_t* total_cnts,
int32_t* cumsums,
int32_t* offsets,
int num_experts,
int numel,
int num_threads);
template <typename scalar_t, int BLOCK_N>
inline void silu_and_mul(
scalar_t* output,
const float* input0,
const float* input1,
int64_t m_size,
int64_t N);
Import
#include "common.h"
#include "vec.h"
#include "gemm.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| hidden_states | at::Tensor [M, K] | Yes | Input activations for all tokens |
| w1 | at::Tensor [E, 2N, K] | Yes | Gate-up projection weights for all E experts (2N due to SiLU gating) |
| w2 | at::Tensor [E, K, N] | Yes | Down projection weights for all E experts |
| topk_weights | at::Tensor [M, topk] (float) | Yes | Expert routing weights per token from the gating network |
| topk_ids | at::Tensor [M, topk] (int32) | Yes | Selected expert indices per token from the gating network |
| inplace | bool | Yes | If true, accumulate results directly into hidden_states |
| use_int8_w8a8 | bool | Yes | Enable INT8 weight-and-activation quantization path |
| use_fp8_w8a16 | bool | Yes | Enable FP8 weight-only quantization path |
| w1_scale | at::Tensor | No | Quantization scales for w1 (required when quantization is enabled) |
| w2_scale | at::Tensor | No | Quantization scales for w2 (required when quantization is enabled) |
| block_size | std::vector<int64_t> | No | Block dimensions for blockwise FP8 quantization |
| a1_scale | at::Tensor | No | Quantization scale(s) for the first-GEMM input activations |
| a2_scale | at::Tensor | No | Quantization scale(s) for the second-GEMM input activations |
| is_vnni | bool | Yes | Whether weights are already in VNNI-packed format |
Outputs
| Name | Type | Description |
|---|---|---|
| out_hidden_states | at::Tensor [M, K] | MoE output: weighted sum of expert outputs for each token, same shape as input |
Usage Examples
// Fused MoE with BF16 weights
at::Tensor moe_output = fused_experts_cpu(
hidden_states, // [M, K] BFloat16
w1, // [E, 2N, K] BFloat16
w2, // [E, K, N] BFloat16
topk_weights, // [M, topk] float32
topk_ids, // [M, topk] int32
/*inplace=*/false,
/*use_int8_w8a8=*/false,
/*use_fp8_w8a16=*/false,
/*w1_scale=*/std::nullopt,
/*w2_scale=*/std::nullopt,
/*block_size=*/std::nullopt,
/*a1_scale=*/std::nullopt,
/*a2_scale=*/std::nullopt,
/*is_vnni=*/false);
// Shared expert with routing scaling factor
at::Tensor shared_out = shared_expert_cpu(
hidden_states, // [M, K]
shared_w1, // [2N, K]
shared_w2, // [K, N]
moe_output, // [M, K] from fused_experts
/*routed_scaling_factor=*/1.0,
/*inplace=*/true,
/*use_int8_w8a8=*/false,
/*use_fp8_w8a16=*/false,
/*w1_scale=*/std::nullopt,
/*w2_scale=*/std::nullopt,
/*block_size=*/std::nullopt,
/*a1_scale=*/std::nullopt,
/*a2_scale=*/std::nullopt,
/*is_vnni=*/false);