Implementation: vLLM SGL-Kernels CPU MoE
| Knowledge Sources | Details |
|---|---|
| Domains | CPU_Inference, MoE, GEMM, Quantization |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Implements a fused Mixture-of-Experts (MoE) kernel with AMX acceleration for CPU, combining expert gating, GEMM, SiLU activation, and output accumulation in a single pass.
Description
This file provides the complete CPU-optimized MoE inference pipeline adapted from SGLang. The moe_align_block_size function sorts tokens by expert assignment and pads to block boundaries, producing sorted_ids, expert_ids, and offset arrays for efficient parallel dispatch. The fused_experts_cpu function then executes a two-stage GEMM pipeline: the first GEMM (hidden_states x w1) is fused with SiLU-and-mul activation via silu_and_mul, and the second GEMM (activated x w2) produces the final expert outputs, which are accumulated with top-k weights back into the output tensor.
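The sort-and-pad step can be illustrated with a single-threaded sketch. This is not the kernel's exact layout (the real `moe_align_block_size` is templated on `BLOCK_M` and parallelized across `num_threads` with per-thread counts and cumulative sums); the names `align_block_size`, `AlignResult`, and the use of `numel` as the padding sentinel are assumptions for illustration only.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical single-threaded sketch: group flattened (token, k) routing
// slots by expert, pad each expert's slice up to a BLOCK_M boundary, and
// record which expert owns each block of BLOCK_M rows.
constexpr int BLOCK_M = 4;

struct AlignResult {
  std::vector<int32_t> sorted_ids;  // slot indices, padded with `numel`
  std::vector<int32_t> expert_ids;  // one expert id per BLOCK_M block
};

AlignResult align_block_size(const std::vector<int32_t>& topk_ids,
                             int num_experts) {
  const int numel = static_cast<int>(topk_ids.size());
  // Count slots routed to each expert.
  std::vector<int> cnt(num_experts, 0);
  for (int32_t e : topk_ids) cnt[e]++;
  // Offsets with each expert's slice rounded up to a BLOCK_M multiple.
  std::vector<int> offset(num_experts + 1, 0);
  for (int e = 0; e < num_experts; ++e)
    offset[e + 1] = offset[e] + (cnt[e] + BLOCK_M - 1) / BLOCK_M * BLOCK_M;
  AlignResult r;
  r.sorted_ids.assign(offset[num_experts], numel);  // `numel` marks padding
  r.expert_ids.resize(offset[num_experts] / BLOCK_M);
  for (int e = 0; e < num_experts; ++e)
    for (int b = offset[e] / BLOCK_M; b < offset[e + 1] / BLOCK_M; ++b)
      r.expert_ids[b] = e;
  // Scatter each slot index into its expert's padded slice.
  std::vector<int> cursor(offset.begin(), offset.end() - 1);
  for (int i = 0; i < numel; ++i)
    r.sorted_ids[cursor[topk_ids[i]]++] = i;
  return r;
}
```

With this layout, each block of `BLOCK_M` rows in `sorted_ids` belongs to exactly one expert, so the GEMM dispatch can process blocks independently without per-row expert checks.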
The implementation supports three precision modes: native BF16/FP16, INT8 w8a8 (using VNNI with s8s8 compensation), and FP8 w8a16 (using online dequantization). A separate shared_expert_cpu function handles shared expert computation with an additive routing scaling factor. Helper stubs (copy_stub, copy_mul_stub, sum_stub, add_mul_stub, fill_stub) provide vectorized utility operations for data movement and accumulation.
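The SiLU-and-mul fusion reduces to simple elementwise math: the first GEMM produces a `[gate | up]` pair of N-wide halves (hence the 2N dimension of w1), and the activation multiplies SiLU of the gate half by the up half. The real `silu_and_mul` is templated and vectorized over `BLOCK_N` columns; the scalar reference below (`silu_and_mul_ref` is a hypothetical name) shows only the math.

```cpp
#include <cmath>
#include <cstdint>

// SiLU (a.k.a. swish): x * sigmoid(x).
inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Scalar reference for the fused activation: out[i] = SiLU(gate[i]) * up[i].
// `gate` and `up` are the two N-wide halves of the first GEMM's 2N output.
void silu_and_mul_ref(float* out, const float* gate, const float* up,
                      int64_t n) {
  for (int64_t i = 0; i < n; ++i)
    out[i] = silu(gate[i]) * up[i];
}
```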
Usage
This code is compiled as part of the vLLM SGL-kernels CPU extension. It is the primary MoE execution kernel for CPU inference, invoked for models like Mixtral and DeepSeek that use sparse MoE layers.
Code Reference
Source Location
- Repository: vllm
- File: csrc/cpu/sgl-kernels/moe.cpp
- Lines: 1-1330
Signature
at::Tensor fused_experts_cpu(
at::Tensor& hidden_states,
at::Tensor& w1,
at::Tensor& w2,
at::Tensor& topk_weights,
at::Tensor& topk_ids,
bool inplace,
bool use_int8_w8a8,
bool use_fp8_w8a16,
const std::optional<at::Tensor>& w1_scale,
const std::optional<at::Tensor>& w2_scale,
const std::optional<std::vector<int64_t>> block_size,
const std::optional<at::Tensor>& a1_scale,
const std::optional<at::Tensor>& a2_scale,
bool is_vnni);
at::Tensor shared_expert_cpu(
at::Tensor& hidden_states,
at::Tensor& w1,
at::Tensor& w2,
at::Tensor& fused_experts_out,
double routed_scaling_factor,
bool inplace,
bool use_int8_w8a8,
bool use_fp8_w8a16,
std::optional<at::Tensor>& w1_scale,
std::optional<at::Tensor>& w2_scale,
std::optional<std::vector<int64_t>> block_size,
std::optional<at::Tensor>& a1_scale,
std::optional<at::Tensor>& a2_scale,
bool is_vnni);
template <int BLOCK_M>
int moe_align_block_size(
int32_t* sorted_ids,
int32_t* expert_ids,
int32_t* topk_ids,
int32_t* total_cnts,
int32_t* cumsums,
int32_t* offsets,
int num_experts,
int numel,
int num_threads);
template <typename scalar_t, int BLOCK_N>
inline void silu_and_mul(
scalar_t* output,
const float* input0,
const float* input1,
int64_t m_size,
int64_t N);
Import
#include "common.h"
#include "vec.h"
#include "gemm.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| hidden_states | at::Tensor [M, K] | Yes | Input activations for all tokens |
| w1 | at::Tensor [E, 2N, K] | Yes | Gate-up projection weights for all E experts (2N due to SiLU gating) |
| w2 | at::Tensor [E, K, N] | Yes | Down projection weights for all E experts |
| topk_weights | at::Tensor [M, topk] (float) | Yes | Expert routing weights per token from the gating network |
| topk_ids | at::Tensor [M, topk] (int32) | Yes | Selected expert indices per token from the gating network |
| inplace | bool | Yes | If true, accumulate results directly into hidden_states |
| use_int8_w8a8 | bool | Yes | Enable INT8 weight-and-activation quantization path |
| use_fp8_w8a16 | bool | Yes | Enable FP8 weight-only quantization path |
| w1_scale | at::Tensor | No | Quantization scales for w1 (required when quantization is enabled) |
| w2_scale | at::Tensor | No | Quantization scales for w2 (required when quantization is enabled) |
| block_size | std::vector<int64_t> | No | Block dimensions for blockwise FP8 quantization |
| a1_scale | at::Tensor | No | Quantization scale(s) for the first-GEMM input activations |
| a2_scale | at::Tensor | No | Quantization scale(s) for the second-GEMM input activations |
| is_vnni | bool | Yes | Whether weights are already in VNNI-packed format |
Outputs
| Name | Type | Description |
|---|---|---|
| out_hidden_states | at::Tensor [M, K] | MoE output: weighted sum of expert outputs for each token, same shape as input |
Usage Examples
// Fused MoE with BF16 weights
at::Tensor moe_output = fused_experts_cpu(
hidden_states, // [M, K] BFloat16
w1, // [E, 2N, K] BFloat16
w2, // [E, K, N] BFloat16
topk_weights, // [M, topk] float32
topk_ids, // [M, topk] int32
/*inplace=*/false,
/*use_int8_w8a8=*/false,
/*use_fp8_w8a16=*/false,
/*w1_scale=*/std::nullopt,
/*w2_scale=*/std::nullopt,
/*block_size=*/std::nullopt,
/*a1_scale=*/std::nullopt,
/*a2_scale=*/std::nullopt,
/*is_vnni=*/false);
// Shared expert with routing scaling factor
at::Tensor shared_out = shared_expert_cpu(
hidden_states, // [M, K]
shared_w1, // [2N, K]
shared_w2, // [K, N]
moe_output, // [M, K] from fused_experts
/*routed_scaling_factor=*/1.0,
/*inplace=*/true,
/*use_int8_w8a8=*/false,
/*use_fp8_w8a16=*/false,
/*w1_scale=*/std::nullopt,
/*w2_scale=*/std::nullopt,
/*block_size=*/std::nullopt,
/*a1_scale=*/std::nullopt,
/*a2_scale=*/std::nullopt,
/*is_vnni=*/false);