Implementation: vLLM CPU Fused MoE
| Knowledge Sources | |
|---|---|
| Domains | CPU_Inference, MoE, GEMM |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Implements fused Mixture-of-Experts (MoE) computation on CPU with gated activation functions merged into GEMM operations to reduce memory bandwidth.
Description
This file provides a CPU-optimized fused MoE implementation that combines two GEMM operations (w13 and w2) with SiLU or SwigluOAI gated activations into a single fused pipeline. It dispatches to ISA-specific micro-GEMM kernels (AMX for Intel or generic VEC) and supports BF16/FP16 data types. The implementation uses cache-aware tiling, OpenMP parallelism, and scratchpad memory management to maximize throughput on multi-core CPUs.
Usage
This code is compiled as part of the vLLM CPU backend and is invoked during MoE model inference when running on CPU. It is called via the cpu_fused_moe and prepack_moe_weight torch extension functions from the Python layer.
Code Reference
Source Location
- Repository: vllm
- File: csrc/cpu/cpu_fused_moe.cpp
- Lines: 1-727
Signature
```cpp
void prepack_moe_weight(
    const torch::Tensor& weight,  // [expert_num, output_size, input_size]
    torch::Tensor& packed_weight,
    const std::string& isa);

void cpu_fused_moe(
    torch::Tensor& output,                          // [token_num, output_size_2]
    const torch::Tensor& input,                     // [token_num, input_size_13]
    const torch::Tensor& w13,                       // [expert_num, output_size_13, input_size_13]
    const torch::Tensor& w2,                        // [expert_num, output_size_2, input_size_2]
    const std::optional<torch::Tensor>& w13_bias,   // [expert_num, output_size_13]
    const std::optional<torch::Tensor>& w2_bias,    // [expert_num, output_size_2]
    const torch::Tensor& topk_weights,              // [token_num, k], float32
    const torch::Tensor& topk_id,                   // [token_num, k], int32
    const std::string& act,
    const std::string& isa);
```
Import
```cpp
#include "cpu/cpu_types.hpp"
#include "cpu/utils.hpp"
#include "cpu/micro_gemm/cpu_micro_gemm_vec.hpp"
#include "cpu/cpu_arch_macros.h"
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| output | torch::Tensor | Yes | Output tensor of shape [token_num, output_size_2] |
| input | torch::Tensor | Yes | Input activations of shape [token_num, input_size_13] |
| w13 | torch::Tensor | Yes | Packed gate+up weight tensor of shape [expert_num, output_size_13, input_size_13] |
| w2 | torch::Tensor | Yes | Packed down projection weight tensor of shape [expert_num, output_size_2, input_size_2] |
| w13_bias | std::optional<torch::Tensor> | No | Optional bias for w13 gate+up projection |
| w2_bias | std::optional<torch::Tensor> | No | Optional bias for w2 down projection |
| topk_weights | torch::Tensor | Yes | Top-k expert routing weights of shape [token_num, k], float32 |
| topk_id | torch::Tensor | Yes | Top-k expert indices of shape [token_num, k], int32 |
| act | std::string | Yes | Activation type: "silu" or "swigluoai" |
| isa | std::string | Yes | ISA hint: "amx" or "vec" |
Outputs
| Name | Type | Description |
|---|---|---|
| output | torch::Tensor | In-place result of the fused MoE computation, shape [token_num, output_size_2] |
Usage Examples
```cpp
// Prepack weights into the blocked layout expected by the micro-GEMM kernels
prepack_moe_weight(w13_weight, packed_w13, "amx");
prepack_moe_weight(w2_weight, packed_w2, "amx");

// Run fused MoE inference
cpu_fused_moe(
    output, input, packed_w13, packed_w2,
    w13_bias, w2_bias,
    topk_weights, topk_ids,
    "silu",  // activation type: "silu" or "swigluoai"
    "amx"    // ISA type: "amx" or "vec"
);
```