Implementation: vLLM CPU Fused MoE
| Knowledge Sources | |
|---|---|
| Domains | CPU_Inference, MoE, GEMM |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Implements fused Mixture-of-Experts (MoE) computation on CPU with gated activation functions merged into GEMM operations to reduce memory bandwidth.
Description
This file provides a CPU-optimized fused MoE implementation that combines two GEMM operations (w13 and w2) with SiLU or SwigluOAI gated activations into a single fused pipeline. It dispatches to ISA-specific micro-GEMM kernels (AMX for Intel or generic VEC) and supports BF16/FP16 data types. The implementation uses cache-aware tiling, OpenMP parallelism, and scratchpad memory management to maximize throughput on multi-core CPUs.
Usage
This code is compiled as part of the vLLM CPU backend and is invoked during MoE model inference when running on CPU. It is called via the cpu_fused_moe and prepack_moe_weight torch extension functions from the Python layer.
Code Reference
Source Location
- Repository: vllm
- File: csrc/cpu/cpu_fused_moe.cpp
- Lines: 1-727
Signature
```cpp
void prepack_moe_weight(
    const torch::Tensor& weight,  // [expert_num, output_size, input_size]
    torch::Tensor& packed_weight,
    const std::string& isa);

void cpu_fused_moe(
    torch::Tensor& output,                          // [token_num, output_size_2]
    const torch::Tensor& input,                     // [token_num, input_size_13]
    const torch::Tensor& w13,                       // [expert_num, output_size_13, input_size_13]
    const torch::Tensor& w2,                        // [expert_num, output_size_2, input_size_2]
    const std::optional<torch::Tensor>& w13_bias,   // [expert_num, output_size_13]
    const std::optional<torch::Tensor>& w2_bias,    // [expert_num, output_size_2]
    const torch::Tensor& topk_weights,              // [token_num, k], float32
    const torch::Tensor& topk_id,                   // [token_num, k], int32
    const std::string& act,
    const std::string& isa);
```
Import
```cpp
#include "cpu/cpu_types.hpp"
#include "cpu/utils.hpp"
#include "cpu/micro_gemm/cpu_micro_gemm_vec.hpp"
#include "cpu/cpu_arch_macros.h"
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| output | torch::Tensor | Yes | Output tensor of shape [token_num, output_size_2] |
| input | torch::Tensor | Yes | Input activations of shape [token_num, input_size_13] |
| w13 | torch::Tensor | Yes | Packed gate+up weight tensor of shape [expert_num, output_size_13, input_size_13] |
| w2 | torch::Tensor | Yes | Packed down projection weight tensor of shape [expert_num, output_size_2, input_size_2] |
| w13_bias | std::optional<torch::Tensor> | No | Optional bias for w13 gate+up projection |
| w2_bias | std::optional<torch::Tensor> | No | Optional bias for w2 down projection |
| topk_weights | torch::Tensor | Yes | Top-k expert routing weights of shape [token_num, k], float32 |
| topk_id | torch::Tensor | Yes | Top-k expert indices of shape [token_num, k], int32 |
| act | std::string | Yes | Activation type: "silu" or "swigluoai" |
| isa | std::string | Yes | ISA hint: "amx" or "vec" |
Outputs
| Name | Type | Description |
|---|---|---|
| output | torch::Tensor | In-place result of the fused MoE computation, shape [token_num, output_size_2] |
Usage Examples
```cpp
// Prepack weights into the blocked layout expected by the micro-GEMM kernels
prepack_moe_weight(w13_weight, packed_w13, "amx");
prepack_moe_weight(w2_weight, packed_w2, "amx");

// Run fused MoE inference
cpu_fused_moe(
    output, input, packed_w13, packed_w2,
    w13_bias, w2_bias,
    topk_weights, topk_ids,
    "silu",  // activation type: "silu" or "swigluoai"
    "amx"    // ISA type: "amx" or "vec"
);
```