
Implementation:Vllm project Vllm CPU Fused MoE

From Leeroopedia


Knowledge Sources
Domains CPU_Inference, MoE, GEMM
Last Updated 2026-02-08 00:00 GMT

Overview

Implements fused Mixture-of-Experts (MoE) computation on CPU, with the gated activation functions merged into the GEMM pipeline to reduce memory traffic.

Description

This file provides a CPU-optimized fused MoE implementation that combines the two expert GEMM operations (w13 and w2) and their SiLU or SwigluOAI gated activations into a single fused pipeline. It dispatches to ISA-specific micro-GEMM kernels (AMX on supported Intel CPUs, or a generic vectorized VEC fallback) and supports BF16/FP16 data types. The implementation uses cache-aware tiling, OpenMP parallelism, and scratchpad memory management to maximize throughput on multi-core CPUs.
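The fusion idea can be sketched in plain C++ as a simplified scalar model (not the actual vLLM kernel; names, the row layout of w13, and the scalar loop are illustrative): the gating activation is applied to the w13 GEMM result in the epilogue, so the intermediate gate/up matrices never round-trip through memory.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Simplified scalar sketch of the fused w13 GEMM + SiLU gating epilogue.
// The real kernel tiles this and dispatches to AMX/VEC micro-GEMMs; this
// assumes w13 stores the gate rows first, then the up rows.
std::vector<float> fused_w13_silu(const std::vector<float>& x,    // [in]
                                  const std::vector<float>& w13,  // [2*out, in]
                                  int in, int out) {
    std::vector<float> h(out);
    for (int o = 0; o < out; ++o) {
        float gate = 0.f, up = 0.f;
        for (int i = 0; i < in; ++i) {
            gate += w13[o * in + i] * x[i];          // gate projection
            up   += w13[(out + o) * in + i] * x[i];  // up projection
        }
        // SiLU(gate) * up applied before writeback, so the intermediate
        // gate/up activations never hit main memory.
        h[o] = gate / (1.f + std::exp(-gate)) * up;
    }
    return h;
}
```

The "swigluoai" path follows the same structure with a different gating function applied in the same epilogue position.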

Usage

This code is compiled as part of the vLLM CPU backend and is invoked during MoE model inference when running on CPU. It is called via the cpu_fused_moe and prepack_moe_weight torch extension functions from the Python layer.

Code Reference

Source Location

Signature

void prepack_moe_weight(
    const torch::Tensor& weight,   // [expert_num, output_size, input_size]
    torch::Tensor& packed_weight,
    const std::string& isa);

void cpu_fused_moe(
    torch::Tensor& output,                          // [token_num, output_size_2]
    const torch::Tensor& input,                     // [token_num, input_size_13]
    const torch::Tensor& w13,                       // [expert_num, output_size_13, input_size_13]
    const torch::Tensor& w2,                        // [expert_num, output_size_2, input_size_2]
    const std::optional<torch::Tensor>& w13_bias,   // [expert_num, output_size_13]
    const std::optional<torch::Tensor>& w2_bias,    // [expert_num, output_size_2]
    const torch::Tensor& topk_weights,              // [token_num, k], float32
    const torch::Tensor& topk_id,                   // [token_num, k], int32
    const std::string& act,
    const std::string& isa);

Import

#include "cpu/cpu_types.hpp"
#include "cpu/utils.hpp"
#include "cpu/micro_gemm/cpu_micro_gemm_vec.hpp"
#include "cpu/cpu_arch_macros.h"

I/O Contract

Inputs

Name Type Required Description
output torch::Tensor Yes Output tensor of shape [token_num, output_size_2]
input torch::Tensor Yes Input activations of shape [token_num, input_size_13]
w13 torch::Tensor Yes Packed gate+up weight tensor of shape [expert_num, output_size_13, input_size_13]
w2 torch::Tensor Yes Packed down projection weight tensor of shape [expert_num, output_size_2, input_size_2]
w13_bias std::optional<torch::Tensor> No Optional bias for w13 gate+up projection
w2_bias std::optional<torch::Tensor> No Optional bias for w2 down projection
topk_weights torch::Tensor Yes Top-k expert routing weights of shape [token_num, k], float32
topk_id torch::Tensor Yes Top-k expert indices of shape [token_num, k], int32
act std::string Yes Activation type: "silu" or "swigluoai"
isa std::string Yes ISA hint: "amx" or "vec"

Outputs

Name Type Description
output torch::Tensor In-place result of the fused MoE computation, shape [token_num, output_size_2]
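The routing inputs above drive a weighted combine: each token's final output is the topk_weights-weighted sum of the down-projections from the experts selected in topk_id. A minimal sketch for a single token (expert_out and the function name are illustrative, assuming each expert's output for the token is already computed):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Top-k combine for one token: y = sum_j topk_weights[j] * expert_out[topk_id[j]].
// expert_out[e] holds expert e's down-projection output for this token.
std::vector<float> combine_topk(const std::vector<std::vector<float>>& expert_out,
                                const std::vector<float>& topk_weights,  // [k]
                                const std::vector<int>& topk_id,         // [k]
                                int out_dim) {
    std::vector<float> y(out_dim, 0.f);
    for (size_t j = 0; j < topk_id.size(); ++j)
        for (int o = 0; o < out_dim; ++o)
            y[o] += topk_weights[j] * expert_out[topk_id[j]][o];
    return y;
}
```

This is why topk_weights must be float32 and topk_id int32: they index and scale the per-expert results directly.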

Usage Examples

// Prepack weights for efficient GEMM
prepack_moe_weight(w13_weight, packed_w13, "amx");
prepack_moe_weight(w2_weight, packed_w2, "amx");

// Run fused MoE inference
cpu_fused_moe(
    output, input, packed_w13, packed_w2,
    w13_bias, w2_bias,
    topk_weights, topk_ids,
    "silu",   // activation type
    "amx"     // ISA type
);
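The cache-aware tiling mentioned in the description can be sketched as a blocked GEMM loop nest (block sizes and the scalar inner kernel are illustrative; the real implementation parallelizes the outer tile loops with OpenMP and replaces the inner loops with an AMX or VEC micro-GEMM):

```cpp
#include <cassert>

// Blocked C += A * B (row-major). Each (BM x BK) A-tile and (BK x BN)
// B-tile stays cache-resident while it is reused; BM/BN/BK = 2 here
// only to keep the example tiny. C must be zero-initialized by the caller.
constexpr int BM = 2, BN = 2, BK = 2;

void tiled_gemm(const float* A, const float* B, float* C, int M, int N, int K) {
    for (int m0 = 0; m0 < M; m0 += BM)
        for (int n0 = 0; n0 < N; n0 += BN)
            for (int k0 = 0; k0 < K; k0 += BK)
                // Inner tile: the real kernel calls a micro-GEMM here.
                for (int m = m0; m < m0 + BM && m < M; ++m)
                    for (int n = n0; n < n0 + BN && n < N; ++n) {
                        float acc = 0.f;
                        for (int k = k0; k < k0 + BK && k < K; ++k)
                            acc += A[m * K + k] * B[k * N + n];
                        C[m * N + n] += acc;  // accumulate across K-tiles
                    }
}
```

Scratchpad buffers hold the intermediate gated activations between the w13 and w2 tile passes, keeping them out of main memory.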
