Implementation:Vllm project Vllm Dynamic 4bit Int MoE CPU
| Knowledge Sources | |
|---|---|
| Domains | MoE, Quantization, CPU_Inference |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Implements dynamic 4-bit integer Mixture-of-Experts (MoE) inference on the CPU, using the dynamically quantized 4-bit matrix multiplication available on AArch64 (ARM64) builds.
Description
This file provides the dynamic_4bit_int_moe_cpu function, which runs the MoE forward pass entirely on the CPU using dynamic 4-bit quantized GEMM operations. It buckets tokens by their top-k expert assignments, runs each expert's W13 and W2 matrix multiplications through the ATen _dyn_quant_matmul_4bit operator (AArch64 only), and applies a SiLU or SwiGLU activation between the two linear layers. Router weights can optionally be applied on either the input or the output side.
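The top-k bucketing step described above can be sketched with plain standard-library containers. This is an illustration only, not the code in the source file: bucket_by_expert and Assignment are hypothetical names, and nested std::vector stands in for the topk_ids tensor.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record: one (token, top-k slot) pair routed to an expert.
struct Assignment {
  int64_t token;
  int64_t slot;
};

// Hypothetical helper: groups every (token, slot) pair by expert id so that
// each expert can later process all of its tokens in one batched matmul.
// topk_ids[t][k] is the expert chosen for token t in slot k.
std::vector<std::vector<Assignment>> bucket_by_expert(
    const std::vector<std::vector<int64_t>>& topk_ids, int64_t num_experts) {
  assert(num_experts > 0);
  std::vector<std::vector<Assignment>> buckets(
      static_cast<std::size_t>(num_experts));
  for (std::size_t t = 0; t < topk_ids.size(); ++t) {
    for (std::size_t k = 0; k < topk_ids[t].size(); ++k) {
      const int64_t e = topk_ids[t][k];
      assert(e >= 0 && e < num_experts);  // expert ids must be in range
      buckets[static_cast<std::size_t>(e)].push_back(
          {static_cast<int64_t>(t), static_cast<int64_t>(k)});
    }
  }
  return buckets;
}
```

Grouping first means the per-expert work is one matmul over a contiguous batch of rows rather than one small matmul per token.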
Usage
Use this when running MoE model inference on AArch64 (ARM64) CPUs without GPU acceleration. It targets edge or server deployments on ARM hardware, where 4-bit quantized weights reduce the memory footprint and the quantized matmul path is optimized for the ARM instruction set.
Code Reference
Source Location
- Repository: vllm
- File: csrc/moe/dynamic_4bit_int_moe_cpu.cpp
- Lines: 1-147
Signature
// Helper: 4-bit quantized matmul (AArch64 only)
inline torch::Tensor mm(const torch::Tensor& a, const torch::Tensor& packed_w,
                        int64_t group_size_eff, int64_t in_features,
                        int64_t out_features);

// Activation enum
enum ActivationKind : int64_t {
  SwiGLU_Gu = 0,  // act = SiLU(g) * u
  SwiGLUOAI = 1,  // act = SiLU(u) * g
  SiLU = 2        // SiLU
};

// Main MoE function
torch::Tensor dynamic_4bit_int_moe_cpu(
    torch::Tensor x, torch::Tensor topk_ids, torch::Tensor topk_weights,
    torch::Tensor w13_packed, torch::Tensor w2_packed, int64_t H, int64_t I,
    int64_t I2, int64_t group_size, bool apply_router_weight_on_input,
    int64_t activation_kind);
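The two gated formulas in the enum comments differ only in which operand passes through SiLU. A scalar sketch follows, assuming g and u denote the gate and up values for one output element. The helper names silu and gated_act are hypothetical, and how the source lays out g and u inside the W13 output is not shown here.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

enum ActivationKind : int64_t {
  SwiGLU_Gu = 0,  // act = SiLU(g) * u
  SwiGLUOAI = 1,  // act = SiLU(u) * g
  SiLU = 2        // SiLU
};

// SiLU(v) = v * sigmoid(v) = v / (1 + exp(-v)).
inline float silu(float v) { return v / (1.0f + std::exp(-v)); }

// Hypothetical helper covering only the two gated kinds, following the
// formulas in the enum comments. Plain SiLU is available through silu().
inline float gated_act(float g, float u, int64_t kind) {
  assert(kind == SwiGLU_Gu || kind == SwiGLUOAI);
  return (kind == SwiGLU_Gu) ? silu(g) * u : silu(u) * g;
}
```

Swapping the arguments of one kind reproduces the other: gated_act(g, u, SwiGLU_Gu) equals gated_act(u, g, SwiGLUOAI).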
Import
#include <ATen/ATen.h>
#include <ATen/Parallel.h>
#include <torch/all.h>
// AArch64 only:
#include <ATen/ops/_dyn_quant_matmul_4bit.h>
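Because the quantized matmul header is AArch64 only, code that depends on it needs an architecture guard. One way to express such a guard is sketched below. The names kHasDynQuant4bit and require_aarch64 are hypothetical and this is not necessarily how the source file handles other architectures.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical compile-time flag. __aarch64__ is the macro GCC and Clang
// define for 64-bit ARM targets; other compilers may use a different macro.
#if defined(__aarch64__)
constexpr bool kHasDynQuant4bit = true;
#else
constexpr bool kHasDynQuant4bit = false;
#endif

// Hypothetical guard: fails loudly at run time when the build target
// cannot provide the 4-bit quantized matmul path.
inline void require_aarch64(const std::string& op_name) {
  if (!kHasDynQuant4bit) {
    throw std::runtime_error(op_name + " requires an AArch64 (ARM64) build");
  }
}
```

A guard like this lets the translation unit compile on every platform while still reporting a clear error if the function is called where the operator is unavailable.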
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| x | torch::Tensor [T, H] | Yes | Input token hidden states (2D) |
| topk_ids | torch::Tensor [T, K] | Yes | Expert indices per token from the router |
| topk_weights | torch::Tensor [T, K] | Yes | Router gating weights per token |
| w13_packed | torch::Tensor [E, ...] | Yes | Packed 4-bit weights for the W1/W3 (gate/up) projection per expert |
| w2_packed | torch::Tensor [E, ...] | Yes | Packed 4-bit weights for the W2 (down) projection per expert |
| H | int64_t | Yes | Hidden dimension size |
| I | int64_t | Yes | Intermediate dimension size |
| I2 | int64_t | Yes | Must equal 2*I (combined gate+up dimension) |
| group_size | int64_t | Yes | Quantization group size (-1 for per-tensor) |
| apply_router_weight_on_input | bool | Yes | Whether to scale input (true) or output (false) by router weights |
| activation_kind | int64_t | Yes | Activation type: 0=SwiGLU_Gu, 1=SwiGLUOAI, 2=SiLU |
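The constraints in the table can be written down as a small consistency check. The sketch below works on plain integers rather than tensors. MoeArgs and check_contract are hypothetical names, and the checks mirror the table above rather than the validation actually performed in the source. Treating any positive group_size or -1 as valid is an assumption.

```cpp
#include <cstdint>
#include <string>

// Hypothetical shape summary of the arguments listed in the table.
struct MoeArgs {
  int64_t x_rows, x_cols;              // x is [T, H]
  int64_t ids_rows, ids_cols;          // topk_ids is [T, K]
  int64_t weights_rows, weights_cols;  // topk_weights is [T, K]
  int64_t H, I, I2, group_size, activation_kind;
};

// Returns an empty string when the arguments agree with the table,
// otherwise a short description of the first mismatch found.
std::string check_contract(const MoeArgs& a) {
  if (a.x_cols != a.H) return "x must have H columns";
  if (a.ids_rows != a.x_rows) return "topk_ids must have one row per token";
  if (a.weights_rows != a.ids_rows || a.weights_cols != a.ids_cols)
    return "topk_weights must match the shape of topk_ids";
  if (a.I2 != 2 * a.I) return "I2 must equal 2*I";
  // Assumption: a positive group size or the sentinel -1 are the valid values.
  if (a.group_size != -1 && a.group_size <= 0)
    return "group_size must be positive or -1";
  if (a.activation_kind < 0 || a.activation_kind > 2)
    return "activation_kind must be 0, 1, or 2";
  return "";
}
```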
Outputs
| Name | Type | Description |
|---|---|---|
| out | torch::Tensor [T, H] | Aggregated MoE output with expert contributions summed per token |
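How the per-token sum is formed, and what apply_router_weight_on_input changes, can be sketched with toy experts. This is a minimal illustration, not the source implementation: moe_combine, Vec, and Expert are hypothetical names, and each expert is modeled as an arbitrary function standing in for the W13, activation, W2 pipeline.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using Vec = std::vector<float>;
// Stand-in for one expert's full pipeline (hypothetical).
using Expert = std::function<Vec(const Vec&)>;

// Sums the contributions of each token's top-k experts. The router weight
// scales either the expert's input or its output, depending on the flag.
std::vector<Vec> moe_combine(const std::vector<Vec>& x,
                             const std::vector<std::vector<int64_t>>& topk_ids,
                             const std::vector<std::vector<float>>& topk_weights,
                             const std::vector<Expert>& experts,
                             bool apply_router_weight_on_input) {
  assert(topk_ids.size() == x.size() && topk_weights.size() == x.size());
  std::vector<Vec> out(x.size());
  for (std::size_t t = 0; t < x.size(); ++t) {
    out[t].assign(x[t].size(), 0.0f);
    assert(topk_weights[t].size() == topk_ids[t].size());
    for (std::size_t k = 0; k < topk_ids[t].size(); ++k) {
      const int64_t e = topk_ids[t][k];
      assert(e >= 0 && static_cast<std::size_t>(e) < experts.size());
      const float w = topk_weights[t][k];
      Vec in = x[t];
      if (apply_router_weight_on_input) {
        for (float& v : in) v *= w;  // scale before the expert runs
      }
      const Vec y = experts[static_cast<std::size_t>(e)](in);
      assert(y.size() == out[t].size());
      for (std::size_t h = 0; h < y.size(); ++h) {
        // Scale after the expert runs unless the input was already scaled.
        out[t][h] += apply_router_weight_on_input ? y[h] : w * y[h];
      }
    }
  }
  return out;
}
```

Because the experts are nonlinear, the two settings generally give different results, so the flag should match how the model was trained.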
Usage Examples
// Direct C++ call; from Python this is normally reached through the torch extension binding.
// x: [T, H] = [batch_size, hidden_dim]; topk_ids and topk_weights: [T, K] = [batch_size, top_k]
auto result = dynamic_4bit_int_moe_cpu(
    x,             // input tensor [T, H]
    topk_ids,      // expert assignments [T, K]
    topk_weights,  // gating weights [T, K]
    w13_packed,    // packed 4-bit gate+up weights [E, ...]
    w2_packed,     // packed 4-bit down weights [E, ...]
    /*H=*/4096,
    /*I=*/11008,
    /*I2=*/22016,  // 2 * I
    /*group_size=*/128,
    /*apply_router_weight_on_input=*/false,
    /*activation_kind=*/0  // SwiGLU_Gu
);