Implementation:Vllm project Vllm Dynamic 4bit Int MoE CPU

Knowledge Sources	vllm
Domains	MoE, Quantization, CPU_Inference
Last Updated	2026-02-08 00:00 GMT

Overview

Implements dynamic 4-bit integer Mixture-of-Experts (MoE) inference on AArch64 (ARM64) CPUs, built on ATen's dynamically quantized 4-bit matrix multiplication operator.

Description

This file provides the dynamic_4bit_int_moe_cpu function, which performs the MoE forward pass entirely on CPU using dynamically quantized 4-bit GEMMs. It buckets tokens by expert according to the router's top-k assignments, runs each expert's W13 and W2 matrix multiplications through ATen's _dyn_quant_matmul_4bit operator (included on AArch64 builds only), and applies the selected activation (SwiGLU_Gu, SwiGLUOAI, or SiLU) between the two linear layers. Router weights can be applied either to the expert input or to the expert output, controlled by apply_router_weight_on_input.
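
The top-k bucketing step can be pictured as a counting sort over expert ids. The sketch below is illustrative only and is not taken from the vLLM source: it assumes flat row-major std::vector buffers in place of torch::Tensor, and the names ExpertBuckets and bucket_by_expert are hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical result type (not from vLLM): slots grouped by expert.
struct ExpertBuckets {
  std::vector<int64_t> offsets;   // size E + 1; expert e owns [offsets[e], offsets[e + 1])
  std::vector<int64_t> token_of;  // source token index of each bucketed slot
  std::vector<float> gate_of;     // router weight of each bucketed slot
};

// topk_ids and topk_weights are flattened row-major [T, K] arrays (assumption).
inline ExpertBuckets bucket_by_expert(const std::vector<int64_t>& topk_ids,
                                      const std::vector<float>& topk_weights,
                                      int64_t T, int64_t K, int64_t E) {
  ExpertBuckets b;
  b.offsets.assign(E + 1, 0);

  // Pass 1: count how many (token, k) slots go to each expert.
  for (int64_t i = 0; i < T * K; ++i) {
    const int64_t e = topk_ids[i];
    if (e >= 0 && e < E) ++b.offsets[e + 1];  // ignore out-of-range ids
  }

  // Prefix sum turns per-expert counts into start offsets.
  for (int64_t e = 0; e < E; ++e) b.offsets[e + 1] += b.offsets[e];

  b.token_of.assign(b.offsets[E], 0);
  b.gate_of.assign(b.offsets[E], 0.0f);

  // Pass 2: scatter each slot into its expert's contiguous range.
  std::vector<int64_t> cursor(b.offsets.begin(), b.offsets.end() - 1);
  for (int64_t t = 0; t < T; ++t) {
    for (int64_t k = 0; k < K; ++k) {
      const int64_t e = topk_ids[t * K + k];
      if (e < 0 || e >= E) continue;
      const int64_t pos = cursor[e]++;
      b.token_of[pos] = t;
      b.gate_of[pos] = topk_weights[t * K + k];
    }
  }
  return b;
}
```

Grouping slots this way lets each expert process its tokens as one contiguous batch, so each expert needs only one W13 and one W2 matmul.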

Usage

Use this when running MoE model inference on AArch64 (ARM64) CPUs without GPU acceleration. It targets edge and server deployments on ARM hardware, where 4-bit quantized weights reduce the memory footprint and the AArch64-only quantized GEMM path can be used for performance.

Code Reference

Source Location

Repository: vllm
File: csrc/moe/dynamic_4bit_int_moe_cpu.cpp
Lines: 1-147

Signature

// Helper: 4-bit quantized matmul (AArch64 only)
inline torch::Tensor mm(const torch::Tensor& a, const torch::Tensor& packed_w,
                        int64_t group_size_eff, int64_t in_features,
                        int64_t out_features);

// Activation enum
enum ActivationKind : int64_t {
  SwiGLU_Gu = 0,  // act = SiLU(g) * u
  SwiGLUOAI = 1,  // act = SiLU(u) * g
  SiLU = 2        // SiLU
};

// Main MoE function
torch::Tensor dynamic_4bit_int_moe_cpu(
    torch::Tensor x, torch::Tensor topk_ids, torch::Tensor topk_weights,
    torch::Tensor w13_packed, torch::Tensor w2_packed, int64_t H, int64_t I,
    int64_t I2, int64_t group_size, bool apply_router_weight_on_input,
    int64_t activation_kind);
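
The two gated formulas in the enum comments can be written out on plain floats as follows. This is a minimal sketch, not the kernel's code: it assumes one row of W13 output is laid out as [g | u] with each half of length I, and gated_activation is a hypothetical helper name. The SiLU = 2 case is left out because this page does not say how it consumes the two halves.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// SiLU(v) = v * sigmoid(v)
inline float silu(float v) { return v / (1.0f + std::exp(-v)); }

// Hypothetical helper: y13 holds one row of length 2 * I, assumed to be [g | u].
// gate_on_g = true  -> SiLU(g) * u  (comment on SwiGLU_Gu)
// gate_on_g = false -> SiLU(u) * g  (comment on SwiGLUOAI)
inline std::vector<float> gated_activation(const std::vector<float>& y13,
                                           int64_t I, bool gate_on_g) {
  std::vector<float> act(I);
  for (int64_t i = 0; i < I; ++i) {
    const float g = y13[i];
    const float u = y13[I + i];
    act[i] = gate_on_g ? silu(g) * u : silu(u) * g;
  }
  return act;
}
```

Either way the activation halves the row width from 2*I to I, which is why I2 must equal 2*I before the W2 projection.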

Import

#include <ATen/ATen.h>
#include <ATen/Parallel.h>
#include <torch/all.h>
// AArch64 only:
#include <ATen/ops/_dyn_quant_matmul_4bit.h>

I/O Contract

Inputs

Name	Type	Required	Description
x	torch::Tensor [T, H]	Yes	Input token hidden states (2D)
topk_ids	torch::Tensor [T, K]	Yes	Expert indices per token from the router
topk_weights	torch::Tensor [T, K]	Yes	Router gating weights per token
w13_packed	torch::Tensor [E, ...]	Yes	Packed 4-bit weights for the W1/W3 (gate/up) projection per expert
w2_packed	torch::Tensor [E, ...]	Yes	Packed 4-bit weights for the W2 (down) projection per expert
H	int64_t	Yes	Hidden dimension size
I	int64_t	Yes	Intermediate dimension size
I2	int64_t	Yes	Must equal 2*I (combined gate+up dimension)
group_size	int64_t	Yes	Quantization group size (-1 for per-tensor)
apply_router_weight_on_input	bool	Yes	Whether to scale input (true) or output (false) by router weights
activation_kind	int64_t	Yes	Activation type: 0=SwiGLU_Gu, 1=SwiGLUOAI, 2=SiLU
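
The constraints listed in the table can be collected into a pre-flight check. The sketch below is a hypothetical helper operating on shape vectors rather than tensors. It encodes only what the table states and makes no claim about which checks the real function performs or what messages it emits.

```cpp
#include <cstdint>
#include <string>
#include <vector>

using Shape = std::vector<int64_t>;

// Hypothetical validator: returns an empty string when the arguments are
// consistent with the table above, otherwise a short description of the problem.
inline std::string check_moe_args(const Shape& x, const Shape& topk_ids,
                                  const Shape& topk_weights, int64_t H,
                                  int64_t I, int64_t I2,
                                  int64_t activation_kind) {
  if (x.size() != 2 || x[1] != H) return "x must be [T, H]";
  if (topk_ids.size() != 2 || topk_ids != topk_weights)
    return "topk_ids and topk_weights must both be [T, K]";
  if (topk_ids[0] != x[0]) return "topk tensors must have T rows";
  if (I2 != 2 * I) return "I2 must equal 2*I";
  if (activation_kind < 0 || activation_kind > 2)
    return "activation_kind must be 0, 1, or 2";
  return "";
}
```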

Outputs

Name	Type	Description
out	torch::Tensor [T, H]	Aggregated MoE output with expert contributions summed per token
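
The per-token summation, and the effect of apply_router_weight_on_input on it, can be sketched as a scatter-add. This is an illustration under assumptions, not the vLLM code: buffers are flat row-major vectors, the helper name combine_expert_outputs is hypothetical, and the sketch assumes that when the router weight was already applied on the input side, the output side adds expert results unscaled.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper. y is [S, H]: one expert output row per (token, k) slot.
// token_of[s] is the token that slot s belongs to; gate_of[s] is its router weight.
inline std::vector<float> combine_expert_outputs(
    const std::vector<float>& y, const std::vector<int64_t>& token_of,
    const std::vector<float>& gate_of, int64_t T, int64_t H,
    bool apply_router_weight_on_input) {
  std::vector<float> out(static_cast<std::size_t>(T * H), 0.0f);
  for (std::size_t s = 0; s < token_of.size(); ++s) {
    // If the gate was applied to the input, do not apply it a second time.
    const float w = apply_router_weight_on_input ? 1.0f : gate_of[s];
    const int64_t t = token_of[s];
    for (int64_t h = 0; h < H; ++h) {
      out[t * H + h] += w * y[s * H + h];
    }
  }
  return out;
}
```

With K experts per token, each output row receives exactly K contributions, one per selected expert.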

Usage Examples

// Direct C++ call; Python callers reach this function through the torch extension binding.
// x: [T, H] = [num_tokens, hidden_dim]; topk_ids/topk_weights: [T, K] = [num_tokens, top_k]
auto result = dynamic_4bit_int_moe_cpu(
    x,            // input tensor [T, H]
    topk_ids,     // expert assignments [T, K]
    topk_weights, // gating weights [T, K]
    w13_packed,   // packed 4-bit gate+up weights [E, ...]
    w2_packed,    // packed 4-bit down weights [E, ...]
    /*H=*/4096,
    /*I=*/11008,
    /*I2=*/22016,
    /*group_size=*/128,
    /*apply_router_weight_on_input=*/false,
    /*activation_kind=*/0  // SwiGLU_Gu
);

Related Pages

Environment:Vllm_project_Vllm_AArch64_CPU
