Implementation:Vllm project Vllm Dynamic 4bit Int MoE CPU


Knowledge Sources
Domains: MoE, Quantization, CPU_Inference
Last Updated: 2026-02-08 00:00 GMT

Overview

Implements dynamic 4-bit integer Mixture-of-Experts (MoE) inference on CPU, using a 4-bit dynamically quantized matrix multiplication that is available on AArch64 (ARM64) only.

Description

This file provides the dynamic_4bit_int_moe_cpu function, which runs the MoE forward pass entirely on CPU using dynamically quantized 4-bit GEMMs. It groups tokens by expert via top-k bucketing, runs each expert's W13 (gate/up) and W2 (down) matrix multiplications through the ATen operator _dyn_quant_matmul_4bit (AArch64 only), and applies a SiLU or SwiGLU activation between the two projections. Router weights can be applied either to the expert input or to the expert output, controlled by apply_router_weight_on_input.
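The top-k bucketing step can be sketched in plain C++ as follows. This is a minimal illustration, assuming a flattened row-major topk_ids array of shape [T, K]. The helper name bucket_by_expert, the Assignment struct, and the choice to skip out-of-range expert ids are assumptions made for this sketch and are not taken from the file itself.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One (token, slot) pair routed to an expert (hypothetical type).
struct Assignment {
  int64_t token;  // row index into x
  int64_t slot;   // column index into topk_ids / topk_weights
};

// Group the flattened [T, K] routing table by expert id.
// Hypothetical helper: name and return type are illustrative only.
std::vector<std::vector<Assignment>> bucket_by_expert(
    const std::vector<int64_t>& topk_ids, int64_t T, int64_t K, int64_t E) {
  assert(static_cast<int64_t>(topk_ids.size()) == T * K);
  std::vector<std::vector<Assignment>> buckets(E);
  for (int64_t t = 0; t < T; ++t) {
    for (int64_t k = 0; k < K; ++k) {
      const int64_t e = topk_ids[t * K + k];
      if (e < 0 || e >= E) continue;  // assumption: ignore invalid ids
      buckets[e].push_back({t, k});
    }
  }
  return buckets;
}
```

Each bucket then holds the rows that one expert has to process, so the W13 and W2 matmuls can run once per expert on a contiguous batch instead of once per token.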

Usage

Use this when running MoE model inference on AArch64 (ARM64) CPUs without GPU acceleration. It targets edge and server deployments on ARM hardware, where 4-bit quantized weights reduce the memory footprint and the matmuls can use the AArch64 quantized GEMM path.

Code Reference

Source Location

Signature

// Helper: 4-bit quantized matmul (AArch64 only)
inline torch::Tensor mm(const torch::Tensor& a, const torch::Tensor& packed_w,
                        int64_t group_size_eff, int64_t in_features,
                        int64_t out_features);

// Activation enum
enum ActivationKind : int64_t {
  SwiGLU_Gu = 0,  // act = SiLU(g) * u
  SwiGLUOAI = 1,  // act = SiLU(u) * g
  SiLU = 2        // SiLU
};

// Main MoE function
torch::Tensor dynamic_4bit_int_moe_cpu(
    torch::Tensor x, torch::Tensor topk_ids, torch::Tensor topk_weights,
    torch::Tensor w13_packed, torch::Tensor w2_packed, int64_t H, int64_t I,
    int64_t I2, int64_t group_size, bool apply_router_weight_on_input,
    int64_t activation_kind);
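The three activation kinds can be illustrated on scalars. The sketch below follows only the comments in the enum above. How the gate (g) and up (u) values are laid out in the W13 output, and which operand the plain SiLU variant uses, are assumptions here. The helper names silu and apply_activation are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

enum ActivationKind : int64_t { SwiGLU_Gu = 0, SwiGLUOAI = 1, SiLU = 2 };

// SiLU(v) = v * sigmoid(v)
inline float silu(float v) { return v / (1.0f + std::exp(-v)); }

// g and u are one gate value and one up value from the W13 output.
inline float apply_activation(int64_t kind, float g, float u) {
  switch (kind) {
    case SwiGLU_Gu:
      return silu(g) * u;  // act = SiLU(g) * u
    case SwiGLUOAI:
      return silu(u) * g;  // act = SiLU(u) * g
    case SiLU:
      return silu(g);  // assumption: plain SiLU applied to g
  }
  assert(false && "unknown activation_kind");
  return 0.0f;
}
```

The two SwiGLU variants differ only in which half passes through SiLU, so passing the wrong activation_kind silently swaps the roles of the gate and up projections.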

Import

#include <ATen/ATen.h>
#include <ATen/Parallel.h>
#include <torch/all.h>
// AArch64 only:
#include <ATen/ops/_dyn_quant_matmul_4bit.h>

I/O Contract

Inputs

Name | Type | Required | Description
x | torch::Tensor [T, H] | Yes | Input token hidden states (2D)
topk_ids | torch::Tensor [T, K] | Yes | Expert indices per token from the router
topk_weights | torch::Tensor [T, K] | Yes | Router gating weights per token
w13_packed | torch::Tensor [E, ...] | Yes | Packed 4-bit weights for the W1/W3 (gate/up) projection per expert
w2_packed | torch::Tensor [E, ...] | Yes | Packed 4-bit weights for the W2 (down) projection per expert
H | int64_t | Yes | Hidden dimension size
I | int64_t | Yes | Intermediate dimension size
I2 | int64_t | Yes | Must equal 2*I (combined gate+up dimension)
group_size | int64_t | Yes | Quantization group size (-1 for per-tensor)
apply_router_weight_on_input | bool | Yes | Whether to scale input (true) or output (false) by router weights
activation_kind | int64_t | Yes | Activation type: 0=SwiGLU_Gu, 1=SwiGLUOAI, 2=SiLU
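To make the group_size parameter concrete, the following sketch shows generic group-wise symmetric 4-bit quantization of one weight row, where every group of group_size consecutive values shares one scale. It is only an illustration of what a quantization group is. It does not reproduce the packed layout expected in w13_packed or w2_packed, which this page does not describe. The struct and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Unpacked result: one int4-range value per weight, one scale per group.
struct GroupQuantized {
  std::vector<int8_t> q;      // values in [-8, 7]
  std::vector<float> scales;  // scales[g] covers group g
};

// Generic symmetric group-wise quantization (illustrative, not the op's format).
GroupQuantized quantize_int4_groupwise(const std::vector<float>& w,
                                       int64_t group_size) {
  const int64_t n = static_cast<int64_t>(w.size());
  assert(group_size > 0 && n % group_size == 0);
  GroupQuantized out;
  out.q.resize(n);
  out.scales.resize(n / group_size);
  for (int64_t g = 0; g < n / group_size; ++g) {
    float max_abs = 0.0f;
    for (int64_t i = 0; i < group_size; ++i)
      max_abs = std::max(max_abs, std::fabs(w[g * group_size + i]));
    // Map the largest magnitude in the group to 7; avoid dividing by zero.
    const float scale = max_abs > 0.0f ? max_abs / 7.0f : 1.0f;
    out.scales[g] = scale;
    for (int64_t i = 0; i < group_size; ++i) {
      const float r = std::round(w[g * group_size + i] / scale);
      out.q[g * group_size + i] =
          static_cast<int8_t>(std::max(-8.0f, std::min(7.0f, r)));
    }
  }
  return out;
}
```

A smaller group size gives each scale fewer values to cover, which generally lowers quantization error at the cost of storing more scales.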

Outputs

Name | Type | Description
out | torch::Tensor [T, H] | Aggregated MoE output with expert contributions summed per token
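The per-token aggregation can be sketched as below. This is a simplified model, assuming the per-expert outputs are already available as a flat [T, K, H] array. When apply_router_weight_on_input is true, the sketch assumes the router weight was already applied before the expert ran, so outputs are summed unscaled. The function name combine_expert_outputs and the array layout are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// expert_out[(t * K + k) * H + h]: output of token t's k-th selected expert.
// Returns out[t * H + h], the sum of expert contributions per token.
std::vector<float> combine_expert_outputs(
    const std::vector<float>& expert_out,
    const std::vector<float>& topk_weights, int64_t T, int64_t K, int64_t H,
    bool apply_router_weight_on_input) {
  assert(static_cast<int64_t>(expert_out.size()) == T * K * H);
  assert(static_cast<int64_t>(topk_weights.size()) == T * K);
  std::vector<float> out(T * H, 0.0f);
  for (int64_t t = 0; t < T; ++t) {
    for (int64_t k = 0; k < K; ++k) {
      // Assumption: input-side scaling already happened upstream, so the
      // output side uses a neutral weight of 1 in that mode.
      const float w =
          apply_router_weight_on_input ? 1.0f : topk_weights[t * K + k];
      for (int64_t h = 0; h < H; ++h)
        out[t * H + h] += w * expert_out[(t * K + k) * H + h];
    }
  }
  return out;
}
```

Because the activation between W13 and W2 is nonlinear, scaling the input and scaling the output are not interchangeable, which is why the flag has to match how the model was trained.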

Usage Examples

// C++ call shown here; the function is also called from Python via the
// torch extension binding.
// x: [T, H], topk_ids/topk_weights: [T, K], where T is the number of tokens
auto result = dynamic_4bit_int_moe_cpu(
    x,            // input tensor [T, H]
    topk_ids,     // expert assignments [T, K]
    topk_weights, // gating weights [T, K]
    w13_packed,   // packed 4-bit gate+up weights [E, ...]
    w2_packed,    // packed 4-bit down weights [E, ...]
    /*H=*/4096,
    /*I=*/11008,
    /*I2=*/22016,
    /*group_size=*/128,
    /*apply_router_weight_on_input=*/false,
    /*activation_kind=*/0  // SwiGLU_Gu
);
