Implementation: Sgl_project_Sglang CPU MoE
| Knowledge Sources | Details |
|---|---|
| Domains | Machine Learning, CPU Kernels |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Implements the CPU-optimized fused Mixture-of-Experts (MoE) kernel using Intel AMX, including both moe_align_block_size for token routing and fused_moe for the complete gated expert computation.
Description
The fused MoE kernel performs two GEMM operations (the combined gate/up projection and the down projection) with a SiLU-and-mul activation fused between them. Key optimizations include:
- SiLU activation is fused with the first GEMM output, requiring only 2 intermediate caches instead of 3.
- An offsets array in moe_align_block_size tracks starting offsets for each M block, keeping the silu_and_mul output in sorted order so the second GEMM can load activations contiguously.
- Helper functions include fill_stub, copy_stub, copy_mul_stub (with topk weight scaling), and sum_stub (accumulating topk expert outputs).
- The kernel uses tinygemm_kernel_nn and tinygemm_kernel_nn2 structs for the two GEMM phases, with brgemm support for larger M dimensions.
- Weights are stored in VNNI-packed format for optimal AMX utilization.
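The fused SiLU-and-mul step above can be illustrated with a scalar reference. This is a hypothetical helper for exposition only: the kernel fuses the same math into the first GEMM's epilogue with vector code, and `silu_and_mul_ref` is not a name from the source.

```cpp
#include <cmath>
#include <vector>

// Reference (unfused, scalar) semantics of silu_and_mul: the first GEMM
// produces [gate | up] of width 2N per token; the activation output has
// width N. Illustrative only -- not the kernel's vectorized implementation.
std::vector<float> silu_and_mul_ref(const std::vector<float>& gate_up, size_t N) {
    std::vector<float> out(N);
    for (size_t i = 0; i < N; ++i) {
        float g = gate_up[i];      // gate half
        float u = gate_up[N + i];  // up half
        float silu = g / (1.0f + std::exp(-g));  // SiLU(g) = g * sigmoid(g)
        out[i] = silu * u;
    }
    return out;
}
```

Fusing this into the first GEMM is what removes the need for a third intermediate buffer: the [gate | up] pair never needs to be materialized separately from the activation output.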
Two public API functions are exposed: fused_experts_cpu for the standard multi-expert fused MoE computation, and shared_expert_cpu for the shared expert path that adds a scaled residual from the routed experts output.
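The shared-expert combination step described above can be sketched in scalar form. The function name is illustrative (not the kernel's internal symbol); it assumes, per the description, that the shared expert's MLP output receives a scaled residual from the routed experts' output.

```cpp
#include <cstddef>

// Scalar sketch of the shared_expert_cpu epilogue (illustrative names):
// combine the shared expert's MLP output with the routed experts' output,
// scaled by routed_scaling_factor.
void add_scaled_routed_residual(float* shared_mlp_out,
                                const float* fused_experts_out,
                                double routed_scaling_factor,
                                std::size_t numel) {
    for (std::size_t i = 0; i < numel; ++i)
        shared_mlp_out[i] +=
            static_cast<float>(routed_scaling_factor) * fused_experts_out[i];
}
```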
Usage
Use this kernel for CPU inference of MoE models such as Mixtral, DeepSeek-V2/V3, and Llama4. Call fused_experts_cpu with the hidden states, expert weights, topk routing weights, and topk IDs. The kernel supports BFloat16 and FP16 data types, and dispatches to INT8 or FP8 quantized variants based on the moe_comp_method parameter.
Code Reference
Source Location
- Repository: Sgl_project_Sglang
- File: sgl-kernel/csrc/cpu/moe.cpp
- Lines: 1-1375
Signature
at::Tensor fused_experts_cpu(
at::Tensor& hidden_states,
at::Tensor& w1,
at::Tensor& w2,
at::Tensor& topk_weights,
at::Tensor& topk_ids,
bool inplace,
int64_t moe_comp_method,
const std::optional<at::Tensor>& w1_scale,
const std::optional<at::Tensor>& w2_scale,
const std::optional<at::Tensor>& w1_zero,
const std::optional<at::Tensor>& w2_zero,
const std::optional<std::vector<int64_t>> block_size,
bool is_vnni);
at::Tensor shared_expert_cpu(
at::Tensor& hidden_states,
at::Tensor& w1,
at::Tensor& w2,
at::Tensor& fused_experts_out,
double routed_scaling_factor,
bool inplace,
bool use_int8_w8a8,
bool use_fp8_w8a16,
const std::optional<at::Tensor>& w1_scale,
const std::optional<at::Tensor>& w2_scale,
const std::optional<std::vector<int64_t>> block_size,
bool is_vnni);
// Internal: token routing alignment
int moe_align_block_size(
/* sorted_ids, expert_ids, offsets, topk_ids, ... */);
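The routing-alignment step can be sketched with a counting sort over expert IDs. This is an illustrative reference, not the kernel's code: `align_by_expert` and its struct are invented names, and the real moe_align_block_size additionally pads each expert's segment to the GEMM block size and emits per-block expert IDs.

```cpp
#include <cstdint>
#include <vector>

// Group flattened (token, slot) indices by expert so each M block of the
// first GEMM reads tokens of a single expert, and record per-expert start
// offsets so later stages can index the sorted layout contiguously.
struct AlignedRouting {
    std::vector<int32_t> sorted_ids;  // flattened topk indices, grouped by expert
    std::vector<int32_t> offsets;     // offsets[e] = start of expert e's segment
};

AlignedRouting align_by_expert(const std::vector<int32_t>& topk_ids,
                               int64_t num_experts) {
    AlignedRouting r;
    r.offsets.assign(num_experts + 1, 0);
    for (int32_t e : topk_ids) r.offsets[e + 1]++;  // histogram of expert hits
    for (int64_t e = 0; e < num_experts; ++e)       // exclusive prefix sum
        r.offsets[e + 1] += r.offsets[e];
    r.sorted_ids.resize(topk_ids.size());
    std::vector<int32_t> cursor(r.offsets.begin(), r.offsets.end() - 1);
    for (std::size_t i = 0; i < topk_ids.size(); ++i)
        r.sorted_ids[cursor[topk_ids[i]]++] = static_cast<int32_t>(i);
    return r;
}
```

The offsets array is what lets the silu_and_mul output stay in sorted order: the second GEMM can locate each expert's activations by offset and stream them contiguously.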
Import
#include "common.h"
#include "gemm.h"
#include "vec.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| hidden_states | at::Tensor [M, K] | Yes | Input hidden states with M tokens and K hidden dimensions |
| w1 | at::Tensor [E, 2N, K] | Yes | Gate projection weights for E experts (includes gate and up projections) |
| w2 | at::Tensor [E, K, N] | Yes | Down projection weights for E experts |
| topk_weights | at::Tensor [M, topk] | Yes | Routing weights for selected experts per token (float32) |
| topk_ids | at::Tensor [M, topk] | Yes | Expert indices selected per token (int32) |
| inplace | bool | Yes | Whether to write output in-place to hidden_states |
| moe_comp_method | int64_t | Yes | Computation-method selector; 0 = BF16, with further values selecting the INT8_W8A8, INT4_W4A8, and FP8 variants |
| w1_scale | std::optional<at::Tensor> | No | Quantization scales for w1 (required for quantized methods) |
| w2_scale | std::optional<at::Tensor> | No | Quantization scales for w2 (required for quantized methods) |
| w1_zero | std::optional<at::Tensor> | No | Zero points for w1 (for INT4 quantization) |
| w2_zero | std::optional<at::Tensor> | No | Zero points for w2 (for INT4 quantization) |
| block_size | std::optional<std::vector<int64_t>> | No | Block sizes for block-quantized weights |
| is_vnni | bool | Yes | Whether weights are already in VNNI-packed format |
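When is_vnni is false, weights must first be re-packed into the VNNI layout that AMX BF16 tiles consume. A minimal sketch, assuming a row-major [K, N] logical B operand with even K; `pack_vnni2` is an invented name, not the kernel's packer, which also handles the [E, 2N, K] weight layout and other dtypes.

```cpp
#include <cstdint>
#include <vector>

// Illustrative VNNI-2 re-packing for 16-bit weights: the packed layout is
// [K/2, N, 2], i.e. two consecutive K elements of each column stored
// adjacently, so one tile load feeds a dot-product-accumulate of K pairs.
std::vector<uint16_t> pack_vnni2(const std::vector<uint16_t>& b,
                                 std::size_t K, std::size_t N) {
    std::vector<uint16_t> packed(K * N);
    for (std::size_t k = 0; k < K; k += 2)
        for (std::size_t n = 0; n < N; ++n) {
            packed[(k / 2) * N * 2 + n * 2 + 0] = b[k * N + n];
            packed[(k / 2) * N * 2 + n * 2 + 1] = b[(k + 1) * N + n];
        }
    return packed;
}
```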
Outputs
| Name | Type | Description |
|---|---|---|
| out_hidden_states | at::Tensor [M, K] | MoE output after expert computation, weighted sum, and accumulation |
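The output contract above amounts to a topk-weighted sum of expert outputs per token. A scalar reference under that reading (all names here are illustrative; the expert MLP is abstracted as a callback):

```cpp
#include <cstdint>
#include <functional>
#include <vector>

using Vec = std::vector<float>;

// For one token: out = sum_k topk_weights[k] * expert_{topk_ids[k]}(hidden).
// The kernel realizes this via copy_mul_stub (weight scaling) and sum_stub
// (accumulation over topk expert outputs).
Vec moe_output_ref(const Vec& hidden,
                   const std::vector<int32_t>& topk_ids_row,
                   const Vec& topk_weights_row,
                   const std::function<Vec(int32_t, const Vec&)>& expert_mlp) {
    Vec out(hidden.size(), 0.0f);
    for (std::size_t k = 0; k < topk_ids_row.size(); ++k) {
        Vec y = expert_mlp(topk_ids_row[k], hidden);
        for (std::size_t j = 0; j < out.size(); ++j)
            out[j] += topk_weights_row[k] * y[j];
    }
    return out;
}
```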
Usage Examples
// Standard fused MoE call with BFloat16 weights
at::Tensor output = fused_experts_cpu(
hidden_states, // [M, K]
w1, // [E, 2N, K] gate+up projection
w2, // [E, K, N] down projection
topk_weights, // [M, topk] float32
topk_ids, // [M, topk] int32
/*inplace=*/false,
/*moe_comp_method=*/0,
/*w1_scale=*/std::nullopt,
/*w2_scale=*/std::nullopt,
/*w1_zero=*/std::nullopt,
/*w2_zero=*/std::nullopt,
/*block_size=*/std::nullopt,
/*is_vnni=*/true);
// Shared expert with routed scaling
at::Tensor shared_out = shared_expert_cpu(
hidden_states, w1, w2,
fused_experts_out,
/*routed_scaling_factor=*/1.0,
/*inplace=*/false,
/*use_int8_w8a8=*/false,
/*use_fp8_w8a16=*/false,
    /*w1_scale=*/std::nullopt, /*w2_scale=*/std::nullopt,
    /*block_size=*/std::nullopt, /*is_vnni=*/true);