Implementation: Sgl_project_Sglang CPU MoE
| Knowledge Sources | Details |
|---|---|
| Domains | Machine Learning, CPU Kernels |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Implements the CPU-optimized fused Mixture-of-Experts (MoE) kernel using Intel AMX, including both moe_align_block_size for token routing and fused_moe for the complete gated expert computation.
Description
The fused MoE kernel performs two GEMM operations (the combined gate/up projection and the down projection) with a SiLU-and-mul activation fused between them. Key optimizations include:
- SiLU activation is fused with the first GEMM output, requiring only 2 intermediate caches instead of 3.
- An offsets array in moe_align_block_size tracks starting offsets for each M block, keeping the silu_and_mul output in sorted order so the second GEMM can load activations contiguously.
- Helper functions include fill_stub, copy_stub, copy_mul_stub (with topk weight scaling), and sum_stub (accumulating topk expert outputs).
- The kernel uses tinygemm_kernel_nn and tinygemm_kernel_nn2 structs for the two GEMM phases, with brgemm support for larger M dimensions.
- Weights are stored in VNNI-packed format for optimal AMX utilization.
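The fused SiLU-and-mul step above can be illustrated with a scalar reference. This is a hypothetical helper for exposition only: the kernel fuses the same math into the first GEMM's epilogue with vector code, and `silu_and_mul_ref` is not a name from the source.

```cpp
#include <cmath>
#include <vector>

// Reference (unfused, scalar) semantics of silu_and_mul: the first GEMM
// produces [gate | up] of width 2N per token; the activation output has
// width N. Illustrative only -- not the kernel's vectorized implementation.
std::vector<float> silu_and_mul_ref(const std::vector<float>& gate_up, size_t N) {
    std::vector<float> out(N);
    for (size_t i = 0; i < N; ++i) {
        float g = gate_up[i];      // gate half
        float u = gate_up[N + i];  // up half
        float silu = g / (1.0f + std::exp(-g));  // SiLU(g) = g * sigmoid(g)
        out[i] = silu * u;
    }
    return out;
}
```

Fusing this into the first GEMM is what removes the need for a third intermediate buffer: the [gate | up] pair never needs to be materialized separately from the activation output.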
Two public API functions are exposed: fused_experts_cpu for the standard multi-expert fused MoE computation, and shared_expert_cpu for the shared expert path that adds a scaled residual from the routed experts output.
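The shared-expert combination step described above can be sketched in scalar form. The function name is illustrative (not the kernel's internal symbol); it assumes, per the description, that the shared expert's MLP output receives a scaled residual from the routed experts' output.

```cpp
#include <cstddef>

// Scalar sketch of the shared_expert_cpu epilogue (illustrative names):
// combine the shared expert's MLP output with the routed experts' output,
// scaled by routed_scaling_factor.
void add_scaled_routed_residual(float* shared_mlp_out,
                                const float* fused_experts_out,
                                double routed_scaling_factor,
                                std::size_t numel) {
    for (std::size_t i = 0; i < numel; ++i)
        shared_mlp_out[i] +=
            static_cast<float>(routed_scaling_factor) * fused_experts_out[i];
}
```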
Usage
Use this kernel for CPU inference of MoE models such as Mixtral, DeepSeek-V2/V3, and Llama4. Call fused_experts_cpu with the hidden states, expert weights, topk routing weights, and topk IDs. The kernel supports BFloat16 and FP16 data types, and dispatches to INT8 or FP8 quantized variants based on the moe_comp_method parameter.
Code Reference
Source Location
- Repository: Sgl_project_Sglang
- File: sgl-kernel/csrc/cpu/moe.cpp
- Lines: 1-1375
Signature
at::Tensor fused_experts_cpu(
at::Tensor& hidden_states,
at::Tensor& w1,
at::Tensor& w2,
at::Tensor& topk_weights,
at::Tensor& topk_ids,
bool inplace,
int64_t moe_comp_method,
const std::optional<at::Tensor>& w1_scale,
const std::optional<at::Tensor>& w2_scale,
const std::optional<at::Tensor>& w1_zero,
const std::optional<at::Tensor>& w2_zero,
const std::optional<std::vector<int64_t>> block_size,
bool is_vnni);
at::Tensor shared_expert_cpu(
at::Tensor& hidden_states,
at::Tensor& w1,
at::Tensor& w2,
at::Tensor& fused_experts_out,
double routed_scaling_factor,
bool inplace,
bool use_int8_w8a8,
bool use_fp8_w8a16,
const std::optional<at::Tensor>& w1_scale,
const std::optional<at::Tensor>& w2_scale,
const std::optional<std::vector<int64_t>> block_size,
bool is_vnni);
// Internal: token routing alignment
int moe_align_block_size(
/* sorted_ids, expert_ids, offsets, topk_ids, ... */);
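The routing-alignment step can be sketched with a counting sort over expert IDs. This is an illustrative reference, not the kernel's code: `align_by_expert` and its struct are invented names, and the real moe_align_block_size additionally pads each expert's segment to the GEMM block size and emits per-block expert IDs.

```cpp
#include <cstdint>
#include <vector>

// Group flattened (token, slot) indices by expert so each M block of the
// first GEMM reads tokens of a single expert, and record per-expert start
// offsets so later stages can index the sorted layout contiguously.
struct AlignedRouting {
    std::vector<int32_t> sorted_ids;  // flattened topk indices, grouped by expert
    std::vector<int32_t> offsets;     // offsets[e] = start of expert e's segment
};

AlignedRouting align_by_expert(const std::vector<int32_t>& topk_ids,
                               int64_t num_experts) {
    AlignedRouting r;
    r.offsets.assign(num_experts + 1, 0);
    for (int32_t e : topk_ids) r.offsets[e + 1]++;  // histogram of expert hits
    for (int64_t e = 0; e < num_experts; ++e)       // exclusive prefix sum
        r.offsets[e + 1] += r.offsets[e];
    r.sorted_ids.resize(topk_ids.size());
    std::vector<int32_t> cursor(r.offsets.begin(), r.offsets.end() - 1);
    for (std::size_t i = 0; i < topk_ids.size(); ++i)
        r.sorted_ids[cursor[topk_ids[i]]++] = static_cast<int32_t>(i);
    return r;
}
```

The offsets array is what lets the silu_and_mul output stay in sorted order: the second GEMM can locate each expert's activations by offset and stream them contiguously.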
Import
#include "common.h"
#include "gemm.h"
#include "vec.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| hidden_states | at::Tensor [M, K] | Yes | Input hidden states with M tokens and K hidden dimensions |
| w1 | at::Tensor [E, 2N, K] | Yes | Gate projection weights for E experts (includes gate and up projections) |
| w2 | at::Tensor [E, K, N] | Yes | Down projection weights for E experts |
| topk_weights | at::Tensor [M, topk] | Yes | Routing weights for selected experts per token (float32) |
| topk_ids | at::Tensor [M, topk] | Yes | Expert indices selected per token (int32) |
| inplace | bool | Yes | Whether to write output in-place to hidden_states |
| moe_comp_method | int64_t | Yes | Computation-method selector; 0 = BF16, with further values selecting the INT8_W8A8, INT4_W4A8, and FP8 variants |
| w1_scale | std::optional<at::Tensor> | No | Quantization scales for w1 (required for quantized methods) |
| w2_scale | std::optional<at::Tensor> | No | Quantization scales for w2 (required for quantized methods) |
| w1_zero | std::optional<at::Tensor> | No | Zero points for w1 (for INT4 quantization) |
| w2_zero | std::optional<at::Tensor> | No | Zero points for w2 (for INT4 quantization) |
| block_size | std::optional<std::vector<int64_t>> | No | Block sizes for block-quantized weights |
| is_vnni | bool | Yes | Whether weights are already in VNNI-packed format |
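When is_vnni is false, weights must first be re-packed into the VNNI layout that AMX BF16 tiles consume. A minimal sketch, assuming a row-major [K, N] logical B operand with even K; `pack_vnni2` is an invented name, not the kernel's packer, which also handles the [E, 2N, K] weight layout and other dtypes.

```cpp
#include <cstdint>
#include <vector>

// Illustrative VNNI-2 re-packing for 16-bit weights: the packed layout is
// [K/2, N, 2], i.e. two consecutive K elements of each column stored
// adjacently, so one tile load feeds a dot-product-accumulate of K pairs.
std::vector<uint16_t> pack_vnni2(const std::vector<uint16_t>& b,
                                 std::size_t K, std::size_t N) {
    std::vector<uint16_t> packed(K * N);
    for (std::size_t k = 0; k < K; k += 2)
        for (std::size_t n = 0; n < N; ++n) {
            packed[(k / 2) * N * 2 + n * 2 + 0] = b[k * N + n];
            packed[(k / 2) * N * 2 + n * 2 + 1] = b[(k + 1) * N + n];
        }
    return packed;
}
```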
Outputs
| Name | Type | Description |
|---|---|---|
| out_hidden_states | at::Tensor [M, K] | MoE output after expert computation, weighted sum, and accumulation |
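The output contract above amounts to a topk-weighted sum of expert outputs per token. A scalar reference under that reading (all names here are illustrative; the expert MLP is abstracted as a callback):

```cpp
#include <cstdint>
#include <functional>
#include <vector>

using Vec = std::vector<float>;

// For one token: out = sum_k topk_weights[k] * expert_{topk_ids[k]}(hidden).
// The kernel realizes this via copy_mul_stub (weight scaling) and sum_stub
// (accumulation over topk expert outputs).
Vec moe_output_ref(const Vec& hidden,
                   const std::vector<int32_t>& topk_ids_row,
                   const Vec& topk_weights_row,
                   const std::function<Vec(int32_t, const Vec&)>& expert_mlp) {
    Vec out(hidden.size(), 0.0f);
    for (std::size_t k = 0; k < topk_ids_row.size(); ++k) {
        Vec y = expert_mlp(topk_ids_row[k], hidden);
        for (std::size_t j = 0; j < out.size(); ++j)
            out[j] += topk_weights_row[k] * y[j];
    }
    return out;
}
```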
Usage Examples
// Standard fused MoE call with BFloat16 weights
at::Tensor output = fused_experts_cpu(
hidden_states, // [M, K]
w1, // [E, 2N, K] gate+up projection
w2, // [E, K, N] down projection
topk_weights, // [M, topk] float32
topk_ids, // [M, topk] int32
/*inplace=*/false,
/*moe_comp_method=*/0,
/*w1_scale=*/std::nullopt,
/*w2_scale=*/std::nullopt,
/*w1_zero=*/std::nullopt,
/*w2_zero=*/std::nullopt,
/*block_size=*/std::nullopt,
/*is_vnni=*/true);
// Shared expert with routed scaling
at::Tensor shared_out = shared_expert_cpu(
hidden_states, w1, w2,
fused_experts_out,
/*routed_scaling_factor=*/1.0,
/*inplace=*/false,
/*use_int8_w8a8=*/false,
/*use_fp8_w8a16=*/false,
    /*w1_scale=*/std::nullopt, /*w2_scale=*/std::nullopt,
    /*block_size=*/std::nullopt, /*is_vnni=*/true);