
Implementation:Sgl project Sglang CPU MoE

From Leeroopedia


Knowledge Sources
Domains Machine Learning, CPU Kernels
Last Updated 2026-02-10 00:00 GMT

Overview

Implements the CPU-optimized fused Mixture-of-Experts (MoE) kernel using Intel AMX, including both moe_align_block_size for token routing and fused_moe for the complete gated expert computation.

Description

The fused MoE kernel performs two GEMM operations (gate projection and down projection) with SiLU-and-mul activation fused between them. Key optimizations include:

  • SiLU activation is fused with the first GEMM output, requiring only 2 intermediate caches instead of 3.
  • An offsets array in moe_align_block_size tracks starting offsets for each M block, keeping the silu_and_mul output in sorted order so the second GEMM can load activations contiguously.
  • Helper functions include fill_stub, copy_stub, copy_mul_stub (with topk weight scaling), and sum_stub (accumulating topk expert outputs).
  • The kernel uses tinygemm_kernel_nn and tinygemm_kernel_nn2 structs for the two GEMM phases, with brgemm support for larger M dimensions.
  • Weights are stored in VNNI-packed format for optimal AMX utilization.
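The fused SiLU-and-mul step above can be sketched as a scalar reference. This is illustrative only (the function name and shapes are assumptions, and the real kernel operates on the vectorized, sorted intermediate cache with AMX tiles):

```cpp
#include <cmath>
#include <vector>

// Scalar reference for silu_and_mul: the first GEMM produces [gate | up]
// of width 2N per row; the fused activation emits silu(gate) * up (width N).
std::vector<float> silu_and_mul(const std::vector<float>& gate_up, int N) {
    std::vector<float> out(N);
    for (int i = 0; i < N; ++i) {
        float g = gate_up[i];      // gate half
        float u = gate_up[N + i];  // up half
        // SiLU(x) = x * sigmoid(x)
        out[i] = (g / (1.0f + std::exp(-g))) * u;
    }
    return out;
}
```

Fusing this elementwise step into the first GEMM's epilogue is what allows the kernel to keep only two intermediate caches: the [gate | up] result never needs to be materialized separately from the activated output.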

Two public API functions are exposed: fused_experts_cpu for the standard multi-expert fused MoE computation, and shared_expert_cpu for the shared expert path that adds a scaled residual from the routed experts output.
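The token-routing alignment performed by moe_align_block_size can be illustrated with a counting-sort sketch. The names here are hypothetical and the actual kernel additionally pads each expert's range to M-block boundaries; this only shows the grouping-and-offsets idea:

```cpp
#include <cstdint>
#include <vector>

// Group the flattened (token, topk) entries by expert id and record
// per-expert starting offsets, so each expert's GEMM reads a contiguous
// range of sorted activations.
struct AlignResult {
    std::vector<int32_t> sorted_ids;  // flat indices, grouped by expert
    std::vector<int32_t> offsets;     // offsets[e]..offsets[e+1] spans expert e
};

AlignResult align_by_expert(const std::vector<int32_t>& topk_ids,
                            int num_experts) {
    AlignResult r;
    r.offsets.assign(num_experts + 1, 0);
    // Count entries routed to each expert.
    for (int32_t e : topk_ids) r.offsets[e + 1]++;
    // Exclusive prefix sum -> starting offsets per expert.
    for (int e = 0; e < num_experts; ++e) r.offsets[e + 1] += r.offsets[e];
    // Scatter flattened indices into expert-sorted order.
    r.sorted_ids.resize(topk_ids.size());
    std::vector<int32_t> cursor(r.offsets.begin(), r.offsets.end() - 1);
    for (int32_t i = 0; i < (int32_t)topk_ids.size(); ++i)
        r.sorted_ids[cursor[topk_ids[i]]++] = i;
    return r;
}
```

Because the silu_and_mul output is kept in this sorted order, the second GEMM can load each expert's activations as one contiguous block.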

Usage

Use this kernel for CPU inference of MoE models such as Mixtral, DeepSeek-V2/V3, and Llama4. Call fused_experts_cpu with hidden states, expert weights, topk routing weights and IDs. The kernel supports BFloat16 and FP16 data types, and dispatches to INT8 or FP8 quantized variants based on the moe_comp_method parameter.
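The final combination of routed expert outputs (the role played by copy_mul_stub and sum_stub) can be sketched as a naive reference, assuming per-token expert outputs laid out as [M, topk, K]:

```cpp
#include <vector>

// Naive reference: each token's output is the topk-weight-scaled sum of
// its selected experts' down-projection outputs.
std::vector<float> combine_topk(
    const std::vector<float>& expert_out,    // [M, topk, K] flattened
    const std::vector<float>& topk_weights,  // [M, topk] flattened
    int M, int topk, int K) {
    std::vector<float> out(M * K, 0.0f);
    for (int m = 0; m < M; ++m)
        for (int t = 0; t < topk; ++t) {
            float w = topk_weights[m * topk + t];
            for (int k = 0; k < K; ++k)
                out[m * K + k] += w * expert_out[(m * topk + t) * K + k];
        }
    return out;
}
```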

Code Reference

Source Location

Signature

at::Tensor fused_experts_cpu(
    at::Tensor& hidden_states,
    at::Tensor& w1,
    at::Tensor& w2,
    at::Tensor& topk_weights,
    at::Tensor& topk_ids,
    bool inplace,
    int64_t moe_comp_method,
    const std::optional<at::Tensor>& w1_scale,
    const std::optional<at::Tensor>& w2_scale,
    const std::optional<at::Tensor>& w1_zero,
    const std::optional<at::Tensor>& w2_zero,
    const std::optional<std::vector<int64_t>> block_size,
    bool is_vnni);

at::Tensor shared_expert_cpu(
    at::Tensor& hidden_states,
    at::Tensor& w1,
    at::Tensor& w2,
    at::Tensor& fused_experts_out,
    double routed_scaling_factor,
    bool inplace,
    bool use_int8_w8a8,
    bool use_fp8_w8a16,
    const std::optional<at::Tensor>& w1_scale,
    const std::optional<at::Tensor>& w2_scale,
    const std::optional<std::vector<int64_t>> block_size,
    bool is_vnni);

// Internal: token routing alignment
int moe_align_block_size(
    /* sorted_ids, expert_ids, offsets, topk_ids, ... */);

Import

#include "common.h"
#include "gemm.h"
#include "vec.h"

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| hidden_states | at::Tensor [M, K] | Yes | Input hidden states: M tokens, hidden dimension K |
| w1 | at::Tensor [E, 2N, K] | Yes | Combined gate and up projection weights for E experts |
| w2 | at::Tensor [E, K, N] | Yes | Down projection weights for E experts |
| topk_weights | at::Tensor [M, topk] | Yes | Routing weights for the selected experts per token (float32) |
| topk_ids | at::Tensor [M, topk] | Yes | Expert indices selected per token (int32) |
| inplace | bool | Yes | Whether to write the output in-place to hidden_states |
| moe_comp_method | int64_t | Yes | Quantization method (0 = BF16; other values select INT8_W8A8, INT4_W4A8, or FP8) |
| w1_scale | std::optional<at::Tensor> | No | Quantization scales for w1 (required for quantized methods) |
| w2_scale | std::optional<at::Tensor> | No | Quantization scales for w2 (required for quantized methods) |
| w1_zero | std::optional<at::Tensor> | No | Zero points for w1 (INT4 quantization) |
| w2_zero | std::optional<at::Tensor> | No | Zero points for w2 (INT4 quantization) |
| block_size | std::optional<std::vector<int64_t>> | No | Block sizes for block-quantized weights |
| is_vnni | bool | Yes | Whether weights are already in VNNI-packed format |

Outputs

| Name | Type | Description |
|------|------|-------------|
| out_hidden_states | at::Tensor [M, K] | MoE output after expert computation, topk-weighted sum, and accumulation |

Usage Examples

// Standard fused MoE call with BFloat16 weights
at::Tensor output = fused_experts_cpu(
    hidden_states,       // [M, K]
    w1,                  // [E, 2N, K] gate+up projection
    w2,                  // [E, K, N] down projection
    topk_weights,        // [M, topk] float32
    topk_ids,            // [M, topk] int32
    /*inplace=*/false,
    /*moe_comp_method=*/0,
    /*w1_scale=*/std::nullopt,
    /*w2_scale=*/std::nullopt,
    /*w1_zero=*/std::nullopt,
    /*w2_zero=*/std::nullopt,
    /*block_size=*/std::nullopt,
    /*is_vnni=*/true);

// Shared expert with routed scaling
at::Tensor shared_out = shared_expert_cpu(
    hidden_states, w1, w2,
    fused_experts_out,
    /*routed_scaling_factor=*/1.0,
    /*inplace=*/false,
    /*use_int8_w8a8=*/false,
    /*use_fp8_w8a16=*/false,
    /*w1_scale=*/std::nullopt, /*w2_scale=*/std::nullopt,
    /*block_size=*/std::nullopt, /*is_vnni=*/true);
