
Implementation:Sgl project Sglang CPU TopK

From Leeroopedia


Knowledge Sources
Domains Machine Learning, CPU Kernels
Last Updated 2026-02-10 00:00 GMT

Overview

Implements CPU-optimized top-k selection kernels for Mixture-of-Experts (MoE) routing, supporting sigmoid, softmax, and grouped top-k variants with vectorized SIMD operations.

Description

This file provides four top-k selection strategies for MoE expert routing:

  • topk_sigmoid_kernel_impl -- Computes sigmoid scores over the gating outputs and selects the top expert per token. Currently supports topk=1 only.
  • topk_softmax_kernel_impl -- Applies vectorized softmax over experts, then selects top-k. The softmax template function is a compile-time-sized SIMD implementation with three passes: find max, compute exp and sum, and normalize.
  • grouped_topk_kernel_impl -- Implements DeepSeek V2-style grouped expert routing: computes softmax scores, finds per-group maximum scores, selects topk_group groups, then extracts the top-k experts from those groups. Parallelized across tokens via at::parallel_for.
  • biased_grouped_topk_kernel_impl -- DeepSeek V3/R1 variant that adds a correction_bias to sigmoid scores before group and expert selection.
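The grouped routing flow above (per-group maxima, group selection, then expert selection within the surviving groups) can be sketched in scalar C++. This is an illustrative simplification, not the actual kernel: the real implementation works on softmax or sigmoid scores, applies the correction bias in the biased variant, and parallelizes across tokens.

```cpp
#include <algorithm>
#include <limits>
#include <numeric>
#include <vector>

// Hypothetical scalar sketch of grouped top-k routing: scores are split into
// num_groups contiguous groups; the topk_group groups with the largest
// per-group maximum stay eligible, the rest are masked out, and the final
// top-k expert indices are taken from the surviving scores.
std::vector<int> grouped_topk_sketch(std::vector<float> scores, int num_groups,
                                     int topk_group, int topk) {
  const int group_size = static_cast<int>(scores.size()) / num_groups;
  const float neg_inf = -std::numeric_limits<float>::infinity();

  // Per-group maximum score.
  std::vector<float> group_max(num_groups, neg_inf);
  for (int g = 0; g < num_groups; ++g)
    for (int i = 0; i < group_size; ++i)
      group_max[g] = std::max(group_max[g], scores[g * group_size + i]);

  // Select the topk_group groups with the largest maxima.
  std::vector<int> order(num_groups);
  std::iota(order.begin(), order.end(), 0);
  std::partial_sort(order.begin(), order.begin() + topk_group, order.end(),
                    [&](int a, int b) { return group_max[a] > group_max[b]; });
  std::vector<bool> keep(num_groups, false);
  for (int k = 0; k < topk_group; ++k) keep[order[k]] = true;

  // Mask out experts in non-selected groups.
  for (int g = 0; g < num_groups; ++g)
    if (!keep[g])
      for (int i = 0; i < group_size; ++i)
        scores[g * group_size + i] = neg_inf;

  // Final top-k expert indices over the masked scores.
  std::vector<int> idx(scores.size());
  std::iota(idx.begin(), idx.end(), 0);
  std::partial_sort(idx.begin(), idx.begin() + topk, idx.end(),
                    [&](int a, int b) { return scores[a] > scores[b]; });
  idx.resize(topk);
  return idx;
}
```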

The softmax implementation handles arbitrary compile-time expert counts (1 to 512) using SIMD intrinsics via PyTorch's at::vec::Vectorized abstraction. Small sizes (< vector width) use masked loads and stores, while larger sizes use fully unrolled vector loops. The vec_reduce_max and vec_reduce_sum helpers perform efficient horizontal reductions.
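A scalar sketch of that three-pass structure is shown below. This is illustrative only: the actual kernel replaces each loop with at::vec::Vectorized operations, uses masked loads and stores for sizes below the vector width, and performs the reductions with vec_reduce_max and vec_reduce_sum.

```cpp
#include <algorithm>
#include <cmath>

// Scalar sketch of the three-pass softmax with a compile-time SIZE.
template <int SIZE>
inline void softmax_sketch(float* out, const float* input) {
  // Pass 1: find the maximum logit for numerical stability.
  float max_val = input[0];
  for (int i = 1; i < SIZE; ++i) max_val = std::max(max_val, input[i]);

  // Pass 2: exponentiate the shifted logits and accumulate the sum.
  float sum = 0.f;
  for (int i = 0; i < SIZE; ++i) {
    out[i] = std::exp(input[i] - max_val);
    sum += out[i];
  }

  // Pass 3: normalize so the scores sum to 1.
  const float inv = 1.f / sum;
  for (int i = 0; i < SIZE; ++i) out[i] *= inv;
}
```

Because SIZE is a template parameter, every loop bound is a compile-time constant, which is what lets the vectorized version fully unroll for each supported expert count.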

Public API functions: topk_sigmoid_cpu, topk_softmax_cpu, grouped_topk_cpu, and biased_grouped_topk_cpu. All return a tuple of (topk_weights, topk_ids).

Usage

Use these kernels during CPU MoE inference for expert routing. The specific variant depends on the model architecture: topk_sigmoid for simple sigmoid gating, topk_softmax for softmax-based routing, grouped_topk for DeepSeek V2-style grouped routing, and biased_grouped_topk for DeepSeek V3/R1 with correction bias. Supported expert counts include 1, 2, 4, 8, 16, 32, 64, 128, 160, 256, 384, and 512.

Code Reference

Source Location

Signature

// Public API functions
std::tuple<at::Tensor, at::Tensor>
topk_sigmoid_cpu(
    at::Tensor& hidden_states,
    at::Tensor& gating_output,
    int64_t topk,
    bool renormalize);

std::tuple<at::Tensor, at::Tensor>
topk_softmax_cpu(
    at::Tensor& hidden_states,
    at::Tensor& gating_output,
    int64_t topk,
    bool renormalize);

std::tuple<at::Tensor, at::Tensor>
grouped_topk_cpu(
    at::Tensor& hidden_states,
    at::Tensor& gating_output,
    int64_t topk,
    bool renormalize,
    int64_t num_expert_group,
    int64_t topk_group,
    int64_t num_fused_shared_experts,
    std::optional<double> routed_scaling_factor,
    std::optional<at::Tensor> num_token_non_padded);

std::tuple<at::Tensor, at::Tensor>
biased_grouped_topk_cpu(
    at::Tensor& hidden_states,
    at::Tensor& gating_output,
    at::Tensor& correction_bias,
    int64_t topk,
    bool renormalize,
    int64_t num_expert_group,
    int64_t topk_group,
    int64_t num_fused_shared_experts,
    std::optional<double> routed_scaling_factor,
    std::optional<at::Tensor> num_token_non_padded);

// Internal: SIMD softmax with compile-time size
template <typename scalar_t, int SIZE>
inline void softmax(float* __restrict__ out, const scalar_t* __restrict__ input);

Import

#include "common.h"
#include "vec.h"

I/O Contract

Inputs

Name Type Required Description
hidden_states at::Tensor [num_tokens, hidden_size] Yes Input hidden states (used to infer dtype and num_tokens)
gating_output at::Tensor [num_tokens, num_experts] Yes Router logits from the gating network (BFloat16 or Half)
topk int64_t Yes Number of experts to select per token
renormalize bool Yes Whether to renormalize the top-k weights to sum to 1
num_expert_group int64_t Depends Number of expert groups (grouped variants only)
topk_group int64_t Depends Number of groups to select (grouped variants only)
correction_bias at::Tensor [num_experts] Depends Bias added to sigmoid scores before selection (biased variant only)
num_fused_shared_experts int64_t Depends Number of fused shared experts (must be 0 currently)
routed_scaling_factor std::optional<double> No Scaling factor for routed experts (must be None or 1.0 currently)
num_token_non_padded std::optional<at::Tensor> No Optional count of non-padded tokens (grouped variants only)

Outputs

Name Type Description
topk_weights at::Tensor [num_tokens, topk] Selected expert routing weights (float32)
topk_ids at::Tensor [num_tokens, topk] Selected expert indices (int32)

Usage Examples

// Simple softmax top-k routing
auto [weights, ids] = topk_softmax_cpu(
    hidden_states,          // [num_tokens, hidden_size]
    gating_output,          // [num_tokens, num_experts]
    /*topk=*/2,
    /*renormalize=*/true);

// DeepSeek V2-style grouped top-k
auto [weights, ids] = grouped_topk_cpu(
    hidden_states,
    gating_output,          // [num_tokens, 160]
    /*topk=*/6,
    /*renormalize=*/true,
    /*num_expert_group=*/8,
    /*topk_group=*/3,
    /*num_fused_shared_experts=*/0,
    /*routed_scaling_factor=*/std::nullopt,
    /*num_token_non_padded=*/std::nullopt);

// DeepSeek V3/R1 biased grouped top-k
auto [weights, ids] = biased_grouped_topk_cpu(
    hidden_states,
    gating_output,          // [num_tokens, 256]
    correction_bias,        // [256]
    /*topk=*/8,
    /*renormalize=*/true,
    /*num_expert_group=*/8,
    /*topk_group=*/4,
    /*num_fused_shared_experts=*/0,
    /*routed_scaling_factor=*/std::nullopt,
    /*num_token_non_padded=*/std::nullopt);
