Implementation: Sgl project Sglang CPU TopK
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, CPU Kernels |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Implements CPU-optimized top-k selection kernels for Mixture-of-Experts (MoE) routing, supporting sigmoid, softmax, and grouped top-k variants with vectorized SIMD operations.
Description
This file provides four top-k selection strategies for MoE expert routing:
- topk_sigmoid_kernel_impl -- Computes sigmoid scores over gating outputs and selects the top-k experts per token. Currently supports topk=1 only.
- topk_softmax_kernel_impl -- Applies vectorized softmax over experts, then selects top-k. The softmax template function is a compile-time-sized SIMD implementation with three passes: find max, compute exp and sum, and normalize.
- grouped_topk_kernel_impl -- Implements DeepSeek V2-style grouped expert routing: computes softmax scores, finds per-group maximum scores, selects topk_group groups, then extracts the top-k experts from those groups. Parallelized across tokens via at::parallel_for.
- biased_grouped_topk_kernel_impl -- DeepSeek V3/R1 variant that adds a correction_bias to sigmoid scores before group and expert selection.
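The grouped selection logic in the two grouped variants can be sketched in scalar C++ for a single token. This is an illustrative reference, not the vectorized kernel: the function name and shapes are assumptions, and the real kernel operates on per-token score rows in parallel.

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// Illustrative scalar sketch of grouped top-k routing for one token:
// experts are split into num_groups equal groups; the topk_group groups
// with the highest per-group maximum score stay eligible, and the final
// top-k experts are chosen only from those groups.
std::vector<int> grouped_topk_ref(
    const std::vector<float>& scores,  // softmax (or sigmoid) score per expert
    int num_groups, int topk_group, int topk) {
  const int group_size = static_cast<int>(scores.size()) / num_groups;
  // Per-group maximum score.
  std::vector<float> group_max(num_groups);
  for (int g = 0; g < num_groups; ++g) {
    group_max[g] = *std::max_element(
        scores.begin() + g * group_size,
        scores.begin() + (g + 1) * group_size);
  }
  // Pick the topk_group groups with the largest maxima.
  std::vector<int> group_idx(num_groups);
  std::iota(group_idx.begin(), group_idx.end(), 0);
  std::partial_sort(group_idx.begin(), group_idx.begin() + topk_group,
                    group_idx.end(),
                    [&](int a, int b) { return group_max[a] > group_max[b]; });
  // Gather candidate experts from the selected groups.
  std::vector<int> candidates;
  for (int i = 0; i < topk_group; ++i)
    for (int j = 0; j < group_size; ++j)
      candidates.push_back(group_idx[i] * group_size + j);
  // Final top-k experts by score.
  std::partial_sort(candidates.begin(), candidates.begin() + topk,
                    candidates.end(),
                    [&](int a, int b) { return scores[a] > scores[b]; });
  candidates.resize(topk);
  return candidates;
}
```

The biased variant follows the same shape, except a correction_bias term is added to the sigmoid scores before the group and expert selection steps.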
The softmax implementation handles a fixed set of compile-time expert counts (from 1 to 512) using SIMD intrinsics via PyTorch's at::vec::Vectorized abstraction. Sizes smaller than the vector width use masked loads and stores, while larger sizes use fully unrolled vector loops. The vec_reduce_max and vec_reduce_sum helpers perform efficient horizontal reductions.
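The three-pass structure can be written out as a scalar reference (a sketch of what the SIMD template computes, with illustrative names; the real kernel vectorizes each pass with at::vec::Vectorized):

```cpp
#include <algorithm>
#include <cmath>

// Scalar reference of the three-pass softmax the kernel vectorizes:
// pass 1 finds the max (for numerical stability), pass 2 computes
// exp(x - max) and accumulates the sum, pass 3 normalizes.
template <int SIZE>
void softmax_ref(float* out, const float* input) {
  // Pass 1: horizontal max (vec_reduce_max in the kernel).
  float max_val = input[0];
  for (int i = 1; i < SIZE; ++i) max_val = std::max(max_val, input[i]);
  // Pass 2: exponentiate and sum (vec_reduce_sum in the kernel).
  float sum = 0.f;
  for (int i = 0; i < SIZE; ++i) {
    out[i] = std::exp(input[i] - max_val);
    sum += out[i];
  }
  // Pass 3: normalize so the scores sum to 1.
  const float inv_sum = 1.f / sum;
  for (int i = 0; i < SIZE; ++i) out[i] *= inv_sum;
}
```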
Public API functions: topk_sigmoid_cpu, topk_softmax_cpu, grouped_topk_cpu, and biased_grouped_topk_cpu. All return a tuple of (topk_weights, topk_ids).
Usage
Use these kernels during CPU MoE inference for expert routing. The specific variant depends on the model architecture: topk_sigmoid_cpu for simple sigmoid gating, topk_softmax_cpu for softmax-based routing, grouped_topk_cpu for DeepSeek V2-style grouped routing, and biased_grouped_topk_cpu for DeepSeek V3/R1 with correction bias. Supported expert counts are 1, 2, 4, 8, 16, 32, 64, 128, 160, 256, 384, and 512.
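For the sigmoid variant, the per-token logic amounts to applying a sigmoid to each logit and keeping the best expert. A minimal scalar sketch (illustrative name and signature, reflecting the topk=1 restriction noted above):

```cpp
#include <cmath>

// Illustrative sketch of sigmoid top-1 gating for one token: apply
// sigmoid to each router logit and keep the single highest-scoring
// expert, returning its index and writing its score as the weight.
int topk_sigmoid_ref(const float* logits, int num_experts, float* weight_out) {
  int best = 0;
  float best_score = 1.f / (1.f + std::exp(-logits[0]));
  for (int e = 1; e < num_experts; ++e) {
    const float s = 1.f / (1.f + std::exp(-logits[e]));
    if (s > best_score) {
      best_score = s;
      best = e;
    }
  }
  *weight_out = best_score;  // renormalize is a no-op for topk=1
  return best;
}
```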
Code Reference
Source Location
- Repository: Sgl_project_Sglang
- File: sgl-kernel/csrc/cpu/topk.cpp
- Lines: 1-671
Signature
// Public API functions
std::tuple<at::Tensor, at::Tensor>
topk_sigmoid_cpu(
at::Tensor& hidden_states,
at::Tensor& gating_output,
int64_t topk,
bool renormalize);
std::tuple<at::Tensor, at::Tensor>
topk_softmax_cpu(
at::Tensor& hidden_states,
at::Tensor& gating_output,
int64_t topk,
bool renormalize);
std::tuple<at::Tensor, at::Tensor>
grouped_topk_cpu(
at::Tensor& hidden_states,
at::Tensor& gating_output,
int64_t topk,
bool renormalize,
int64_t num_expert_group,
int64_t topk_group,
int64_t num_fused_shared_experts,
std::optional<double> routed_scaling_factor,
std::optional<at::Tensor> num_token_non_padded);
std::tuple<at::Tensor, at::Tensor>
biased_grouped_topk_cpu(
at::Tensor& hidden_states,
at::Tensor& gating_output,
at::Tensor& correction_bias,
int64_t topk,
bool renormalize,
int64_t num_expert_group,
int64_t topk_group,
int64_t num_fused_shared_experts,
std::optional<double> routed_scaling_factor,
std::optional<at::Tensor> num_token_non_padded);
// Internal: SIMD softmax with compile-time size
template <typename scalar_t, int SIZE>
inline void softmax(float* __restrict__ out, const scalar_t* __restrict__ input);
Import
#include "common.h"
#include "vec.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| hidden_states | at::Tensor [num_tokens, hidden_size] | Yes | Input hidden states (used for dtype and num_tokens inference) |
| gating_output | at::Tensor [num_tokens, num_experts] | Yes | Router logits from the gating network (BFloat16 or Half) |
| topk | int64_t | Yes | Number of experts to select per token |
| renormalize | bool | Yes | Whether to renormalize topk weights to sum to 1 |
| num_expert_group | int64_t | Depends | Number of expert groups (for grouped variants) |
| topk_group | int64_t | Depends | Number of groups to select (for grouped variants) |
| correction_bias | at::Tensor [num_experts] | Depends | Bias added to scores before selection (biased variant only) |
| num_fused_shared_experts | int64_t | Depends | Number of fused shared experts (must be 0 currently) |
| routed_scaling_factor | std::optional<double> | No | Scaling factor for routed experts (must be None or 1.0 currently) |
Outputs
| Name | Type | Description |
|---|---|---|
| topk_weights | at::Tensor [num_tokens, topk] | Selected expert routing weights (float32) |
| topk_ids | at::Tensor [num_tokens, topk] | Selected expert indices (int32) |
Usage Examples
// Simple softmax top-k routing
auto [weights, ids] = topk_softmax_cpu(
hidden_states, // [num_tokens, hidden_size]
gating_output, // [num_tokens, num_experts]
/*topk=*/2,
/*renormalize=*/true);
// DeepSeek V2-style grouped top-k
auto [weights, ids] = grouped_topk_cpu(
hidden_states,
gating_output, // [num_tokens, 160]
/*topk=*/6,
/*renormalize=*/true,
/*num_expert_group=*/8,
/*topk_group=*/3,
/*num_fused_shared_experts=*/0,
/*routed_scaling_factor=*/std::nullopt,
/*num_token_non_padded=*/std::nullopt);
// DeepSeek V3/R1 biased grouped top-k
auto [weights, ids] = biased_grouped_topk_cpu(
hidden_states,
gating_output, // [num_tokens, 256]
correction_bias, // [256]
/*topk=*/8,
/*renormalize=*/true,
/*num_expert_group=*/8,
/*topk_group=*/4,
/*num_fused_shared_experts=*/0,
/*routed_scaling_factor=*/std::nullopt,
/*num_token_non_padded=*/std::nullopt);