
Implementation:Sgl project Sglang CPU TopK

From Leeroopedia


Knowledge Sources
Domains Machine Learning, CPU Kernels
Last Updated 2026-02-10 00:00 GMT

Overview

Implements CPU-optimized top-k selection kernels for Mixture-of-Experts (MoE) routing, supporting sigmoid, softmax, and grouped top-k variants with vectorized SIMD operations.

Description

This file provides four top-k selection strategies for MoE expert routing:

  • topk_sigmoid_kernel_impl -- Computes sigmoid scores over the gating outputs and selects the top expert per token. Currently supports topk=1 only.
  • topk_softmax_kernel_impl -- Applies vectorized softmax over experts, then selects top-k. The softmax template function is a compile-time-sized SIMD implementation with three passes: find max, compute exp and sum, and normalize.
  • grouped_topk_kernel_impl -- Implements DeepSeek V2-style grouped expert routing: computes softmax scores, finds per-group maximum scores, selects topk_group groups, then extracts the top-k experts from those groups. Parallelized across tokens via at::parallel_for.
  • biased_grouped_topk_kernel_impl -- DeepSeek V3/R1 variant that adds a correction_bias to sigmoid scores before group and expert selection.
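The grouped routing flow above (per-group maxima, group selection, then expert selection within the surviving groups) can be sketched in scalar C++. This is an illustrative simplification, not the actual kernel: the real implementation works on softmax or sigmoid scores, applies the correction bias in the biased variant, and parallelizes across tokens.

```cpp
#include <algorithm>
#include <limits>
#include <numeric>
#include <vector>

// Hypothetical scalar sketch of grouped top-k routing: scores are split into
// num_groups contiguous groups; the topk_group groups with the largest
// per-group maximum stay eligible, the rest are masked out, and the final
// top-k expert indices are taken from the surviving scores.
std::vector<int> grouped_topk_sketch(std::vector<float> scores, int num_groups,
                                     int topk_group, int topk) {
  const int group_size = static_cast<int>(scores.size()) / num_groups;
  const float neg_inf = -std::numeric_limits<float>::infinity();

  // Per-group maximum score.
  std::vector<float> group_max(num_groups, neg_inf);
  for (int g = 0; g < num_groups; ++g)
    for (int i = 0; i < group_size; ++i)
      group_max[g] = std::max(group_max[g], scores[g * group_size + i]);

  // Select the topk_group groups with the largest maxima.
  std::vector<int> order(num_groups);
  std::iota(order.begin(), order.end(), 0);
  std::partial_sort(order.begin(), order.begin() + topk_group, order.end(),
                    [&](int a, int b) { return group_max[a] > group_max[b]; });
  std::vector<bool> keep(num_groups, false);
  for (int k = 0; k < topk_group; ++k) keep[order[k]] = true;

  // Mask out experts in non-selected groups.
  for (int g = 0; g < num_groups; ++g)
    if (!keep[g])
      for (int i = 0; i < group_size; ++i)
        scores[g * group_size + i] = neg_inf;

  // Final top-k expert indices over the masked scores.
  std::vector<int> idx(scores.size());
  std::iota(idx.begin(), idx.end(), 0);
  std::partial_sort(idx.begin(), idx.begin() + topk, idx.end(),
                    [&](int a, int b) { return scores[a] > scores[b]; });
  idx.resize(topk);
  return idx;
}
```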

The softmax implementation handles arbitrary compile-time expert counts (1 to 512) using SIMD intrinsics via PyTorch's at::vec::Vectorized abstraction. Small sizes (< vector width) use masked loads and stores, while larger sizes use fully unrolled vector loops. The vec_reduce_max and vec_reduce_sum helpers perform efficient horizontal reductions.
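A scalar sketch of that three-pass structure is shown below. This is illustrative only: the actual kernel replaces each loop with at::vec::Vectorized operations, uses masked loads and stores for sizes below the vector width, and performs the reductions with vec_reduce_max and vec_reduce_sum.

```cpp
#include <algorithm>
#include <cmath>

// Scalar sketch of the three-pass softmax with a compile-time SIZE.
template <int SIZE>
inline void softmax_sketch(float* out, const float* input) {
  // Pass 1: find the maximum logit for numerical stability.
  float max_val = input[0];
  for (int i = 1; i < SIZE; ++i) max_val = std::max(max_val, input[i]);

  // Pass 2: exponentiate the shifted logits and accumulate the sum.
  float sum = 0.f;
  for (int i = 0; i < SIZE; ++i) {
    out[i] = std::exp(input[i] - max_val);
    sum += out[i];
  }

  // Pass 3: normalize so the scores sum to 1.
  const float inv = 1.f / sum;
  for (int i = 0; i < SIZE; ++i) out[i] *= inv;
}
```

Because SIZE is a template parameter, every loop bound is a compile-time constant, which is what lets the vectorized version fully unroll for each supported expert count.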

Public API functions: topk_sigmoid_cpu, topk_softmax_cpu, grouped_topk_cpu, and biased_grouped_topk_cpu. All return a tuple of (topk_weights, topk_ids).

Usage

Use these kernels during CPU MoE inference for expert routing. The specific variant depends on the model architecture: topk_sigmoid for simple sigmoid gating, topk_softmax for softmax-based routing, grouped_topk for DeepSeek V2-style grouped routing, and biased_grouped_topk for DeepSeek V3/R1 with correction bias. Supported expert counts include 1, 2, 4, 8, 16, 32, 64, 128, 160, 256, 384, and 512.

Code Reference

Source Location

Signature

// Public API functions
std::tuple<at::Tensor, at::Tensor>
topk_sigmoid_cpu(
    at::Tensor& hidden_states,
    at::Tensor& gating_output,
    int64_t topk,
    bool renormalize);

std::tuple<at::Tensor, at::Tensor>
topk_softmax_cpu(
    at::Tensor& hidden_states,
    at::Tensor& gating_output,
    int64_t topk,
    bool renormalize);

std::tuple<at::Tensor, at::Tensor>
grouped_topk_cpu(
    at::Tensor& hidden_states,
    at::Tensor& gating_output,
    int64_t topk,
    bool renormalize,
    int64_t num_expert_group,
    int64_t topk_group,
    int64_t num_fused_shared_experts,
    std::optional<double> routed_scaling_factor,
    std::optional<at::Tensor> num_token_non_padded);

std::tuple<at::Tensor, at::Tensor>
biased_grouped_topk_cpu(
    at::Tensor& hidden_states,
    at::Tensor& gating_output,
    at::Tensor& correction_bias,
    int64_t topk,
    bool renormalize,
    int64_t num_expert_group,
    int64_t topk_group,
    int64_t num_fused_shared_experts,
    std::optional<double> routed_scaling_factor,
    std::optional<at::Tensor> num_token_non_padded);

// Internal: SIMD softmax with compile-time size
template <typename scalar_t, int SIZE>
inline void softmax(float* __restrict__ out, const scalar_t* __restrict__ input);

Import

#include "common.h"
#include "vec.h"

I/O Contract

Inputs

Name Type Required Description
hidden_states at::Tensor [num_tokens, hidden_size] Yes Input hidden states (used to infer dtype and num_tokens)
gating_output at::Tensor [num_tokens, num_experts] Yes Router logits from the gating network (BFloat16 or Half)
topk int64_t Yes Number of experts to select per token
renormalize bool Yes Whether to renormalize the top-k weights to sum to 1
num_expert_group int64_t Depends Number of expert groups (grouped variants only)
topk_group int64_t Depends Number of groups to select (grouped variants only)
correction_bias at::Tensor [num_experts] Depends Bias added to sigmoid scores before selection (biased variant only)
num_fused_shared_experts int64_t Depends Number of fused shared experts (must be 0 currently)
routed_scaling_factor std::optional<double> No Scaling factor for routed experts (must be None or 1.0 currently)
num_token_non_padded std::optional<at::Tensor> No Optional count of non-padded tokens (grouped variants only)

Outputs

Name Type Description
topk_weights at::Tensor [num_tokens, topk] Selected expert routing weights (float32)
topk_ids at::Tensor [num_tokens, topk] Selected expert indices (int32)

Usage Examples

// Simple softmax top-k routing
auto [weights, ids] = topk_softmax_cpu(
    hidden_states,          // [num_tokens, hidden_size]
    gating_output,          // [num_tokens, num_experts]
    /*topk=*/2,
    /*renormalize=*/true);

// DeepSeek V2-style grouped top-k
auto [weights, ids] = grouped_topk_cpu(
    hidden_states,
    gating_output,          // [num_tokens, 160]
    /*topk=*/6,
    /*renormalize=*/true,
    /*num_expert_group=*/8,
    /*topk_group=*/3,
    /*num_fused_shared_experts=*/0,
    /*routed_scaling_factor=*/std::nullopt,
    /*num_token_non_padded=*/std::nullopt);

// DeepSeek V3/R1 biased grouped top-k
auto [weights, ids] = biased_grouped_topk_cpu(
    hidden_states,
    gating_output,          // [num_tokens, 256]
    correction_bias,        // [256]
    /*topk=*/8,
    /*renormalize=*/true,
    /*num_expert_group=*/8,
    /*topk_group=*/4,
    /*num_fused_shared_experts=*/0,
    /*routed_scaling_factor=*/std::nullopt,
    /*num_token_non_padded=*/std::nullopt);
