Implementation:InternLM Lmdeploy SamplingToppKernels

From Leeroopedia


Knowledge Sources
Domains: GPU_Kernels, Sampling
Last Updated: 2026-02-07 15:00 GMT

Overview

CUDA kernels for top-P (nucleus) and min-P sampling, including sort initialization, softmax, cumulative-sum-based filtering, and combined top-P/min-P filtering.
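Cumulative-sum-based filtering over a vocabulary that is processed tile by tile needs a running total carried from one tile to the next. The following is a minimal CPU sketch of that idea, modeled on the running-prefix functor pattern described in the CUB BlockScan documentation. The names RunningPrefix and tiled_inclusive_sum are hypothetical and are not taken from the lmdeploy header.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU analog of a running-prefix functor: each call returns the
// total accumulated before the current tile, then folds the tile's sum in.
struct RunningPrefix {
    float running_total = 0.0f;

    float operator()(float tile_aggregate)
    {
        const float old_prefix = running_total;
        running_total += tile_aggregate;
        return old_prefix;
    }
};

// Inclusive prefix sum computed one tile at a time, as a block-wide scan would.
inline std::vector<float> tiled_inclusive_sum(const std::vector<float>& values, std::size_t tile_size)
{
    assert(tile_size > 0);
    std::vector<float> out(values.size());
    RunningPrefix      prefix;
    for (std::size_t start = 0; start < values.size(); start += tile_size) {
        const std::size_t end = std::min(values.size(), start + tile_size);
        // Scan within the tile, ignoring everything that came before it.
        float local = 0.0f;
        for (std::size_t i = start; i < end; ++i) {
            local += values[i];
            out[i] = local;
        }
        // Shift the tile by the total of all previous tiles.
        const float offset = prefix(local);
        for (std::size_t i = start; i < end; ++i) {
            out[i] += offset;
        }
    }
    return out;
}
```

With a tile size of 2, the input {1, 2, 3, 4, 5} is scanned as three tiles, and the functor supplies the offsets 0, 3, and 10 in turn.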

Description

This header provides the components of the top-P sampling pipeline.

invokeTopPSortInitialize() prepares the CUB sort buffers by initializing the offset arrays used for segmented sorting.

invokeSoftmax() applies softmax to the logits, masked by the per-sequence kept counts.

invokeTopPSort(), configured through TopPSortParams, sorts the logits in descending order and applies top-P filtering: it computes cumulative sums and marks the cutoff point where the cumulative probability exceeds the threshold.

invokeTopPMinPFilter(), configured through TopPMinPFilterParams, combines top-P and min-P filtering. Top-P keeps the tokens whose cumulative probability is within the threshold, while min-P removes tokens whose individual probability falls below a given fraction of the maximum probability.

BlockPrefixCallbackOp is a helper functor for CUB block scans that carries a running prefix sum across successive scans within a block.
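The two filtering rules described above can be illustrated with a small CPU reference. This is a sketch under stated assumptions, not the kernel's implementation: the function name count_kept_topp_minp is hypothetical, the input is assumed to be one sequence's probabilities already sorted in descending order, and the token that pushes the cumulative sum past top_p is assumed to be the last one kept.

```cpp
#include <cassert>
#include <vector>

// Hypothetical CPU reference for combined top-P / min-P filtering.
// sorted_probs: probabilities of one sequence, in descending order.
// Returns the number of leading entries that survive both filters.
inline int count_kept_topp_minp(const std::vector<float>& sorted_probs, float top_p, float min_p)
{
    assert(!sorted_probs.empty());
    // min-P threshold: a fraction of the maximum (first) probability.
    const float min_threshold = min_p * sorted_probs[0];
    float       cumulative    = 0.0f;
    int         kept          = 0;
    for (float p : sorted_probs) {
        // min-P: stop at the first token below the threshold, but always
        // keep at least the most likely token.
        if (kept > 0 && p < min_threshold) {
            break;
        }
        cumulative += p;
        ++kept;
        // top-P: the token that reaches the threshold is the last one kept
        // (assumed convention for this sketch).
        if (cumulative >= top_p) {
            break;
        }
    }
    return kept;
}
```

For the distribution {0.5, 0.3, 0.15, 0.05}, a top_p of 0.9 with min_p disabled (0) keeps three tokens, and a min_p of 0.2 with top_p at 1.0 also keeps three, because 0.05 falls below 0.2 × 0.5.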

Usage

Use these kernels to implement nucleus (top-P) sampling, min-P sampling, or combined top-P/min-P strategies for controlling diversity in text generation.

Code Reference

Source Location

Signature

void invokeTopPSortInitialize(
    const int vocab_size_padded, const int vocab_size, const size_t batch_size,
    const int* top_ks, int* topp_id_val_buf, int* begin_offset_buf,
    int* end_offset_buf, cudaStream_t stream);

template<typename T>
void invokeSoftmax(T* logits, const int vocab_size_padded, const int vocab_size,
    const int batch_size, const int* kept, cudaStream_t stream);

struct TopPSortParams {
    void* logits; void* sorted_logits; int* sorted_indices;
    int* kept; int* top_ks; float* top_ps;
    int batch_size; int vocab_size; int vocab_size_padded;
};
template<typename T>
void invokeTopPSort(TopPSortParams& params, cudaStream_t stream);

struct TopPMinPFilterParams {
    void* sorted_logits; int* sorted_indices; int* kept;
    float* top_ps; float* min_ps;
    int batch_size; int vocab_size; int vocab_size_padded;
};
template<typename T>
void invokeTopPMinPFilter(TopPMinPFilterParams& params, cudaStream_t stream);
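To make the role of begin_offset_buf, end_offset_buf, and topp_id_val_buf more concrete, the sketch below builds the kind of layout a segmented sort expects, on the CPU. It rests on assumptions not confirmed by this page: each sequence is taken to occupy one row of vocab_size_padded entries, only the first vocab_size entries of a row are sorted, and the id buffer holds each entry's token index within its row. SegmentLayout and make_segment_layout are hypothetical names, and any special handling based on top_ks is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side picture of the buffers a segmented sort needs.
struct SegmentLayout {
    std::vector<int> ids;            // token index of every entry within its row
    std::vector<int> begin_offsets;  // first sorted element of each row
    std::vector<int> end_offsets;    // one past the last sorted element of each row
};

inline SegmentLayout make_segment_layout(int vocab_size_padded, int vocab_size, int batch_size)
{
    assert(vocab_size > 0 && vocab_size <= vocab_size_padded && batch_size >= 0);
    SegmentLayout layout;
    layout.ids.resize(static_cast<std::size_t>(batch_size) * vocab_size_padded);
    layout.begin_offsets.resize(batch_size);
    layout.end_offsets.resize(batch_size);
    for (int row = 0; row < batch_size; ++row) {
        const int base = row * vocab_size_padded;
        // Assumed layout: the segment covers the real vocabulary and skips padding.
        layout.begin_offsets[row] = base;
        layout.end_offsets[row]   = base + vocab_size;
        for (int col = 0; col < vocab_size_padded; ++col) {
            layout.ids[base + col] = col;
        }
    }
    return layout;
}
```

Excluding the padding from each segment matters because a segmented sort only orders the elements between a segment's begin and end offsets.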

Import

#include "src/turbomind/kernels/sampling_topp_kernels.h"

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| logits | void* / T* | Yes | Logit buffer (modified in-place for softmax) |
| top_ps | float* | Yes | Per-sequence top-P probability thresholds |
| min_ps | float* | No | Per-sequence min-P probability thresholds |
| top_ks | int* | No | Per-sequence top-K values (for initialization) |
| vocab_size | int | Yes | Actual vocabulary size |
| vocab_size_padded | int | Yes | Padded vocabulary size |
| batch_size | int | Yes | Number of sequences |
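The in-place softmax on the logits buffer can be pictured with a CPU reference for a single row. The masking rule here is an assumption made for illustration: only the first kept entries of the row take part in the softmax, and the sketch zeroes the remaining entries, which the real kernel may or may not do. The name masked_softmax is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical CPU reference: softmax over the first `kept` entries of a row.
// Takes the row by value and returns the transformed copy.
inline std::vector<float> masked_softmax(std::vector<float> row, int kept)
{
    assert(kept > 0 && kept <= static_cast<int>(row.size()));
    // Subtract the maximum before exponentiating for numerical stability.
    const float max_val = *std::max_element(row.begin(), row.begin() + kept);
    float       sum     = 0.0f;
    for (int i = 0; i < kept; ++i) {
        row[i] = std::exp(row[i] - max_val);
        sum += row[i];
    }
    for (int i = 0; i < kept; ++i) {
        row[i] /= sum;
    }
    // Sketch-only choice: masked entries get zero probability.
    std::fill(row.begin() + kept, row.end(), 0.0f);
    return row;
}
```

With the row {0, 0, 5} and kept set to 2, the large third logit is ignored and the first two entries each receive probability 0.5.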

Outputs

| Name | Type | Description |
|---|---|---|
| sorted_logits | void* | Sorted probability values in descending order |
| sorted_indices | int* | Token indices corresponding to sorted values |
| kept | int* | Number of tokens kept after filtering per sequence |
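One way a consumer could turn these three outputs into a token is sketched below for a single sequence on the CPU. This is an illustration only and is not part of the header: pick_token is a hypothetical name, the uniform random number u in [0, 1) is assumed to come from the caller, and the kept prefix is renormalized here because filtering leaves it summing to less than 1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical consumer of the filter outputs for one sequence.
// sorted_probs / sorted_indices: descending probabilities and their token ids.
// kept: number of leading entries that survived filtering.
// u: caller-supplied uniform random number in [0, 1).
inline int pick_token(const std::vector<float>& sorted_probs,
                      const std::vector<int>&   sorted_indices,
                      int                       kept,
                      float                     u)
{
    assert(sorted_probs.size() == sorted_indices.size());
    assert(kept > 0 && static_cast<std::size_t>(kept) <= sorted_probs.size());
    // Mass of the kept prefix; scaling u by it renormalizes implicitly.
    float total = 0.0f;
    for (int i = 0; i < kept; ++i) {
        total += sorted_probs[i];
    }
    const float target  = u * total;
    float       running = 0.0f;
    for (int i = 0; i < kept; ++i) {
        running += sorted_probs[i];
        if (target < running) {
            return sorted_indices[i];
        }
    }
    // Guard against floating-point rounding at the upper edge.
    return sorted_indices[kept - 1];
}
```

Tokens beyond the kept prefix can never be returned, which is the effect the filtering step is meant to have on sampling.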

Usage Examples

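#include "src/turbomind/kernels/sampling_topp_kernels.h"

using namespace turbomind;

// Assumed to exist already: logits, sorted_logits, sorted_indices, kept,
// top_ks, top_ps, and min_ps are device buffers sized for batch_size
// sequences, and stream is a valid cudaStream_t.

// Top-P sort and filter
TopPSortParams sort_params{logits, sorted_logits, sorted_indices,
    kept, top_ks, top_ps, batch_size, vocab_size, vocab_padded};
invokeTopPSort<half>(sort_params, stream);

// Combined top-P and min-P filter on the sorted output
TopPMinPFilterParams filter_params{sorted_logits, sorted_indices, kept,
    top_ps, min_ps, batch_size, vocab_size, vocab_padded};
invokeTopPMinPFilter<half>(filter_params, stream);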
