Implementation:InternLM Lmdeploy SamplingToppKernels
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, Sampling |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
CUDA kernels for top-P (nucleus) and min-P sampling, including sort initialization, softmax, cumulative-sum-based filtering, and combined top-P/min-P filtering.
Description
This header provides the components of the top-P sampling pipeline:
- invokeTopPSortInitialize() prepares the CUB sort buffers by initializing the offset arrays used for segmented sorting.
- invokeSoftmax() applies softmax to the logits, using the per-sequence kept counts as a mask.
- TopPSortParams and invokeTopPSort() sort the logits in descending order and apply top-P filtering: they compute cumulative sums and mark the cutoff point where the cumulative probability exceeds the threshold.
- TopPMinPFilterParams and invokeTopPMinPFilter() combine top-P and min-P filtering. Top-P keeps the tokens whose cumulative probability is within the threshold; min-P removes tokens whose individual probability falls below a fraction of the maximum probability.
- BlockPrefixCallbackOp is a CUB helper for computing running prefix sums within a block.
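The top-P and min-P rules described above can be illustrated with a small CPU reference. This is a minimal sketch, not the kernel code from the header: the function name topPMinPKeptCount is hypothetical, and two conventions are assumptions, namely that the token whose cumulative probability crosses top_p is still kept, and that the comparison against the threshold is inclusive.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference, not the CUDA kernel from sampling_topp_kernels.h.
// `sorted_probs` holds one sequence's probabilities in descending order.
// Returns how many leading entries survive both the top-P and min-P filters.
inline int topPMinPKeptCount(const std::vector<float>& sorted_probs, float top_p, float min_p)
{
    assert(!sorted_probs.empty());
    // min-P threshold: a fraction of the maximum (first) probability.
    const float min_threshold = sorted_probs[0] * min_p;
    float cumulative = 0.0f;
    int   kept       = 0;
    for (std::size_t i = 0; i < sorted_probs.size(); ++i) {
        // min-P: stop at the first token below the threshold (the top token always stays).
        if (i > 0 && sorted_probs[i] < min_threshold) {
            break;
        }
        cumulative += sorted_probs[i];
        ++kept;
        // top-P (assumed convention): the token that crosses the threshold is the last one kept.
        if (cumulative >= top_p) {
            break;
        }
    }
    return kept;
}
```

Because the input is sorted in descending order, both filters reduce to a single prefix length, which is why a per-sequence kept count is enough to describe the result.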
Usage
Use these kernels to implement nucleus (top-P) sampling, min-P sampling, or combined top-P/min-P strategies for controlling diversity in text generation.
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/sampling_topp_kernels.h
Signature
void invokeTopPSortInitialize(
const int vocab_size_padded, const int vocab_size, const size_t batch_size,
const int* top_ks, int* topp_id_val_buf, int* begin_offset_buf,
int* end_offset_buf, cudaStream_t stream);
template<typename T>
void invokeSoftmax(T* logits, const int vocab_size_padded, const int vocab_size,
const int batch_size, const int* kept, cudaStream_t stream);
struct TopPSortParams {
void* logits; void* sorted_logits; int* sorted_indices;
int* kept; int* top_ks; float* top_ps;
int batch_size; int vocab_size; int vocab_size_padded;
};
template<typename T>
void invokeTopPSort(TopPSortParams& params, cudaStream_t stream);
struct TopPMinPFilterParams {
void* sorted_logits; int* sorted_indices; int* kept;
float* top_ps; float* min_ps;
int batch_size; int vocab_size; int vocab_size_padded;
};
template<typename T>
void invokeTopPMinPFilter(TopPMinPFilterParams& params, cudaStream_t stream);
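For orientation, the begin/end offset arrays taken by invokeTopPSortInitialize() can be pictured as one segment per sequence. The host-side sketch below is illustrative only: SegmentOffsets and makeSegmentOffsets are hypothetical names, and the layout (rows strided by vocab_size_padded, with only the first vocab_size entries of each row in the segment) is an assumption, not something the header confirms. The top_ks argument is ignored here.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side sketch of a per-sequence segment layout for a segmented sort.
struct SegmentOffsets {
    std::vector<int> begin;  // begin[b]: index of the first element of sequence b
    std::vector<int> end;    // end[b]: one past the last sorted element of sequence b
};

// Assumption: each row occupies vocab_size_padded slots and only the first
// vocab_size entries take part in the sort.
inline SegmentOffsets makeSegmentOffsets(int batch_size, int vocab_size_padded, int vocab_size)
{
    assert(vocab_size <= vocab_size_padded);
    SegmentOffsets offsets;
    for (int b = 0; b < batch_size; ++b) {
        offsets.begin.push_back(b * vocab_size_padded);
        offsets.end.push_back(b * vocab_size_padded + vocab_size);
    }
    return offsets;
}
```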
Import
#include "src/turbomind/kernels/sampling_topp_kernels.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| logits | void* / T* | Yes | Logit buffer (modified in-place for softmax) |
| top_ps | float* | Yes | Per-sequence top-P probability thresholds |
| min_ps | float* | No | Per-sequence min-P thresholds, expressed as a fraction of the maximum probability |
| top_ks | int* | No | Per-sequence top-K values (for initialization) |
| vocab_size | int | Yes | Actual vocabulary size |
| vocab_size_padded | int | Yes | Padded vocabulary size |
| batch_size | int | Yes | Number of sequences |
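The in-place softmax noted for the logits input can be sketched for a single row as follows. This is a CPU illustration under stated assumptions, not the kernel itself: maskedSoftmaxRow is a hypothetical name, and it assumes that the kept count selects the leading entries of the row and that masked-out entries are set to zero.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical CPU reference: softmax over the first `kept` entries of one row.
// Assumption: entries at index >= kept are masked out and written as 0.
inline void maskedSoftmaxRow(std::vector<float>& row, int kept)
{
    assert(kept > 0 && kept <= static_cast<int>(row.size()));
    // Subtract the maximum of the kept entries for numerical stability.
    const float max_val = *std::max_element(row.begin(), row.begin() + kept);
    float sum = 0.0f;
    for (int i = 0; i < kept; ++i) {
        row[i] = std::exp(row[i] - max_val);
        sum += row[i];
    }
    for (int i = 0; i < kept; ++i) {
        row[i] /= sum;
    }
    std::fill(row.begin() + kept, row.end(), 0.0f);
}
```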
Outputs
| Name | Type | Description |
|---|---|---|
| sorted_logits | void* | Sorted probability values in descending order |
| sorted_indices | int* | Token indices corresponding to sorted values |
| kept | int* | Number of tokens kept after filtering per sequence |
Usage Examples
using namespace turbomind;
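// Top-P sort and filter: writes sorted_logits, sorted_indices, and kept.
// Fields are listed in TopPSortParams declaration order.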
TopPSortParams sort_params{logits, sorted_logits, sorted_indices,
kept, top_ks, top_ps, batch_size, vocab_size, vocab_padded};
invokeTopPSort<half>(sort_params, stream);
// Combined top-P and min-P filter over the sorted buffers.
// Fields are listed in TopPMinPFilterParams declaration order.
TopPMinPFilterParams filter_params{sorted_logits, sorted_indices, kept,
top_ps, min_ps, batch_size, vocab_size, vocab_padded};
invokeTopPMinPFilter<half>(filter_params, stream);