Implementation:InternLM Lmdeploy SamplingToppKernels

Knowledge Sources	InternLM_Lmdeploy
Domains	GPU_Kernels, Sampling
Last Updated	2026-02-07 15:00 GMT

Overview

CUDA kernels for top-P (nucleus) and min-P sampling, including sort initialization, softmax, cumulative-sum-based filtering, and combined top-P/min-P filtering.

Description

This header provides the components of the top-P sampling pipeline.

invokeTopPSortInitialize() prepares the CUB sort buffers by initializing the offset arrays used for segmented sorting.

invokeSoftmax() applies softmax to the logits, using the per-sequence kept counts as a mask.

TopPSortParams and invokeTopPSort() sort the logits in descending order and apply top-P filtering: cumulative sums are computed over the sorted values, and the cutoff is marked where the cumulative probability exceeds the threshold.

TopPMinPFilterParams and invokeTopPMinPFilter() combine top-P and min-P filtering. Top-P keeps the tokens whose cumulative probability is within the threshold, while min-P removes tokens whose individual probability falls below a fraction of the maximum probability.

BlockPrefixCallbackOp is a CUB helper for computing running prefix sums within a block.
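
The combined top-P/min-P rule described above can be sketched on the CPU as follows. This is a minimal reference under stated assumptions, not the kernel itself: the function name filter_top_p_min_p is hypothetical, the input is assumed to already hold probabilities sorted in descending order, and the choice to keep the token that crosses the top-P threshold is an assumption about the boundary convention.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for one sequence; not part of the lmdeploy API.
// `sorted_probs` must be probabilities in descending order.
// Returns how many leading tokens survive both filters.
int filter_top_p_min_p(const std::vector<float>& sorted_probs, float top_p, float min_p)
{
    if (sorted_probs.empty()) {
        return 0;
    }
    // min-P is relative to the largest probability, which is the first entry.
    const float min_threshold = sorted_probs[0] * min_p;
    float cumulative = 0.f;
    int   kept       = 0;
    for (std::size_t i = 0; i < sorted_probs.size(); ++i) {
        // Because the input is sorted, every later token is below the threshold too.
        if (i > 0 && sorted_probs[i] < min_threshold) {
            break;
        }
        cumulative += sorted_probs[i];
        ++kept;
        // Assumption: the token that crosses the top-P threshold is still kept.
        if (cumulative >= top_p) {
            break;
        }
    }
    return kept;
}
```

The first token is always kept in this sketch, so a sequence never ends up with an empty candidate set.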

Usage

Use these kernels to implement nucleus (top-P) sampling, min-P sampling, or combined top-P/min-P strategies for controlling diversity in text generation.

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/kernels/sampling_topp_kernels.h

Signature

void invokeTopPSortInitialize(
    const int vocab_size_padded, const int vocab_size, const size_t batch_size,
    const int* top_ks, int* topp_id_val_buf, int* begin_offset_buf,
    int* end_offset_buf, cudaStream_t stream);

template<typename T>
void invokeSoftmax(T* logits, const int vocab_size_padded, const int vocab_size,
    const int batch_size, const int* kept, cudaStream_t stream);

struct TopPSortParams {
    void* logits; void* sorted_logits; int* sorted_indices;
    int* kept; int* top_ks; float* top_ps;
    int batch_size; int vocab_size; int vocab_size_padded;
};
template<typename T>
void invokeTopPSort(TopPSortParams& params, cudaStream_t stream);

struct TopPMinPFilterParams {
    void* sorted_logits; int* sorted_indices; int* kept;
    float* top_ps; float* min_ps;
    int batch_size; int vocab_size; int vocab_size_padded;
};
template<typename T>
void invokeTopPMinPFilter(TopPMinPFilterParams& params, cudaStream_t stream);
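
To illustrate what "initializing offset arrays for segmented sorting" might look like, here is a host-side sketch of the three buffers named in the invokeTopPSortInitialize() signature. It assumes that rows are laid out with a stride of vocab_size_padded and that only the first vocab_size slots of each row hold real tokens. The struct SortSegments and the function make_sort_segments are hypothetical names, and the role of top_ks is deliberately left out because the source does not describe it.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side mirror of the buffers a segmented sort needs.
struct SortSegments {
    std::vector<int> ids;            // token id for every slot in the padded batch
    std::vector<int> begin_offsets;  // first slot of each row's segment
    std::vector<int> end_offsets;    // one past the last real token of each row
};

// Assumption: row stride is vocab_size_padded and padding slots are
// excluded from the sorted segment.
SortSegments make_sort_segments(int vocab_size_padded, int vocab_size, std::size_t batch_size)
{
    SortSegments s;
    s.ids.resize(batch_size * static_cast<std::size_t>(vocab_size_padded));
    s.begin_offsets.resize(batch_size);
    s.end_offsets.resize(batch_size);
    for (std::size_t row = 0; row < batch_size; ++row) {
        const int begin      = static_cast<int>(row) * vocab_size_padded;
        s.begin_offsets[row] = begin;
        s.end_offsets[row]   = begin + vocab_size;
        for (int col = 0; col < vocab_size_padded; ++col) {
            s.ids[static_cast<std::size_t>(begin + col)] = col;  // id within the row
        }
    }
    return s;
}
```

Keeping separate begin and end arrays lets each segment stop before the padding, so padded slots never take part in the sort.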

Import

#include "src/turbomind/kernels/sampling_topp_kernels.h"

I/O Contract

Inputs

Name	Type	Required	Description
logits	void* / T*	Yes	Logit buffer (modified in-place for softmax)
top_ps	float*	Yes	Per-sequence top-P probability thresholds
min_ps	float*	No	Per-sequence min-P probability thresholds
top_ks	int*	No	Per-sequence top-K values (for initialization)
vocab_size	int	Yes	Actual vocabulary size
vocab_size_padded	int	Yes	Padded vocabulary size
batch_size	int	Yes	Number of sequences
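
The in-place softmax with a kept-count mask can be pictured with the CPU sketch below. It assumes that the first kept[row] entries of each row are the live candidates and that the remaining entries are zeroed. Both points are assumptions for illustration, and masked_softmax is a hypothetical name rather than a function from the header.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: softmax over the first kept[row] entries of each
// row, written back in place; everything after them is set to zero.
void masked_softmax(std::vector<float>&     logits,
                    int                     vocab_size_padded,
                    int                     batch_size,
                    const std::vector<int>& kept)
{
    for (int row = 0; row < batch_size; ++row) {
        float*    x = logits.data() + static_cast<std::size_t>(row) * vocab_size_padded;
        const int n = std::min(kept[row], vocab_size_padded);
        if (n <= 0) {
            std::fill(x, x + vocab_size_padded, 0.f);
            continue;
        }
        // Subtract the row maximum before exponentiating for numerical stability.
        const float max_val = *std::max_element(x, x + n);
        float       sum     = 0.f;
        for (int i = 0; i < n; ++i) {
            x[i] = std::exp(x[i] - max_val);
            sum += x[i];
        }
        for (int i = 0; i < n; ++i) {
            x[i] /= sum;
        }
        std::fill(x + n, x + vocab_size_padded, 0.f);  // masked-out tail
    }
}
```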

Outputs

Name	Type	Description
sorted_logits	void*	Sorted probability values in descending order
sorted_indices	int*	Token indices corresponding to sorted values
kept	int*	Number of tokens kept after filtering per sequence
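
As an illustration of how a consumer might use these three outputs, the sketch below draws one token for a single sequence from the kept prefix. The function pick_token and the uniform sample u in [0, 1) are hypothetical. Renormalizing over the kept mass is an assumption about how a sampler would treat the filtered distribution.

```cpp
#include <vector>

// Hypothetical consumer of the filter outputs for one sequence.
// `u` is a uniform random number in [0, 1). Returns -1 if nothing was kept.
int pick_token(const std::vector<float>& sorted_probs,
               const std::vector<int>&   sorted_indices,
               int                       kept,
               float                     u)
{
    // Total probability mass of the surviving tokens.
    float kept_mass = 0.f;
    for (int i = 0; i < kept; ++i) {
        kept_mass += sorted_probs[i];
    }
    // Scale u into the kept mass instead of renormalizing every entry.
    const float target     = u * kept_mass;
    float       cumulative = 0.f;
    for (int i = 0; i < kept; ++i) {
        cumulative += sorted_probs[i];
        if (cumulative > target) {
            return sorted_indices[i];  // map sorted position back to a token id
        }
    }
    // Guard against rounding: fall back to the last kept token.
    return kept > 0 ? sorted_indices[kept - 1] : -1;
}
```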

Usage Examples

using namespace turbomind;

// All pointers are assumed to be device buffers allocated by the caller,
// and `stream` an existing cudaStream_t.

// Top-P sort and filter (fields are listed in TopPSortParams declaration order)
TopPSortParams sort_params{logits, sorted_logits, sorted_indices,
    kept, top_ks, top_ps, batch_size, vocab_size, vocab_size_padded};
invokeTopPSort<half>(sort_params, stream);

// Combined top-P and min-P filter on the sorted output
TopPMinPFilterParams filter_params{sorted_logits, sorted_indices, kept,
    top_ps, min_ps, batch_size, vocab_size, vocab_size_padded};
invokeTopPMinPFilter<half>(filter_params, stream);
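
When checking the GPU results, a small CPU reference for the descending sort step can help. The sketch below produces the index order that sorted_indices would be expected to hold for one sequence. The helper argsort_descending is hypothetical, and the ordering of equal values on the GPU may differ from this stable CPU sort.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Hypothetical CPU reference: indices of `values` ordered from largest to
// smallest value. Ties keep their original relative order.
std::vector<int> argsort_descending(const std::vector<float>& values)
{
    std::vector<int> order(values.size());
    std::iota(order.begin(), order.end(), 0);  // 0, 1, 2, ...
    std::stable_sort(order.begin(), order.end(),
                     [&values](int a, int b) { return values[a] > values[b]; });
    return order;
}
```

Comparing the values gathered through both index arrays, rather than the indices themselves, avoids false mismatches when a row contains ties.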

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime
