Implementation:InternLM Lmdeploy SamplingPenaltyKernels

Knowledge Sources	InternLM_Lmdeploy
Domains	GPU_Kernels, Sampling
Last Updated	2026-02-07 15:00 GMT

Overview

CUDA kernels for applying sampling penalties to logits, including repetition penalty, temperature scaling, and minimum length penalty.

Description

This header declares three host-callable functions that launch penalty kernels on the logits before the sampling stage; all of them modify the logits in place. ApplyRepetitionPenalty() rescales the logits of tokens that each sequence has already generated, dividing or multiplying them by a penalty factor so that repeated tokens become less likely. It accepts per-sequence penalty values, pointers to each sequence's token ID history, and the sequence lengths. invokeBatchApplyTemperaturePenalty_v2() scales logits by per-sequence temperature values and optionally adds a bias term. invokeMinLengthPenalty() sets the logits of end-of-sequence tokens to negative infinity for sequences that have not yet reached their minimum required length, which prevents premature termination.
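The repetition penalty described above can be illustrated with a small CPU reference. This is a minimal sketch rather than the LMDeploy kernel: the function name reference_repetition_penalty and the std::vector-based layout are hypothetical, and it assumes the widely used convention in which positive logits are divided by the penalty and negative logits are multiplied by it, with each token penalized once no matter how often it appears in the history. The actual kernel should be checked before relying on either assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference, not part of LMDeploy.
// Assumptions: logits are row-major [batch, vocab_size]; positive logits are
// divided by the penalty, negative logits are multiplied by it; a token that
// occurs several times in the history is penalized only once.
void reference_repetition_penalty(std::vector<float>&                  logits,
                                  const std::vector<float>&            penalties,
                                  const std::vector<std::vector<int>>& token_ids,
                                  std::size_t                          vocab_size)
{
    const std::size_t batch = penalties.size();
    assert(logits.size() == batch * vocab_size);
    assert(token_ids.size() == batch);

    for (std::size_t b = 0; b < batch; ++b) {
        std::vector<bool> seen(vocab_size, false);
        float*            row = logits.data() + b * vocab_size;
        for (int id : token_ids[b]) {
            // Skip out-of-range IDs and tokens that were already penalized.
            if (id < 0 || static_cast<std::size_t>(id) >= vocab_size || seen[id]) {
                continue;
            }
            seen[id] = true;
            row[id]  = row[id] > 0.f ? row[id] / penalties[b] : row[id] * penalties[b];
        }
    }
}
```

With a penalty greater than 1, both branches push the logit toward lower probability, which is why the operation depends on the sign of the logit.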

Usage

Use these kernels in the sampling pipeline after logit computation and before top-k/top-p filtering to control generation quality via repetition avoidance, temperature-based randomness, and minimum length enforcement.

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/kernels/sampling_penalty_kernels.h

Signature

void ApplyRepetitionPenalty(Tensor&               logits,
                            const Buffer_<float>& penalties,
                            const Buffer_<int*>&  token_ids_ptrs,
                            const Buffer_<int>&   sequence_length,
                            cudaStream_t          stream);

template<typename T>
void invokeBatchApplyTemperaturePenalty_v2(
    T* logits, const T* bias, const float* temperatures,
    const int batch_size, const int vocab_size, const int vocab_size_padd,
    cudaStream_t stream);

template<typename T>
void invokeMinLengthPenalty(
    T* logits, const int* min_lengths, const int* sequence_lengths,
    const int vocab_size_padded, const int batch_size,
    const int* end_ids, const int end_ids_size, cudaStream_t stream);
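The parameter list of invokeBatchApplyTemperaturePenalty_v2() suggests the following per-element behavior, shown here as a CPU sketch. The name reference_temperature is hypothetical, and several details are assumptions that the header alone does not confirm: logits are taken to be row-major with a row stride of vocab_size_padd, the bias (when non-null) is added before dividing by the temperature, and the padded columns beyond vocab_size are masked with the lowest finite float.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical CPU reference, not part of LMDeploy.
// Assumptions: logits are row-major [batch_size, vocab_size_padd]; bias may be
// null and, if present, has at least vocab_size entries and is added before
// scaling; padded columns are masked with the lowest finite float.
void reference_temperature(float*       logits,
                           const float* bias,
                           const float* temperatures,
                           int          batch_size,
                           int          vocab_size,
                           int          vocab_size_padd)
{
    assert(vocab_size <= vocab_size_padd);
    for (int b = 0; b < batch_size; ++b) {
        assert(temperatures[b] > 0.f);
        float* row = logits + static_cast<std::size_t>(b) * vocab_size_padd;
        for (int v = 0; v < vocab_size_padd; ++v) {
            if (v < vocab_size) {
                const float x = bias ? row[v] + bias[v] : row[v];
                row[v]        = x / temperatures[b];
            }
            else {
                // Padding columns must never be sampled.
                row[v] = std::numeric_limits<float>::lowest();
            }
        }
    }
}
```

Temperatures below 1 sharpen the distribution and temperatures above 1 flatten it; the sketch rejects non-positive temperatures instead of guessing how the kernel guards against division by zero.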

Import

#include "src/turbomind/kernels/sampling_penalty_kernels.h"

I/O Contract

Inputs

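Name	Type	Required	Description
logits	Tensor / T*	Yes	Logit tensor to modify in-place
penalties	Buffer_<float>	Yes	Per-sequence repetition penalty factors
token_ids_ptrs	Buffer_<int*>	Yes	Pointers to each sequence's generated token IDs
sequence_length / sequence_lengths	Buffer_<int> / const int*	Yes	Current length of each sequence
bias	const T*	No	Optional bias term added to the logits
temperatures	const float*	Yes	Per-sequence temperature values
batch_size	const int	Yes	Number of sequences in the batch
vocab_size	const int	Yes	Vocabulary size
vocab_size_padd / vocab_size_padded	const int	Yes	Padded vocabulary size
min_lengths	const int*	Yes	Minimum length threshold per sequence
end_ids	const int*	Yes	End-of-sequence token ID(s)
end_ids_size	const int	Yes	Number of end-of-sequence token IDs
stream	cudaStream_t	Yes	CUDA stream on which the kernels are launched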

Outputs

Name	Type	Description
logits	Tensor / T*	Modified logits with penalties applied (in-place)

Usage Examples

using namespace turbomind;

// Apply repetition penalty
ApplyRepetitionPenalty(logits, penalties, token_ptrs, seq_lengths, stream);

// Apply temperature scaling
invokeBatchApplyTemperaturePenalty_v2(
    logits_ptr, bias_ptr, temperatures, batch_size,
    vocab_size, vocab_size_padded, stream);

// Enforce minimum length
invokeMinLengthPenalty(logits_ptr, min_lens, seq_lens,
    vocab_size_padded, batch_size, end_ids, 1, stream);
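The effect of the minimum length call can be checked against a CPU reference such as the sketch below. The name reference_min_length is hypothetical. The sketch assumes that end_ids holds end_ids_size entries per sequence (laid out as [batch_size, end_ids_size]) and that a sequence is masked while sequence_lengths[b] < min_lengths[b]; the real kernel may use a different layout or an off-by-one variant of the comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical CPU reference, not part of LMDeploy.
// Assumptions: logits are row-major [batch_size, vocab_size_padded]; end_ids
// is laid out as [batch_size, end_ids_size]; a sequence counts as too short
// while sequence_lengths[b] < min_lengths[b].
void reference_min_length(float*     logits,
                          const int* min_lengths,
                          const int* sequence_lengths,
                          int        vocab_size_padded,
                          int        batch_size,
                          const int* end_ids,
                          int        end_ids_size)
{
    const float neg_inf = -std::numeric_limits<float>::infinity();
    for (int b = 0; b < batch_size; ++b) {
        if (sequence_lengths[b] >= min_lengths[b]) {
            continue;  // long enough, end-of-sequence tokens stay allowed
        }
        for (int e = 0; e < end_ids_size; ++e) {
            const int end_id = end_ids[b * end_ids_size + e];
            assert(end_id >= 0 && end_id < vocab_size_padded);
            logits[static_cast<std::size_t>(b) * vocab_size_padded + end_id] = neg_inf;
        }
    }
}
```

Because a logit of negative infinity maps to zero probability after softmax, a masked sequence cannot emit an end-of-sequence token until it reaches its minimum length.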

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime
