Implementation:InternLM Lmdeploy SamplingPenaltyKernels
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, Sampling |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
CUDA kernels for applying sampling penalties to logits, including repetition penalty, temperature scaling, and minimum length penalty.
Description
This header declares three penalty kernels that modify logits in place before the sampling stage:
- ApplyRepetitionPenalty() rescales the logits of previously generated tokens, dividing or multiplying each by a penalty factor, to discourage the model from repeating tokens. It accepts per-sequence penalty values, pointers to each sequence's token ID history, and sequence lengths.
- invokeBatchApplyTemperaturePenalty_v2() scales logits by per-sequence temperature values and optionally adds a bias term.
- invokeMinLengthPenalty() sets the logits of end-of-sequence tokens to negative infinity for sequences that have not yet reached their minimum required length, preventing premature termination.
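The repetition penalty described above can be illustrated with a CPU reference. This is a minimal sketch, not the LMDeploy kernel: the function name `reference_repetition_penalty`, the `std::vector` layout, the choice to divide positive logits and multiply non-positive ones, and the choice to penalize each distinct token only once are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference; NOT part of LMDeploy.
// logits:     row-major [batch_size, vocab_size], modified in place
// penalties:  one repetition penalty per sequence (assumed > 0)
// history[b]: token IDs already generated for sequence b
void reference_repetition_penalty(std::vector<float>& logits,
                                  int vocab_size,
                                  const std::vector<float>& penalties,
                                  const std::vector<std::vector<int>>& history)
{
    const std::size_t batch_size = penalties.size();
    assert(history.size() == batch_size);
    assert(logits.size() == batch_size * static_cast<std::size_t>(vocab_size));

    for (std::size_t b = 0; b < batch_size; ++b) {
        const float penalty = penalties[b];
        assert(penalty > 0.f);
        float* row = logits.data() + b * static_cast<std::size_t>(vocab_size);

        // Assumption: a token that appears several times is penalized once.
        std::vector<bool> seen(vocab_size, false);
        for (int id : history[b]) {
            if (id < 0 || id >= vocab_size || seen[id]) {
                continue;  // skip out-of-range IDs and repeats
            }
            seen[id] = true;
            // Assumption: divide positive logits, multiply the rest, so a
            // penalty > 1 always lowers the token's score.
            row[id] = row[id] > 0.f ? row[id] / penalty : row[id] * penalty;
        }
    }
}
```

With a penalty of 2, a logit of 2.0 for a previously seen token becomes 1.0 and a logit of -2.0 becomes -4.0, while logits of unseen tokens are left unchanged.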
Usage
Use these kernels in the sampling pipeline after logit computation and before top-k/top-p filtering to control generation quality via repetition avoidance, temperature-based randomness, and minimum length enforcement.
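To make the temperature step concrete, the following CPU sketch shows one plausible reading of "scale by temperature, optionally add a bias". It is not the LMDeploy kernel. The function name `reference_temperature`, adding the bias before scaling, and masking the padded columns (indices from `vocab_size` up to `vocab_size_padded`) with the lowest finite float are assumptions made for illustration; the real kernel may order or mask these differently.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical CPU reference; NOT part of LMDeploy.
// logits: row-major [batch_size, vocab_size_padded], modified in place
// bias:   [vocab_size_padded] values, or nullptr when no bias is used
void reference_temperature(std::vector<float>& logits,
                           const float* bias,
                           const std::vector<float>& temperatures,
                           int vocab_size,
                           int vocab_size_padded)
{
    const std::size_t batch_size = temperatures.size();
    const std::size_t stride = static_cast<std::size_t>(vocab_size_padded);
    assert(vocab_size <= vocab_size_padded);
    assert(logits.size() == batch_size * stride);

    for (std::size_t b = 0; b < batch_size; ++b) {
        assert(temperatures[b] > 0.f);
        const float inv_temp = 1.0f / temperatures[b];
        float* row = logits.data() + b * stride;
        for (int v = 0; v < vocab_size_padded; ++v) {
            if (v < vocab_size) {
                // Assumption: bias is added before the temperature scaling.
                const float bias_val = bias ? bias[v] : 0.f;
                row[v] = (row[v] + bias_val) * inv_temp;
            } else {
                // Assumption: padded columns are masked so they are never sampled.
                row[v] = std::numeric_limits<float>::lowest();
            }
        }
    }
}
```

A temperature below 1 widens the gaps between logits and makes sampling more deterministic; a temperature above 1 narrows them and makes sampling more random.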
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/sampling_penalty_kernels.h
Signature
void ApplyRepetitionPenalty(Tensor&               logits,
                            const Buffer_<float>& penalties,
                            const Buffer_<int*>&  token_ids_ptrs,
                            const Buffer_<int>&   sequence_length,
                            cudaStream_t          stream);

template<typename T>
void invokeBatchApplyTemperaturePenalty_v2(
    T* logits, const T* bias, const float* temperatures,
    const int batch_size, const int vocab_size, const int vocab_size_padd,
    cudaStream_t stream);

template<typename T>
void invokeMinLengthPenalty(
    T* logits, const int* min_lengths, const int* sequence_lengths,
    const int vocab_size_padded, const int batch_size,
    const int* end_ids, const int end_ids_size, cudaStream_t stream);
Import
#include "src/turbomind/kernels/sampling_penalty_kernels.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
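| logits | Tensor / T* | Yes | Logit tensor to modify in-place |
| penalties | Buffer_<float> | Yes | Per-sequence repetition penalty factors |
| token_ids_ptrs | Buffer_<int*> | Yes | Pointers to each sequence's generated token IDs |
| sequence_length | Buffer_<int> | Yes | Per-sequence lengths used by the repetition penalty |
| bias | const T* | No | Optional bias term added to the logits |
| temperatures | const float* | Yes | Per-sequence temperature values |
| min_lengths | const int* | Yes | Minimum length threshold per sequence |
| sequence_lengths | const int* | Yes | Current length of each sequence, compared against min_lengths |
| end_ids | const int* | Yes | End-of-sequence token ID(s) |
| stream | cudaStream_t | Yes | CUDA stream on which the kernel is launched |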
Outputs
| Name | Type | Description |
|---|---|---|
| logits | Tensor / T* | Modified logits with penalties applied (in-place) |
Usage Examples
using namespace turbomind;

// Apply repetition penalty
ApplyRepetitionPenalty(logits, penalties, token_ptrs, seq_lengths, stream);

// Apply temperature scaling
invokeBatchApplyTemperaturePenalty_v2(
    logits_ptr, bias_ptr, temperatures, batch_size,
    vocab_size, vocab_size_padded, stream);

// Enforce minimum length
invokeMinLengthPenalty(logits_ptr, min_lens, seq_lens,
                       vocab_size_padded, batch_size, end_ids, 1, stream);
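The effect of the minimum-length call can be checked against a CPU reference. The sketch below is not the LMDeploy kernel: the function name `reference_min_length`, the single list of end IDs shared by every sequence, and the strict `sequence_length < min_length` test are assumptions made for illustration, and the real kernel's end ID layout and boundary condition may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical CPU reference; NOT part of LMDeploy.
// logits:  row-major [batch_size, vocab_size_padded], modified in place
// end_ids: end-of-sequence token IDs (assumed shared by all sequences)
void reference_min_length(std::vector<float>& logits,
                          const std::vector<int>& min_lengths,
                          const std::vector<int>& sequence_lengths,
                          int vocab_size_padded,
                          const std::vector<int>& end_ids)
{
    const std::size_t batch_size = min_lengths.size();
    const std::size_t stride = static_cast<std::size_t>(vocab_size_padded);
    assert(sequence_lengths.size() == batch_size);
    assert(logits.size() == batch_size * stride);

    const float neg_inf = -std::numeric_limits<float>::infinity();
    for (std::size_t b = 0; b < batch_size; ++b) {
        // Assumption: a sequence is "too short" while length < min_length.
        if (sequence_lengths[b] >= min_lengths[b]) {
            continue;  // long enough: end tokens stay eligible
        }
        float* row = logits.data() + b * stride;
        for (int id : end_ids) {
            if (id >= 0 && id < vocab_size_padded) {
                row[id] = neg_inf;  // end token can no longer be sampled
            }
        }
    }
}
```

Only the end-token columns of sequences that are still too short are changed; every other logit passes through untouched, so the step is safe to run on every decoding iteration.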