

Implementation:InternLM Lmdeploy SamplingPenaltyKernels

From Leeroopedia


Knowledge Sources
Domains: GPU_Kernels, Sampling
Last Updated: 2026-02-07 15:00 GMT

Overview

CUDA kernels for applying sampling penalties to logits, including repetition penalty, temperature scaling, and minimum length penalty.

Description

This header declares three penalty functions that modify logits in place before the sampling stage. ApplyRepetitionPenalty() rescales the logits of tokens a sequence has already generated, dividing or multiplying them by a penalty factor so that the model is less likely to repeat those tokens. It takes per-sequence penalty values, pointers to each sequence's token ID history, and the sequence lengths. invokeBatchApplyTemperaturePenalty_v2() scales logits by per-sequence temperature values and optionally adds a bias term. invokeMinLengthPenalty() sets the logits of end-of-sequence tokens to negative infinity for sequences that have not yet reached their minimum required length, which prevents premature termination.
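As a rough CPU-side illustration of the repetition penalty described above, the sketch below processes a single sequence. It is not the lmdeploy kernel: the name `apply_repetition_penalty_ref` is hypothetical, and two conventions are assumed rather than taken from the header: a positive logit is divided by the penalty while a negative logit is multiplied by it, and each distinct token is penalized only once.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for one sequence; NOT the lmdeploy CUDA kernel.
// Assumption: positive logits are divided by the penalty and negative logits
// are multiplied by it, so a penalty > 1 always lowers the token's score.
// Assumption: a token that appears several times in the history is penalized once.
inline void apply_repetition_penalty_ref(std::vector<float>&     logits,   // [vocab_size]
                                         const std::vector<int>& history,  // generated token IDs
                                         float                   penalty)
{
    std::vector<bool> seen(logits.size(), false);
    for (int id : history) {
        // Skip IDs outside the vocabulary and IDs already handled.
        if (id < 0 || static_cast<std::size_t>(id) >= logits.size() || seen[id]) {
            continue;
        }
        seen[id] = true;
        float& v = logits[id];
        v = (v < 0.0f) ? v * penalty : v / penalty;
    }
}
```

With a penalty of 2, a previously generated token with logit 2 would drop to 1 and one with logit -2 would drop to -4, while tokens absent from the history are unchanged.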

Usage

Use these kernels in the sampling pipeline after logit computation and before top-k/top-p filtering to control generation quality via repetition avoidance, temperature-based randomness, and minimum length enforcement.
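The temperature step mentioned above can be illustrated with a small CPU reference. This is a sketch under stated assumptions, not the library's implementation: the function name `apply_temperature_ref` is hypothetical, the logits are assumed to be row-major with shape [batch_size, vocab_size_padded], the optional bias is assumed to be added before dividing by the temperature, and padding columns are simply left untouched here.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for per-sequence temperature scaling.
// Assumptions (not confirmed by the header):
//   - logits is row-major [batch_size, vocab_size_padded]
//   - bias has vocab_size entries, or is empty when no bias is used
//   - bias is added before the division by temperature
//   - columns in [vocab_size, vocab_size_padded) are left unchanged
inline void apply_temperature_ref(std::vector<float>&       logits,
                                  const std::vector<float>& bias,
                                  const std::vector<float>& temperatures,
                                  int                       batch_size,
                                  int                       vocab_size,
                                  int                       vocab_size_padded)
{
    for (int b = 0; b < batch_size; ++b) {
        // The caller is expected to pass strictly positive temperatures.
        const float inv_t = 1.0f / temperatures[b];
        for (int v = 0; v < vocab_size; ++v) {
            const std::size_t i = static_cast<std::size_t>(b) * vocab_size_padded + v;
            const float x = bias.empty() ? logits[i] : logits[i] + bias[v];
            logits[i] = x * inv_t;
        }
    }
}
```

A temperature above 1 shrinks the gaps between logits and makes sampling more random; a temperature below 1 widens them and makes sampling more deterministic.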

Code Reference

Source Location

Signature

void ApplyRepetitionPenalty(Tensor&               logits,
                            const Buffer_<float>& penalties,
                            const Buffer_<int*>&  token_ids_ptrs,
                            const Buffer_<int>&   sequence_length,
                            cudaStream_t          stream);

template<typename T>
void invokeBatchApplyTemperaturePenalty_v2(
    T* logits, const T* bias, const float* temperatures,
    const int batch_size, const int vocab_size, const int vocab_size_padd,
    cudaStream_t stream);

template<typename T>
void invokeMinLengthPenalty(
    T* logits, const int* min_lengths, const int* sequence_lengths,
    const int vocab_size_padded, const int batch_size,
    const int* end_ids, const int end_ids_size, cudaStream_t stream);
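To make the role of the `invokeMinLengthPenalty` parameters more concrete, here is a minimal CPU sketch of the masking it is described as performing. The name `min_length_penalty_ref` is hypothetical, and the layouts are assumptions: logits as row-major [batch_size, vocab_size_padded], end_ids as [batch_size, end_ids_size], and a sequence counted as "too short" when its length is strictly less than its minimum.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical CPU reference for the minimum-length penalty.
// Assumptions (not confirmed by the header):
//   - logits is row-major [batch_size, vocab_size_padded]
//   - end_ids is row-major [batch_size, end_ids_size]
//   - a sequence is masked while sequence_length < min_length
inline void min_length_penalty_ref(std::vector<float>&     logits,
                                   const std::vector<int>& min_lengths,
                                   const std::vector<int>& sequence_lengths,
                                   int                     vocab_size_padded,
                                   int                     batch_size,
                                   const std::vector<int>& end_ids,
                                   int                     end_ids_size)
{
    const float neg_inf = -std::numeric_limits<float>::infinity();
    for (int b = 0; b < batch_size; ++b) {
        if (sequence_lengths[b] >= min_lengths[b]) {
            continue;  // long enough: end-of-sequence tokens stay available
        }
        for (int e = 0; e < end_ids_size; ++e) {
            const int id = end_ids[static_cast<std::size_t>(b) * end_ids_size + e];
            if (id < 0 || id >= vocab_size_padded) {
                continue;  // ignore IDs outside the (padded) vocabulary
            }
            logits[static_cast<std::size_t>(b) * vocab_size_padded + id] = neg_inf;
        }
    }
}
```

Because a logit of negative infinity maps to zero probability after softmax, a masked end-of-sequence token cannot be sampled until the sequence reaches its minimum length.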

Import

#include "src/turbomind/kernels/sampling_penalty_kernels.h"

I/O Contract

Inputs

Name            Type            Required  Description
logits          Tensor / T*     Yes       Logit tensor to modify in-place
penalties       Buffer_<float>  Yes       Per-sequence repetition penalty factors
token_ids_ptrs  Buffer_<int*>   Yes       Pointers to each sequence's generated token IDs
temperatures    const float*    Yes       Per-sequence temperature values
min_lengths     const int*      Yes       Minimum length threshold per sequence
end_ids         const int*      Yes       End-of-sequence token ID(s)

Outputs

Name    Type         Description
logits  Tensor / T*  Modified logits with penalties applied (in-place)

Usage Examples

using namespace turbomind;

// Apply the repetition penalty to tokens each sequence has already generated
ApplyRepetitionPenalty(logits, penalties, token_ptrs, seq_lengths, stream);

// Apply per-sequence temperature scaling (bias_ptr supplies the optional bias term)
invokeBatchApplyTemperaturePenalty_v2(
    logits_ptr, bias_ptr, temperatures, batch_size,
    vocab_size, vocab_size_padded, stream);

// Enforce minimum length; the literal 1 is end_ids_size
invokeMinLengthPenalty(logits_ptr, min_lens, seq_lens,
    vocab_size_padded, batch_size, end_ids, 1, stream);
