Implementation:InternLM Lmdeploy DecodingKernels
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, Transformer |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
CUDA kernels for the decoding phase of GPT-style inference: per-step embedding lookup with positional encoding and padding-offset handling, plus utilities that pad embedding weights to an aligned vocabulary size.
Description
This header declares the kernel launch functions used during the autoregressive decoding phase. invokeEmbeddingLookupPosEncodingPadCount() reads the token for the current step from all_ids, looks it up in the embedding table, applies the positional encoding, and accounts for per-sequence padding offsets. An overloaded convenience version omits the prompt-tuning parameters. invokePaddingEmbedding() and invokePaddingEmbeddingKernel() pad the embedding kernel (weight matrix) and bias to the padded vocabulary size for efficient softmax computation. invokePlusScalar() adds a scalar value to every element of a buffer. The header depends on the pPromptTuningParam type declared in gpt_kernels.h.
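The lookup described above can be sketched as a CPU reference. This is not the CUDA implementation: the function name `embeddingLookupReference` is hypothetical, and the tensor layouts, the order of operations (scale first, then add the positional encoding), and the way `padding_count` shifts the position index are all assumptions made for illustration. Prompt tuning is left out entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference; NOT part of decoding_kernels.h.
// Assumed row-major layouts (not confirmed by the header):
//   all_ids:           [steps, token_num]
//   embedding_table:   [vocab_size, hidden_units]
//   position_encoding: [max_positions, hidden_units]
//   from_tensor:       [local_token_num, hidden_units]
template <typename T>
void embeddingLookupReference(std::vector<T>&         from_tensor,
                              const std::vector<T>&   embedding_table,
                              const T*                position_encoding,  // may be nullptr
                              const std::vector<int>& all_ids,
                              const int*              padding_count,      // may be nullptr
                              int local_token_num, int hidden_units, T scale,
                              int step, int token_num, int ite)
{
    assert(from_tensor.size() == static_cast<std::size_t>(local_token_num) * hidden_units);

    // Assumption: the IDs for this step and micro-batch start at this offset.
    const int id_offset = step * token_num + ite * local_token_num;

    for (int row = 0; row < local_token_num; ++row) {
        const int token_id = all_ids[id_offset + row];
        // Assumption: padding shifts the effective position back by padding_count[row].
        const int position = step - (padding_count ? padding_count[row] : 0);
        for (int col = 0; col < hidden_units; ++col) {
            T val = embedding_table[token_id * hidden_units + col] * scale;
            if (position_encoding) {
                val += position_encoding[position * hidden_units + col];
            }
            from_tensor[row * hidden_units + col] = val;
        }
    }
}
```

A reference like this is mainly useful for checking GPU output on small inputs; compare against the real kernel before relying on any of the assumed layouts.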
Usage
Use these kernels during the generation (decoding) phase of GPT inference to look up embeddings for newly generated tokens at each autoregressive step, and to prepare vocabulary-padded embedding matrices.
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/decoding_kernels.h
Signature
template<typename T>
void invokeEmbeddingLookupPosEncodingPadCount(
    T* from_tensor, const T* embedding_table, const T* position_encoding,
    const int* all_ids, const int* padding_count,
    pPromptTuningParam<T> prompt_param,
    const int local_token_num, const int hidden_units, const T scale,
    const int step, const int token_num, const int ite, const int seq_len,
    cudaStream_t stream);

template<typename T>
void invokePaddingEmbedding(
    T* padded_embedding_kernel, T* padded_embedding_bias,
    const T* embedding_kernel, const T* embedding_bias,
    const int hidden_unit, const int vocab_size, const int vocab_size_padded,
    cudaStream_t stream);

template<typename T>
void invokePlusScalar(T* buf, const T val, const int size, cudaStream_t stream);
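
For comparison with the invokePlusScalar() declaration, the element-wise effect it is described as having can be written as a short CPU loop. The name `plusScalarReference` is hypothetical, and the sketch runs on host memory with no stream argument, so it only illustrates the arithmetic.

```cpp
#include <cassert>

// Hypothetical CPU reference; NOT part of decoding_kernels.h.
// Adds `val` to each of the first `size` elements of `buf`, in place.
template <typename T>
void plusScalarReference(T* buf, const T val, const int size)
{
    assert(buf != nullptr || size == 0);
    for (int i = 0; i < size; ++i) {
        buf[i] += val;
    }
}
```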
Import
#include "src/turbomind/kernels/decoding_kernels.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| all_ids | const int* | Yes | All token IDs generated so far |
| embedding_table | const T* | Yes | Token embedding weight matrix |
| position_encoding | const T* | No | Positional encoding table |
| padding_count | const int* | No | Per-sequence padding counts |
| step | int | Yes | Current decoding step index |
| hidden_units | int | Yes | Hidden dimension size |
Outputs
| Name | Type | Description |
|---|---|---|
| from_tensor | T* | Embedded output with positional encoding applied |
| padded_embedding_kernel | T* | Vocabulary-padded embedding matrix |
Usage Examples
using namespace turbomind;

// Decode-step embedding lookup, using the convenience overload that omits
// the prompt-tuning parameters (no prompt_param or seq_len argument)
invokeEmbeddingLookupPosEncodingPadCount(
    decoder_input, embed_table, pos_enc, all_ids, padding_count,
    local_batch_size, hidden_dim, T(1.0), step, token_num, ite, stream);

// Pad the embedding kernel and bias to the padded vocabulary size for softmax alignment
invokePaddingEmbedding(padded_kernel, padded_bias,
    embed_kernel, embed_bias, hidden_dim, vocab_size, vocab_padded, stream);
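
To make the effect of the padding call concrete, here is a CPU sketch of what padding to `vocab_size_padded` could look like. The name `paddingEmbeddingReference` is hypothetical. The sketch assumes a row-major `[hidden_unit, vocab_size]` kernel layout and zero-filled padding columns, neither of which is stated by the header.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference; NOT part of decoding_kernels.h.
// Assumptions: kernel is row-major [hidden_unit, vocab_size],
// and the extra (vocab_size_padded - vocab_size) columns are zero-filled.
template <typename T>
void paddingEmbeddingReference(std::vector<T>&       padded_kernel,
                               std::vector<T>&       padded_bias,
                               const std::vector<T>& kernel,
                               const std::vector<T>& bias,
                               int hidden_unit, int vocab_size, int vocab_size_padded)
{
    assert(vocab_size <= vocab_size_padded);
    assert(kernel.size() == static_cast<std::size_t>(hidden_unit) * vocab_size);
    assert(bias.size() == static_cast<std::size_t>(vocab_size));

    // Start from all zeros so the padding columns need no further work.
    padded_kernel.assign(static_cast<std::size_t>(hidden_unit) * vocab_size_padded, T(0));
    padded_bias.assign(static_cast<std::size_t>(vocab_size_padded), T(0));

    for (int row = 0; row < hidden_unit; ++row) {
        for (int col = 0; col < vocab_size; ++col) {
            padded_kernel[row * vocab_size_padded + col] = kernel[row * vocab_size + col];
        }
    }
    for (int col = 0; col < vocab_size; ++col) {
        padded_bias[col] = bias[col];
    }
}
```

With zero-filled padding, the extra vocabulary slots produce zero logits rather than negative infinity, so downstream code would still need to exclude them from sampling; how the real pipeline handles that is outside what this page documents.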