

Implementation:InternLM Lmdeploy DecodingKernels

From Leeroopedia


Knowledge Sources
Domains: GPU_Kernels, Transformer
Last Updated: 2026-02-07 15:00 GMT

Overview

CUDA kernels for the decoding phase of GPT-style inference: an embedding lookup that applies positional encoding and compensates for padding, plus utilities that pad the embedding kernel and bias to an aligned vocabulary size.

Description

This header declares the kernel launchers used during the autoregressive decoding phase. invokeEmbeddingLookupPosEncodingPadCount() reads the token for the current step from all_ids, looks it up in the embedding table, applies the positional encoding, and uses padding_count to account for padding offsets. An overloaded convenience version omits the prompt-tuning parameters. invokePaddingEmbedding() and invokePaddingEmbeddingKernel() pad the embedding kernel and bias from vocab_size to vocab_size_padded for efficient softmax computation. invokePlusScalar() adds a scalar value to every element of a buffer. The prompt-tuning overload takes a pPromptTuningParam<T>, which comes from gpt_kernels.h.
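The lookup described above can be sketched on the CPU to make the indexing concrete. This is a minimal reference, not the CUDA implementation: the function name embeddingLookupPosEncodingPadCountRef, the tensor layouts, the way ite selects a slice of local_token_num ids within the current step, and the use of scale as a multiplier on the embedding are all assumptions that this page does not confirm.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for one decoding step (illustrative; not the CUDA kernel).
// Assumed row-major layouts (not confirmed by this page):
//   all_ids           : [max_steps, token_num]
//   embedding_table   : [vocab_size, hidden_units]
//   position_encoding : [max_positions, hidden_units]
//   from_tensor       : [local_token_num, hidden_units]
template<typename T>
void embeddingLookupPosEncodingPadCountRef(std::vector<T>&         from_tensor,
                                           const std::vector<T>&   embedding_table,
                                           const std::vector<T>&   position_encoding,
                                           const std::vector<int>& all_ids,
                                           const std::vector<int>& padding_count,  // may be empty
                                           int local_token_num, int hidden_units, T scale,
                                           int step, int token_num, int ite)
{
    // Index of the first id of this micro-batch within the ids of `step`.
    const std::size_t id_offset = static_cast<std::size_t>(step) * token_num
                                  + static_cast<std::size_t>(ite) * local_token_num;
    for (int row = 0; row < local_token_num; ++row) {
        const int token_id = all_ids[id_offset + row];
        // Shift the position back by the number of padded slots, if provided.
        const int position = padding_count.empty() ? step : step - padding_count[row];
        assert(position >= 0);
        for (int col = 0; col < hidden_units; ++col) {
            const T emb = embedding_table[static_cast<std::size_t>(token_id) * hidden_units + col];
            const T pos = position_encoding[static_cast<std::size_t>(position) * hidden_units + col];
            from_tensor[static_cast<std::size_t>(row) * hidden_units + col] = emb * scale + pos;
        }
    }
}
```

A reference like this is mainly useful for checking a GPU result against a small hand-computed case; the real kernel may order or fuse these operations differently.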

Usage

Use these kernels during the generation (decoding) phase of GPT inference to look up embeddings for newly generated tokens at each autoregressive step, and to prepare vocabulary-padded embedding matrices.
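As a rough illustration of the vocabulary padding step, the sketch below copies the kernel and bias into larger buffers on the CPU. The name paddingEmbeddingRef, the [hidden_unit, vocab_size] layout, and the zero fill of the padded entries are assumptions made for illustration, not details taken from the header.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for vocabulary padding (illustrative; not the CUDA kernel).
// Assumed row-major layouts (not confirmed by this page):
//   kernel        : [hidden_unit, vocab_size]
//   padded_kernel : [hidden_unit, vocab_size_padded]
//   bias          : [vocab_size], padded_bias : [vocab_size_padded]
template<typename T>
void paddingEmbeddingRef(std::vector<T>&       padded_kernel,
                         std::vector<T>&       padded_bias,
                         const std::vector<T>& kernel,
                         const std::vector<T>& bias,
                         int hidden_unit, int vocab_size, int vocab_size_padded)
{
    assert(vocab_size_padded >= vocab_size);
    // Start from all zeros so that the padded tail needs no separate pass.
    padded_kernel.assign(static_cast<std::size_t>(hidden_unit) * vocab_size_padded, T(0));
    padded_bias.assign(static_cast<std::size_t>(vocab_size_padded), T(0));
    for (int row = 0; row < hidden_unit; ++row) {
        for (int col = 0; col < vocab_size; ++col) {
            padded_kernel[static_cast<std::size_t>(row) * vocab_size_padded + col] =
                kernel[static_cast<std::size_t>(row) * vocab_size + col];
        }
    }
    for (int col = 0; col < vocab_size; ++col) {
        padded_bias[col] = bias[col];
    }
}
```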

Code Reference

Source Location

Signature

template<typename T>
void invokeEmbeddingLookupPosEncodingPadCount(
    T* from_tensor, const T* embedding_table, const T* position_encoding,
    const int* all_ids, const int* padding_count,
    pPromptTuningParam<T> prompt_param,
    const int local_token_num, const int hidden_units, const T scale,
    const int step, const int token_num, const int ite, const int seq_len,
    cudaStream_t stream);

template<typename T>
void invokePaddingEmbedding(
    T* padded_embedding_kernel, T* padded_embedding_bias,
    const T* embedding_kernel, const T* embedding_bias,
    const int hidden_unit, const int vocab_size, const int vocab_size_padded,
    cudaStream_t stream);

template<typename T>
void invokePlusScalar(T* buf, const T val, const int size, cudaStream_t stream);

Import

#include "src/turbomind/kernels/decoding_kernels.h"

I/O Contract

Inputs

Name               Type        Required  Description
all_ids            const int*  Yes       All token IDs generated so far
embedding_table    const T*    Yes       Token embedding weight matrix
position_encoding  const T*    No        Positional encoding table
padding_count      const int*  No        Per-sequence padding counts
step               int         Yes       Current decoding step index
hidden_units       int         Yes       Hidden dimension size

Outputs

Name                     Type  Description
from_tensor              T*    Embedded output with positional encoding applied
padded_embedding_kernel  T*    Vocabulary-padded embedding matrix

Usage Examples

using namespace turbomind;

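// Decode-step embedding lookup. The call passes neither prompt_param nor
// seq_len, so it presumably targets the convenience overload without prompt tuning.
// T is the buffer element type and must be in scope (e.g. a template parameter).
invokeEmbeddingLookupPosEncodingPadCount(
    decoder_input, embed_table, pos_enc, all_ids, padding_count,
    local_batch_size, hidden_dim, T(1.0), step, token_num, ite, stream);

// Pad the embedding kernel and bias from vocab_size to vocab_padded for softmax alignment
invokePaddingEmbedding(padded_kernel, padded_bias,
    embed_kernel, embed_bias, hidden_dim, vocab_size, vocab_padded, stream);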
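The vocab_padded value in the call above has to be computed by the caller. One common approach, sketched here as an assumption, is to round vocab_size up to a multiple that suits the downstream GEMM or softmax kernels. roundUpToMultiple is a hypothetical helper that is not declared in decoding_kernels.h, and the right multiple depends on the deployment.

```cpp
#include <cassert>

// Hypothetical helper (not part of decoding_kernels.h): smallest multiple of
// `multiple` that is greater than or equal to `value`.
inline int roundUpToMultiple(int value, int multiple)
{
    assert(multiple > 0 && value >= 0);
    return ((value + multiple - 1) / multiple) * multiple;
}
```

With a multiple of 8, for example, a vocabulary of 50257 entries would be padded to 50264, while a vocabulary of 32000 entries would stay unchanged.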
