Implementation:InternLM Lmdeploy DecodingTemplate

Knowledge Sources	InternLM_Lmdeploy
Domains	GPU_Kernels, Attention
Last Updated	2026-02-07 15:00 GMT

Overview

Template function that launches decoding-phase attention kernels with GQA-aware CTA partitioning, occupancy-based split-K selection, and optional post-kernel reduction.

Description

invokeDecoding<Kernel> mirrors invokeAttention but is specialized for the decoding phase (single-token generation). It computes the GQA group size and determines how many CTAs are needed per query group based on Kernel::CTA_H. It queries device occupancy to select an optimal split count, constructs the DecodingCtaMap grid, launches the kernel, and invokes reduction when split_cnt > 1 or context parallelism is active.
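
The host-side arithmetic described above can be sketched in plain C++. This is a simplified illustration, not the code in decoding_template.h: the helper names (cdiv, ctas_per_group, pick_split_count, needs_reduction), the max_split cap, the cp_size parameter, and the "fill the device once" heuristic are all assumptions made for this example. On a real device, ctas_per_sm would typically come from a query such as cudaOccupancyMaxActiveBlocksPerMultiprocessor.

```cpp
#include <algorithm>
#include <cassert>

// NOTE: all names below are hypothetical and exist only for illustration.

// Ceiling division for positive integers.
constexpr int cdiv(int a, int b)
{
    return (a + b - 1) / b;
}

// Number of CTAs needed to cover one GQA group when each CTA
// processes `cta_h` query heads (the role played by Kernel::CTA_H).
inline int ctas_per_group(int head_num, int kv_head_num, int cta_h)
{
    assert(head_num > 0 && kv_head_num > 0 && cta_h > 0);
    assert(head_num % kv_head_num == 0);
    const int group_size = head_num / kv_head_num;  // query heads per KV head
    return cdiv(group_size, cta_h);
}

// Assumed heuristic: split the KV sequence just enough for the grid to
// fill the resident-CTA capacity once, capped at `max_split`.
inline int pick_split_count(int base_ctas, int sm_count, int ctas_per_sm, int max_split)
{
    assert(base_ctas > 0 && sm_count > 0 && ctas_per_sm > 0 && max_split > 0);
    const int capacity = sm_count * ctas_per_sm;  // CTAs resident at once
    if (base_ctas >= capacity) {
        return 1;  // the grid already saturates the device
    }
    return std::max(1, std::min(capacity / base_ctas, max_split));
}

// Reduction runs when partial results exist: more than one split, or
// context parallelism active (modelled here as cp_size > 1).
inline bool needs_reduction(int split_cnt, int cp_size)
{
    return split_cnt > 1 || cp_size > 1;
}
```

For example, with 32 query heads, 8 KV heads, and cta_h = 8, the group size is 4 and one CTA covers each group. If the unsplit grid has 32 CTAs on a device with 108 SMs and 2 resident CTAs per SM, this sketch selects 6 splits, so the reduction step would follow the kernel.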

Usage

Called by the TurboMind decoding dispatch layer to run single-token attention. The Kernel template parameter is assembled from DecodingConfig.

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/kernels/attention/decoding_template.h
Lines: 1-104

Signature

template<class Kernel>
bool invokeDecoding(const typename Kernel::ParamType& params);

Import

#include "src/turbomind/kernels/attention/decoding_template.h"

I/O Contract

Inputs

Name	Type	Required	Description
params	Kernel::ParamType (AttentionParams<T>)	Yes	Fully populated attention parameters struct

Outputs

Name	Type	Description
params.out	T*	Attention output for the decoded token
return	bool	Always returns true

Usage Examples

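// Assemble the kernel type from DecodingConfig (SM80, half-precision types).
using Config = DecodingConfig<arch::Sm80, half, half, 8, 128>;
// params must be fully populated before the call.
invokeDecoding<Config::Kernel>(params);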

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime
