Implementation:InternLM Lmdeploy DecodingTemplate
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, Attention |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Template function that launches decoding-phase attention kernels with GQA-aware CTA partitioning, occupancy-based split-K selection, and optional post-kernel reduction.
Description
invokeDecoding<Kernel> mirrors invokeAttention but is specialized for the decoding phase (single-token generation). It computes the GQA group size (the number of query heads that share one KV head) and uses Kernel::CTA_H to determine how many CTAs each query group needs. It then queries device occupancy to select a split count, constructs the DecodingCtaMap grid, and launches the kernel. A reduction step follows when split_cnt > 1 or context parallelism is active.
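The partitioning arithmetic described above can be sketched on the host side as follows. This is a minimal illustration, not the code in decoding_template.h: GridPlan, plan_decoding_grid, needs_reduction, and the parameters sm_count, ctas_per_sm, max_kv_tiles, and cp_size are hypothetical names, and the "split until the device is full" rule is an assumed heuristic standing in for whatever occupancy-based policy the real launcher applies. In a real launcher, ctas_per_sm would likely come from the CUDA occupancy API rather than being passed in. The sketch needs C++17 for std::clamp.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical plan struct for illustration; not a TurboMind type.
struct GridPlan {
    int query_group_sz;   // query heads sharing one KV head (GQA group size)
    int cta_per_q_group;  // CTAs needed to cover one query group
    int split_cnt;        // split-K factor along the KV sequence
};

constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }

// cta_h plays the role of Kernel::CTA_H: query heads handled by one CTA.
// sm_count and ctas_per_sm describe device occupancy (assumed inputs).
// max_kv_tiles caps the split count so that no split is left without work.
inline GridPlan plan_decoding_grid(int head_num, int kv_head_num, int batch_size,
                                   int cta_h, int max_kv_tiles,
                                   int sm_count, int ctas_per_sm)
{
    assert(head_num > 0 && kv_head_num > 0 && head_num % kv_head_num == 0);
    assert(batch_size > 0 && cta_h > 0 && sm_count > 0 && ctas_per_sm > 0);

    GridPlan plan{};
    plan.query_group_sz  = head_num / kv_head_num;
    plan.cta_per_q_group = ceil_div(plan.query_group_sz, cta_h);

    // CTAs launched if the KV sequence is not split at all.
    const int base_ctas = batch_size * kv_head_num * plan.cta_per_q_group;
    // CTAs that can be resident on the device at the same time.
    const int capacity = sm_count * ctas_per_sm;

    // Assumed heuristic: split just enough to fill the device.
    const int wanted = ceil_div(capacity, base_ctas);
    plan.split_cnt   = std::clamp(wanted, 1, std::max(1, max_kv_tiles));
    return plan;
}

// Reduction condition as stated in the description; cp_size is a
// hypothetical stand-in for the context-parallel world size.
inline bool needs_reduction(int split_cnt, int cp_size)
{
    return split_cnt > 1 || cp_size > 1;
}
```

With a small batch the base grid underfills the device, so the sketch picks a large split count. With a large batch the base grid already exceeds device capacity, the split count stays at 1, and no reduction is needed unless context parallelism is active.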
Usage
Called by the TurboMind decoding dispatch layer to run single-token attention. The Kernel template parameter is assembled from DecodingConfig.
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/attention/decoding_template.h
- Lines: 1-104
Signature
template<class Kernel>
bool invokeDecoding(const typename Kernel::ParamType& params);
Import
#include "src/turbomind/kernels/attention/decoding_template.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| params | Kernel::ParamType (AttentionParams<T>) | Yes | Fully populated attention parameters struct |
Outputs
| Name | Type | Description |
|---|---|---|
| params.out | T* | Attention output for the decoded token |
| return | bool | true after the kernel and any required reduction have been launched |
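When the KV sequence is split, each split sees only part of the softmax, so the partial results have to be rescaled before they are combined into params.out. The sketch below shows the generic log-sum-exp merge commonly used for split-K attention, run on the CPU for a single head. It is an assumption about the general technique, not a transcription of the TurboMind reduction kernel; Partial and merge_splits are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical per-split partial result for one head; not a TurboMind type.
struct Partial {
    float max_logit;         // M_i: maximum attention logit seen by split i
    float exp_sum;           // L_i: sum of exp(s - M_i) over the split's keys
    std::vector<float> acc;  // O_i: sum of exp(s - M_i) * v, not yet normalized
};

// Merge split-K partials into the final attention output for one head.
inline std::vector<float> merge_splits(const std::vector<Partial>& parts)
{
    assert(!parts.empty());

    // Global maximum across splits keeps every exponent non-positive.
    float global_max = parts[0].max_logit;
    for (const auto& p : parts) {
        global_max = std::max(global_max, p.max_logit);
    }

    std::vector<float> out(parts[0].acc.size(), 0.f);
    float denom = 0.f;
    for (const auto& p : parts) {
        assert(p.acc.size() == out.size());
        // Rescale this split from its local max to the global max.
        const float scale = std::exp(p.max_logit - global_max);
        denom += scale * p.exp_sum;
        for (std::size_t d = 0; d < out.size(); ++d) {
            out[d] += scale * p.acc[d];
        }
    }

    // Normalize once by the combined softmax denominator.
    for (auto& x : out) {
        x /= denom;
    }
    return out;
}
```

Storing unnormalized accumulators together with the per-split maximum and exponent sum lets the merge recover exactly the result a single unsplit softmax pass would produce, up to floating-point rounding.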
Usage Examples
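// Assemble the kernel type from DecodingConfig (here: Sm80 with half precision).
using Config = DecodingConfig<arch::Sm80, half, half, 8, 128>;
// params must be a fully populated Kernel::ParamType (AttentionParams<half> here).
invokeDecoding<Config::Kernel>(params);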