Implementation:InternLM Lmdeploy DecodingTemplate

From Leeroopedia


Knowledge Sources
Domains: GPU_Kernels, Attention
Last Updated: 2026-02-07 15:00 GMT

Overview

Template function that launches decoding-phase attention kernels with GQA-aware CTA partitioning, occupancy-based split-K selection, and optional post-kernel reduction.

Description

invokeDecoding<Kernel> mirrors invokeAttention but is specialized for the decoding phase (single-token generation). It computes the GQA group size (the number of query heads that share one KV head) and, from Kernel::CTA_H, determines how many CTAs are needed to cover each query group. It then queries device occupancy to select a split count, constructs the DecodingCtaMap grid, and launches the kernel. A reduction pass follows when split_cnt > 1 or context parallelism is active.
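The partitioning arithmetic described above can be sketched in plain C++. This is an illustrative sketch only, not the lmdeploy source: the names DecodingGridSketch and plan_decoding_grid are hypothetical, cta_h stands in for Kernel::CTA_H, max_active_ctas stands in for the number of CTAs the device can keep resident (SM count times the occupancy reported per SM), and the "split until the grid roughly fills the device" rule is an assumed heuristic. The actual split selection may weigh other factors.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: integer ceiling division.
constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }

// Hypothetical result type; not part of lmdeploy.
struct DecodingGridSketch {
    int group_size;       // query heads sharing one KV head
    int cta_per_q_group;  // CTAs needed to cover one query group
    int split_cnt;        // split-K factor along the KV sequence
};

// Assumed inputs:
//   cta_h           - query heads handled per CTA (stands in for Kernel::CTA_H)
//   max_active_ctas - CTAs the device can keep resident at once
//   max_split_cnt   - upper bound on the split-K factor
inline DecodingGridSketch plan_decoding_grid(int head_num,
                                             int kv_head_num,
                                             int batch_size,
                                             int cta_h,
                                             int max_active_ctas,
                                             int max_split_cnt)
{
    assert(kv_head_num > 0 && head_num % kv_head_num == 0);
    assert(batch_size > 0 && cta_h > 0 && max_split_cnt >= 1);

    const int group_size      = head_num / kv_head_num;
    const int cta_per_q_group = ceil_div(group_size, cta_h);

    // CTAs launched when the KV sequence is not split at all.
    const int base_ctas = batch_size * kv_head_num * cta_per_q_group;

    // Assumed heuristic: split the KV sequence until the grid roughly
    // fills the device, bounded to [1, max_split_cnt].
    const int wanted    = max_active_ctas / base_ctas;
    const int split_cnt = std::min(std::max(wanted, 1), max_split_cnt);

    return {group_size, cta_per_q_group, split_cnt};
}
```

Under these assumptions, a small batch leaves most of the device idle and is given a large split count, while a large batch already fills the device and keeps split_cnt at 1, in which case no reduction is needed for split-K.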

Usage

Called by the TurboMind decoding dispatch layer to run single-token attention. The Kernel template parameter is assembled from DecodingConfig.

Code Reference

Source Location

Signature

template<class Kernel>
bool invokeDecoding(const typename Kernel::ParamType& params);

Import

#include "src/turbomind/kernels/attention/decoding_template.h"

I/O Contract

Inputs

Name | Type | Required | Description
params | Kernel::ParamType (AttentionParams<T>) | Yes | Fully populated attention parameters struct

Outputs

Name | Type | Description
params.out | T* | Attention output for the decoded token
return | bool | true on success
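When split_cnt > 1, each split sees only part of the KV sequence, so the value written to params.out has to be combined from per-split partial results. The page does not describe how the reduction kernel stores its partials, so the following is only a sketch of the standard technique for merging split softmax results (rescaling by the running maximum). PartialSketch and merge_splits are hypothetical names, and the layout is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical per-split partial result for one query head.
struct PartialSketch {
    float max_logit;         // maximum attention logit seen in this split
    float exp_sum;           // sum of exp(logit - max_logit) over the split
    std::vector<float> acc;  // unnormalized sum of exp(logit - max_logit) * V
};

// Merge partials from all splits into the final normalized output vector.
inline std::vector<float> merge_splits(const std::vector<PartialSketch>& parts)
{
    assert(!parts.empty());

    // Global maximum across splits keeps the exponentials bounded.
    float global_max = parts[0].max_logit;
    for (const auto& p : parts) {
        global_max = std::max(global_max, p.max_logit);
    }

    std::vector<float> out(parts[0].acc.size(), 0.f);
    float denom = 0.f;
    for (const auto& p : parts) {
        assert(p.acc.size() == out.size());
        // Rescale this split from its local maximum to the global one.
        const float scale = std::exp(p.max_logit - global_max);
        denom += scale * p.exp_sum;
        for (std::size_t d = 0; d < out.size(); ++d) {
            out[d] += scale * p.acc[d];
        }
    }

    // Normalize by the combined softmax denominator.
    for (auto& v : out) {
        v /= denom;
    }
    return out;
}
```

The result matches what a single pass over the whole KV sequence would produce, up to floating-point rounding, which is why the split count can be chosen purely for occupancy.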

Usage Examples

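// Assemble the kernel type from DecodingConfig.
using Config = DecodingConfig<arch::Sm80, half, half, 8, 128>;
// params must be a fully populated Kernel::ParamType.
const bool ok = invokeDecoding<Config::Kernel>(params);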
