Implementation:InternLM Lmdeploy AttentionTemplate
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, Attention |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Template function that orchestrates the launch of prefill attention kernels, including shared memory configuration, occupancy-based split-K computation, and optional post-kernel reduction.
Description
invokeAttention<Kernel> is the host-side entry point for launching fused multi-head attention kernels during the prefill phase. It computes the required dynamic shared memory size from Kernel::SharedStorage, queries device occupancy to determine an optimal split count, constructs the CTA map and cache iterator factory, launches the kernel, and conditionally invokes a split-K reduction pass when the workload is distributed across multiple splits or context-parallel ranks.
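The occupancy-based split computation described above can be illustrated with a small host-only sketch. This is not the TurboMind implementation. The function name chooseSplitCount, its parameters, and the "fill the last wave" heuristic are all assumptions chosen for illustration. On a real device, sm_count and ctas_per_sm would typically come from the device properties and from cudaOccupancyMaxActiveBlocksPerMultiprocessor.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper (not part of lmdeploy): pick the smallest split count
// that maximizes how fully the launched CTAs fill whole "waves" on the GPU.
//   base_ctas   - CTAs the grid would contain without any splitting
//   max_splits  - upper bound on the split count
//   sm_count    - number of streaming multiprocessors on the device
//   ctas_per_sm - resident CTAs per SM as reported by an occupancy query
int chooseSplitCount(int base_ctas, int max_splits, int sm_count, int ctas_per_sm)
{
    assert(base_ctas > 0 && sm_count > 0 && ctas_per_sm > 0);

    // Number of CTAs that can be resident on the device at the same time.
    const long long capacity = 1LL * sm_count * ctas_per_sm;

    int    best_split = 1;
    double best_eff   = 0.0;

    for (int s = 1; s <= std::max(1, max_splits); ++s) {
        const long long total = 1LL * base_ctas * s;
        // Waves needed to run all CTAs, rounded up.
        const long long waves = (total + capacity - 1) / capacity;
        // Fraction of the occupied waves that does useful work.
        const double eff = static_cast<double>(total) / static_cast<double>(waves * capacity);
        // Strict improvement only, so ties resolve to the smaller split count
        // and the reduction pass stays as cheap as possible.
        if (eff > best_eff + 1e-9) {
            best_eff   = eff;
            best_split = s;
        }
    }
    return best_split;
}
```

Under these assumptions, a grid of 27 CTAs on a device with 108 SMs and one resident CTA per SM would be split four ways, because 27 × 4 exactly fills one wave. A grid that already fills the device keeps a split count of 1, so no reduction pass is needed.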
Usage
Used by the TurboMind attention dispatch layer to launch prefill attention for a given architecture-specific kernel configuration. The Kernel template parameter is typically an AttentionUniversal specialization assembled from AttentionConfig.
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/attention/attention_template.h
- Lines: 1-103
Signature
template<class Kernel>
void invokeAttention(const typename Kernel::ParamType& params);
Import
#include "src/turbomind/kernels/attention/attention_template.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| params | Kernel::ParamType (AttentionParams<T>) | Yes | Fully populated attention parameters struct, passed by const reference |
Outputs
| Name | Type | Description |
|---|---|---|
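| params.out | T* | Final attention output, written to the buffer that params.out points to (the params struct itself is passed by const reference and is not modified) |
| params.partial_O | float* | Per-split partial outputs, written when the split count is greater than 1 and consumed by the reduction pass |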
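To show why partial outputs need a reduction pass, the following host-only sketch merges per-split softmax partials for a single query and head. It is a generic illustration of the standard log-sum-exp merge, not the TurboMind reduction kernel. The SplitPartial struct, its fields, and the choice to store unnormalized accumulators are assumptions. The actual layout of params.partial_O is not described here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical per-split state for one (query, head) pair.
struct SplitPartial {
    float              max_logit;  // maximum attention logit seen in this split
    float              exp_sum;    // sum of exp(logit - max_logit) over the split
    std::vector<float> acc;        // unnormalized sum of exp(logit - max_logit) * V
};

// Merge the partial results of all splits into the final attention output.
std::vector<float> mergeSplits(const std::vector<SplitPartial>& parts)
{
    assert(!parts.empty());

    // 1. Find the global maximum so every split can be rescaled consistently.
    float global_max = -std::numeric_limits<float>::infinity();
    for (const auto& p : parts) {
        global_max = std::max(global_max, p.max_logit);
    }

    // 2. Rescale each split to the global maximum and accumulate.
    std::vector<float> out(parts[0].acc.size(), 0.f);
    float              denom = 0.f;
    for (const auto& p : parts) {
        assert(p.acc.size() == out.size());
        const float scale = std::exp(p.max_logit - global_max);
        denom += p.exp_sum * scale;
        for (std::size_t d = 0; d < out.size(); ++d) {
            out[d] += p.acc[d] * scale;
        }
    }

    // 3. Normalize by the combined softmax denominator.
    for (float& v : out) {
        v /= denom;
    }
    return out;
}
```

Keeping the accumulators in float, as the float* type of params.partial_O suggests, avoids losing precision when several splits are rescaled and summed before the result is converted to the output type T.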
Usage Examples
// Launch prefill attention for SM80 with block cache
using Config = AttentionConfig<arch::Sm80, half, 128, CacheType::kBlock>;
invokeAttention<Config::Kernel>(params);