Implementation:InternLM Lmdeploy AttentionTemplate

From Leeroopedia
Revision as of 15:13, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/InternLM_Lmdeploy_AttentionTemplate.md)


Knowledge Sources
Domains GPU_Kernels, Attention
Last Updated 2026-02-07 15:00 GMT

Overview

Template function that orchestrates the launch of prefill attention kernels, including shared memory configuration, occupancy-based split-K computation, and optional post-kernel reduction.

Description

invokeAttention<Kernel> is the host-side entry point for launching fused multi-head attention kernels during the prefill phase. It performs the following steps in order: it computes the required dynamic shared memory size from Kernel::SharedStorage, queries device occupancy to choose a split count, constructs the CTA map and the cache iterator factory, and launches the kernel. When the workload is distributed across multiple splits or context-parallel ranks, it then invokes a split-K reduction pass to combine the partial results.
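The occupancy-based split count mentioned above can be illustrated with a small host-side sketch. This is not the TurboMind heuristic: the function name chooseSplitCount, its parameters, and the "fill one wave of CTAs" rule are all assumptions made for illustration. In a real launcher the ctas_per_sm value would typically come from an occupancy query such as cudaOccupancyMaxActiveBlocksPerMultiprocessor.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper, NOT the lmdeploy implementation.
// tiles       : CTAs the launch needs without any splitting
// sm_count    : number of SMs on the device
// ctas_per_sm : resident CTAs per SM reported by an occupancy query
// max_split   : upper bound on the split count (e.g. number of KV tiles)
int chooseSplitCount(int tiles, int sm_count, int ctas_per_sm, int max_split)
{
    assert(tiles > 0 && sm_count > 0 && ctas_per_sm > 0 && max_split > 0);

    // Number of CTAs that can be resident on the device at once (one wave).
    const int capacity = sm_count * ctas_per_sm;

    // The unsplit launch already fills the device: splitting adds only overhead.
    if (tiles >= capacity) {
        return 1;
    }

    // Largest split count for which tiles * split still fits in one wave,
    // clamped to the range [1, max_split].
    const int split = capacity / tiles;
    return std::max(1, std::min(split, max_split));
}
```

For example, with 108 SMs, 2 resident CTAs per SM, and 50 unsplit tiles, this sketch returns 4, because 4 * 50 = 200 CTAs still fit in the 216-CTA wave. A split count greater than 1 is what makes the later reduction pass necessary.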

Usage

Used by the TurboMind attention dispatch layer to launch prefill attention for a given architecture-specific kernel configuration. The Kernel template parameter is typically an AttentionUniversal specialization assembled from AttentionConfig.

Code Reference

Source Location

Signature

template<class Kernel>
void invokeAttention(const typename Kernel::ParamType& params);

Import

#include "src/turbomind/kernels/attention/attention_template.h"

I/O Contract

Inputs

Name Type Required Description
params Kernel::ParamType (AttentionParams<T>) Yes Fully populated attention parameters struct

Outputs

Name Type Description
params.out T* Final attention output, written through the pointer held in the params struct
params.partial_O float* Partial outputs, written when the split count is greater than 1 and consumed by the reduction pass
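To show why partial outputs need a reduction pass, the following sketch merges per-split attention results for a single query row using the standard log-sum-exp rescaling. It is a CPU illustration under stated assumptions, not the TurboMind reduction kernel: the Partial struct, its field layout, and the name mergePartials are hypothetical, and each split is assumed to store an unnormalized output together with its running score maximum and exponential sum.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical per-split result for one query row (layout is an assumption).
struct Partial {
    std::vector<float> o;  // unnormalized output: sum_j exp(s_j - m) * v_j
    float m;               // maximum score seen in this split
    float l;               // sum_j exp(s_j - m) over this split
};

// Merge the splits into the final softmax-weighted output.
std::vector<float> mergePartials(const std::vector<Partial>& parts)
{
    assert(!parts.empty());
    const std::size_t dim = parts[0].o.size();
    std::vector<float> out(dim, 0.0f);

    // Global maximum across all splits keeps the exponentials bounded.
    float m = -std::numeric_limits<float>::infinity();
    for (const Partial& p : parts) {
        m = std::max(m, p.m);
    }
    if (std::isinf(m)) {
        return out;  // every split was empty; return zeros
    }

    float l = 0.0f;
    for (const Partial& p : parts) {
        assert(p.o.size() == dim);
        // Rescale this split from its local maximum to the global one.
        const float scale = std::exp(p.m - m);
        l += scale * p.l;
        for (std::size_t d = 0; d < dim; ++d) {
            out[d] += scale * p.o[d];
        }
    }

    // Normalize once by the combined exponential sum.
    for (float& x : out) {
        x /= l;
    }
    return out;
}
```

Merging two single-key splits with scores 0 and 1 and values 1 and 3 yields the same result as a softmax over both keys at once, which is the property the reduction pass relies on.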

Usage Examples

// Launch prefill attention for SM80 with block cache
using Config = AttentionConfig<arch::Sm80, half, 128, CacheType::kBlock>;
invokeAttention<Config::Kernel>(params);
