Implementation:InternLM Lmdeploy AttentionMainloopSm80
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, Attention |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Ampere (SM80) mainloop for attention using cp.async-based multi-stage pipelining, with specialized 2-stage variants that interleave K and V prefetching with QK and PV computation.
Description
Mainloop<Sm80_CpAsync<Stages>, Impl_> implements the attention mainloop for SM80+ GPUs using hardware-accelerated asynchronous global-to-shared-memory copies (cp.async) with pipeline commit/wait primitives. The generic N-stage variant pre-fills N-2 pipeline stages then enters a steady-state loop alternating between QK and PV computation phases, with interleaved prefetching of the next KV tile. Two specialized 2-stage variants are provided: one for HeadDim=192 (which uses a simpler alternating K/V pattern) and a default that overlaps the next K prefetch with the current PV computation for better latency hiding. The Prefetch helper batches iterator prefetch calls across kBatchK/kBatchV S-iterations and manages pipeline commits and block iterator advancement. Causal masking follows the same three-phase pattern as the SM70 mainloop.
Usage
Selected by AttentionConfig and DecodingConfig for SM80 (Ampere) and SM90 (Hopper) architectures. The Stages parameter is typically 2 for prefill (linear cache), 3 for prefill (block cache), or 5 for quantized decoding.
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/attention/mainloop_sm80.h
- Lines: 1-463
Signature
namespace turbomind::attention {
template<int Stages>
struct Sm80_CpAsync {};
template<int Stages, class Impl_>
struct Mainloop<Sm80_CpAsync<Stages>, Impl_> {
using Impl = Impl_;
using T = typename Impl::T;
using Tkv = typename Impl::Tkv;
static constexpr int CTA_S = Impl::CTA_S;
using GmemIterK = ...; // conditional on quantization
using GmemIterV = ...;
using SharedStorage = typename Impl::SharedStorage;
__device__ void SetCpInfo(int cp_size, int cp_rank);
// Generic N-stage run
template<int head_dim, class CacheIter, class StoreS, int Stages_>
__device__ void Run(Sm80_CpAsync<Stages_>, std::integral_constant<int, head_dim>,
FragQ&, CacheIter&, FragO&, FragM&, FragL&,
int offset_Q, int offset_K, int max_step,
int tile_iter, int mask_iter_back, int mask_iter_front,
int window_size, float qk_scale,
SharedStorage&, const StoreS&);
// Specialized 2-stage run (interleaved K/V prefetch)
template<int head_dim, class CacheIter, class StoreS>
__device__ void Run(Sm80_CpAsync<2>, std::integral_constant<int, head_dim>, ...);
// Specialized 2-stage run for HeadDim=192
template<class CacheIter, class StoreS>
__device__ void Run(Sm80_CpAsync<2>, std::integral_constant<int, 192>, ...);
__device__ void Wait();
__device__ void ApplyCasualMask(FragS&, int offset_Q, int offset_K, int window_size);
};
} // namespace turbomind::attention
Import
#include "src/turbomind/kernels/attention/mainloop_sm80.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| Stages | int | Yes | Number of pipeline stages (2, 3, or 5) |
| frag_Q | FragQ | Yes | Query fragment in registers |
| cache_iter | CacheIter | Yes | KV cache tile iterator (block or linear) |
| offset_Q | int | Yes | Query offset for causal masking |
| offset_K | int | Yes | Starting K offset |
| max_step | int | Yes | Total valid K steps for boundary predication |
| tile_iter | int | Yes | Number of tiles to process |
| qk_scale | float | Yes | log2(e) / sqrt(d) softmax scale |
Outputs
| Name | Type | Description |
|---|---|---|
| frag_O | FragO | Accumulated output fragment |
| frag_M | FragM | Running per-row maximum |
| frag_L | FragL | Running per-row sum |
Usage Examples
// 3-stage Ampere mainloop for block cache prefill
using MyMainloop = Mainloop<Sm80_CpAsync<3>,
Impl<MMA_16816, half, half, 1, 64, 64, 1, 16, 64, 128, 3>>;
// 5-stage mainloop for INT8 quantized decoding
using DecodingMainloop = Mainloop<Sm80_CpAsync<5>,
Impl<MMA_81616, half, uint8_t, 8, 1, 64, 8, 1, 16, 128, 5>>;