Implementation:InternLM Lmdeploy AttentionMainloopSm70
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, Attention |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Volta/Turing mainloop for attention that uses synchronous global loads (no cp.async) with a load-compute-store pipeline, processing KV tiles in three phases: masked residue, unmasked, and window-masked.
Description
Mainloop<arch::Sm70, Impl_> implements the attention mainloop for SM70 (Volta) and SM75 (Turing) GPUs which lack the cp.async instruction. KV data is loaded into register fragments via GmemIterK::Load/GmemIterV::Load, stored to shared memory via Save, then consumed by the Impl's ComputeQK and ComputePV. The loop processes tiles in three phases: (1) residue tiles at the back with causal masking, (2) full unmasked tiles in the middle, and (3) window-masked tiles at the front. For quantized KV, it uses CombinedIterator to pair data and parameter loading. The causal mask is applied via ApplyCasualMask which supports context parallelism by adjusting offsets with cp_size/cp_rank.
Usage
Selected by AttentionConfig and DecodingConfig for SM70 and SM75 architectures.
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/attention/mainloop_sm70.h
- Lines: 1-149
Signature
namespace turbomind::attention {
template<class Impl_>
struct Mainloop<arch::Sm70, Impl_> {
using Impl = Impl_;
using T = typename Impl::T;
using Tkv = typename Impl::Tkv;
using GmemIterK = ...; // conditional on quantization
using GmemIterV = ...;
using SharedStorage = typename Impl::SharedStorage;
static constexpr int CTA_S = Impl::CTA_S;
__device__ void SetCpInfo(int cp_size, int cp_rank);
template<class CacheIter, class StoreS>
__device__ void operator()(
FragQ& frag_Q, CacheIter& cache_iter,
FragO& frag_O, FragM& frag_M, FragL& frag_L,
int offset_Q, int offset_K, int max_step,
int tile_iter, int mask_iter_back, int mask_iter_front,
int window_size, float qk_scale,
SharedStorage& storage, const StoreS& store_S);
__device__ void ApplyCasualMask(FragS& frag_S, int offset_Q, int offset_K, int window_size);
};
} // namespace turbomind::attention
Import
#include "src/turbomind/kernels/attention/mainloop_sm70.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| frag_Q | FragQ | Yes | Query fragment in registers |
| cache_iter | CacheIter | Yes | KV cache tile iterator |
| offset_Q | int | Yes | Query position offset for causal masking |
| offset_K | int | Yes | Starting K position (decremented per tile) |
| tile_iter | int | Yes | Number of tiles to process |
| mask_iter_back | int | Yes | Number of back-masked (residue) tiles |
| mask_iter_front | int | Yes | Number of front-masked (window) tiles |
| window_size | int | Yes | Sliding window attention size |
| qk_scale | float | Yes | Softmax scaling factor (log2e / sqrt(d)) |
Outputs
| Name | Type | Description |
|---|---|---|
| frag_O | FragO | Accumulated attention output |
| frag_M | FragM | Running maximum for online softmax |
| frag_L | FragL | Running sum for online softmax |
Usage Examples
// Instantiated as part of kernel type:
using MyMainloop = Mainloop<arch::Sm70, Impl<MMA_884, half, half, 1, 64, 64, 1, 16, 64, 128, 2>>;
// Called by AttentionUniversal::operator()