Implementation:InternLM Lmdeploy AttentionMainloopSm80

Knowledge Sources	InternLM_Lmdeploy
Domains	GPU_Kernels, Attention
Last Updated	2026-02-07 15:00 GMT

Overview

Ampere (SM80) mainloop for attention using cp.async-based multi-stage pipelining, with specialized 2-stage variants that interleave K and V prefetching with QK and PV computation.

Description

Mainloop<Sm80_CpAsync<Stages>, Impl_> implements the attention mainloop for SM80+ GPUs using hardware-accelerated asynchronous global-to-shared-memory copies (cp.async) with pipeline commit/wait primitives. The generic N-stage variant pre-fills N-2 pipeline stages then enters a steady-state loop alternating between QK and PV computation phases, with interleaved prefetching of the next KV tile. Two specialized 2-stage variants are provided: one for HeadDim=192 (which uses a simpler alternating K/V pattern) and a default that overlaps the next K prefetch with the current PV computation for better latency hiding. The Prefetch helper batches iterator prefetch calls across kBatchK/kBatchV S-iterations and manages pipeline commits and block iterator advancement. Causal masking follows the same three-phase pattern as the SM70 mainloop.

Usage

Selected by AttentionConfig and DecodingConfig for SM80 (Ampere) and SM90 (Hopper) architectures. The Stages parameter is typically 2 for prefill (linear cache), 3 for prefill (block cache), or 5 for quantized decoding.

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/kernels/attention/mainloop_sm80.h
Lines: 1-463

Signature

namespace turbomind::attention {

template<int Stages>
struct Sm80_CpAsync {};

template<int Stages, class Impl_>
struct Mainloop<Sm80_CpAsync<Stages>, Impl_> {
    using Impl = Impl_;
    using T   = typename Impl::T;
    using Tkv = typename Impl::Tkv;
    static constexpr int CTA_S = Impl::CTA_S;

    using GmemIterK = ...; // conditional on quantization
    using GmemIterV = ...;
    using SharedStorage = typename Impl::SharedStorage;

    __device__ void SetCpInfo(int cp_size, int cp_rank);

    // Generic N-stage run
    template<int head_dim, class CacheIter, class StoreS, int Stages_>
    __device__ void Run(Sm80_CpAsync<Stages_>, std::integral_constant<int, head_dim>,
        FragQ&, CacheIter&, FragO&, FragM&, FragL&,
        int offset_Q, int offset_K, int max_step,
        int tile_iter, int mask_iter_back, int mask_iter_front,
        int window_size, float qk_scale,
        SharedStorage&, const StoreS&);

    // Specialized 2-stage run (interleaved K/V prefetch)
    template<int head_dim, class CacheIter, class StoreS>
    __device__ void Run(Sm80_CpAsync<2>, std::integral_constant<int, head_dim>, ...);

    // Specialized 2-stage run for HeadDim=192
    template<class CacheIter, class StoreS>
    __device__ void Run(Sm80_CpAsync<2>, std::integral_constant<int, 192>, ...);

    __device__ void Wait();
    __device__ void ApplyCasualMask(FragS&, int offset_Q, int offset_K, int window_size);
};

} // namespace turbomind::attention

Import

#include "src/turbomind/kernels/attention/mainloop_sm80.h"

I/O Contract

Inputs

Name	Type	Required	Description
Stages	int	Yes	Number of pipeline stages (2, 3, or 5)
frag_Q	FragQ	Yes	Query fragment in registers
cache_iter	CacheIter	Yes	KV cache tile iterator (block or linear)
offset_Q	int	Yes	Query offset for causal masking
offset_K	int	Yes	Starting K offset
max_step	int	Yes	Total valid K steps for boundary predication
tile_iter	int	Yes	Number of tiles to process
qk_scale	float	Yes	log2(e) / sqrt(d) softmax scale

Outputs

Name	Type	Description
frag_O	FragO	Accumulated output fragment
frag_M	FragM	Running per-row maximum
frag_L	FragL	Running per-row sum

Usage Examples

// 3-stage Ampere mainloop for block cache prefill
using MyMainloop = Mainloop<Sm80_CpAsync<3>,
    Impl<MMA_16816, half, half, 1, 64, 64, 1, 16, 64, 128, 3>>;

// 5-stage mainloop for INT8 quantized decoding
using DecodingMainloop = Mainloop<Sm80_CpAsync<5>,
    Impl<MMA_81616, half, uint8_t, 8, 1, 64, 8, 1, 16, 128, 5>>;

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment