Implementation:InternLM Lmdeploy AttentionMainloopSm70

Knowledge Sources	InternLM_Lmdeploy
Domains	GPU_Kernels, Attention
Last Updated	2026-02-07 15:00 GMT

Overview

Volta/Turing mainloop for attention that uses synchronous global loads (no cp.async) with a load-compute-store pipeline, processing KV tiles in three phases: masked residue, unmasked, and window-masked.

Description

Mainloop<arch::Sm70, Impl_> implements the attention mainloop for SM70 (Volta) and SM75 (Turing) GPUs which lack the cp.async instruction. KV data is loaded into register fragments via GmemIterK::Load/GmemIterV::Load, stored to shared memory via Save, then consumed by the Impl's ComputeQK and ComputePV. The loop processes tiles in three phases: (1) residue tiles at the back with causal masking, (2) full unmasked tiles in the middle, and (3) window-masked tiles at the front. For quantized KV, it uses CombinedIterator to pair data and parameter loading. The causal mask is applied via ApplyCasualMask which supports context parallelism by adjusting offsets with cp_size/cp_rank.

Usage

Selected by AttentionConfig and DecodingConfig for SM70 and SM75 architectures.

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/kernels/attention/mainloop_sm70.h
Lines: 1-149

Signature

namespace turbomind::attention {

template<class Impl_>
struct Mainloop<arch::Sm70, Impl_> {
    using Impl = Impl_;
    using T   = typename Impl::T;
    using Tkv = typename Impl::Tkv;

    using GmemIterK = ...; // conditional on quantization
    using GmemIterV = ...;
    using SharedStorage = typename Impl::SharedStorage;

    static constexpr int CTA_S = Impl::CTA_S;

    __device__ void SetCpInfo(int cp_size, int cp_rank);

    template<class CacheIter, class StoreS>
    __device__ void operator()(
        FragQ& frag_Q, CacheIter& cache_iter,
        FragO& frag_O, FragM& frag_M, FragL& frag_L,
        int offset_Q, int offset_K, int max_step,
        int tile_iter, int mask_iter_back, int mask_iter_front,
        int window_size, float qk_scale,
        SharedStorage& storage, const StoreS& store_S);

    __device__ void ApplyCasualMask(FragS& frag_S, int offset_Q, int offset_K, int window_size);
};

} // namespace turbomind::attention

Import

#include "src/turbomind/kernels/attention/mainloop_sm70.h"

I/O Contract

Inputs

Name	Type	Required	Description
frag_Q	FragQ	Yes	Query fragment in registers
cache_iter	CacheIter	Yes	KV cache tile iterator
offset_Q	int	Yes	Query position offset for causal masking
offset_K	int	Yes	Starting K position (decremented per tile)
tile_iter	int	Yes	Number of tiles to process
mask_iter_back	int	Yes	Number of back-masked (residue) tiles
mask_iter_front	int	Yes	Number of front-masked (window) tiles
window_size	int	Yes	Sliding window attention size
qk_scale	float	Yes	Softmax scaling factor (log2e / sqrt(d))

Outputs

Name	Type	Description
frag_O	FragO	Accumulated attention output
frag_M	FragM	Running maximum for online softmax
frag_L	FragL	Running sum for online softmax

Usage Examples

// Instantiated as part of kernel type:
using MyMainloop = Mainloop<arch::Sm70, Impl<MMA_884, half, half, 1, 64, 64, 1, 16, 64, 128, 2>>;
// Called by AttentionUniversal::operator()

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment