Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:InternLM Lmdeploy AttentionMainloopSm70

From Leeroopedia


Knowledge Sources
Domains GPU_Kernels, Attention
Last Updated 2026-02-07 15:00 GMT

Overview

Volta/Turing mainloop for attention that uses synchronous global loads (no cp.async) with a load-compute-store pipeline, processing KV tiles in three phases: masked residue, unmasked, and window-masked.

Description

Mainloop<arch::Sm70, Impl_> implements the attention mainloop for SM70 (Volta) and SM75 (Turing) GPUs which lack the cp.async instruction. KV data is loaded into register fragments via GmemIterK::Load/GmemIterV::Load, stored to shared memory via Save, then consumed by the Impl's ComputeQK and ComputePV. The loop processes tiles in three phases: (1) residue tiles at the back with causal masking, (2) full unmasked tiles in the middle, and (3) window-masked tiles at the front. For quantized KV, it uses CombinedIterator to pair data and parameter loading. The causal mask is applied via ApplyCasualMask which supports context parallelism by adjusting offsets with cp_size/cp_rank.

Usage

Selected by AttentionConfig and DecodingConfig for SM70 and SM75 architectures.

Code Reference

Source Location

Signature

namespace turbomind::attention {

template<class Impl_>
struct Mainloop<arch::Sm70, Impl_> {
    using Impl = Impl_;
    using T   = typename Impl::T;
    using Tkv = typename Impl::Tkv;

    using GmemIterK = ...; // conditional on quantization
    using GmemIterV = ...;
    using SharedStorage = typename Impl::SharedStorage;

    static constexpr int CTA_S = Impl::CTA_S;

    __device__ void SetCpInfo(int cp_size, int cp_rank);

    template<class CacheIter, class StoreS>
    __device__ void operator()(
        FragQ& frag_Q, CacheIter& cache_iter,
        FragO& frag_O, FragM& frag_M, FragL& frag_L,
        int offset_Q, int offset_K, int max_step,
        int tile_iter, int mask_iter_back, int mask_iter_front,
        int window_size, float qk_scale,
        SharedStorage& storage, const StoreS& store_S);

    __device__ void ApplyCasualMask(FragS& frag_S, int offset_Q, int offset_K, int window_size);
};

} // namespace turbomind::attention

Import

#include "src/turbomind/kernels/attention/mainloop_sm70.h"

I/O Contract

Inputs

Name Type Required Description
frag_Q FragQ Yes Query fragment in registers
cache_iter CacheIter Yes KV cache tile iterator
offset_Q int Yes Query position offset for causal masking
offset_K int Yes Starting K position (decremented per tile)
tile_iter int Yes Number of tiles to process
mask_iter_back int Yes Number of back-masked (residue) tiles
mask_iter_front int Yes Number of front-masked (window) tiles
window_size int Yes Sliding window attention size
qk_scale float Yes Softmax scaling factor (log2e / sqrt(d))

Outputs

Name Type Description
frag_O FragO Accumulated attention output
frag_M FragM Running maximum for online softmax
frag_L FragL Running sum for online softmax

Usage Examples

// Instantiated as part of kernel type:
using MyMainloop = Mainloop<arch::Sm70, Impl<MMA_884, half, half, 1, 64, 64, 1, 16, 64, 128, 2>>;
// Called by AttentionUniversal::operator()

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment