Implementation:InternLM Lmdeploy AttentionIterator

Knowledge Sources	InternLM_Lmdeploy
Domains	GPU_Kernels, Attention
Last Updated	2026-02-07 15:00 GMT

Overview

Base iterator classes for loading KV cache tiles from global memory to shared memory, including thread-map-based offset computation and a combiner for pairing data and quantization parameter iterators.

Description

BaseGmemIterator establishes the common pattern for global-to-shared-memory KV tile loading. It uses a Map type to compute per-thread (offset_s, offset_c) coordinates and provides SetSmem to bind a shared memory target and ClearSmem to zero-initialize shared memory tiles. BaseSmemIterator wraps a shared memory pointer with a layout accessor. CombinedIterator composes two iterators (typically data + quantization parameters) into a single unit with unified Prefetch, Load, Save, and ClearSmem interfaces.

Usage

BaseGmemIterator is the base for architecture-specific iterators (Sm70GmemIterator, Sm80GmemIterator). CombinedIterator is used when KV quantization is enabled to pair data and parameter loading.

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/kernels/attention/iterator.h
Lines: 1-127

Signature

template<class T, class Map, class SmemLayout>
struct BaseGmemIterator {
    using ElementType = T;
    using AccessType  = Array<T, Map::kAccessC>;
    static constexpr int kIterCount = Map::kIterS * Map::kIterC;
    using Fragment = Array<T, Map::kAccessC>[Map::kIterS][Map::kIterC];

    __device__ BaseGmemIterator();
    __device__ void SetSmem(Pointer smem);
    __device__ void ClearSmem(int pipe_iter = 0);
};

template<class T, class Layout>
struct BaseSmemIterator {
    __device__ explicit BaseSmemIterator(T* smem);
};

template<class Iterator0, class Iterator1>
struct CombinedIterator {
    struct Fragment { ... };
    template<typename... Args>
    __device__ void Prefetch(Args... args);
    template<bool is_residue, class CacheIter>
    __device__ void Load(const CacheIter&, Fragment&, int max_s);
    __device__ void Save(const Fragment&);
    __device__ void ClearSmem(int pipe_iter = 0);
    template<class P0, class P1>
    __device__ void SetSmem(P0, P1);
};

Import

#include "src/turbomind/kernels/attention/iterator.h"

I/O Contract

Inputs

Name	Type	Required	Description
T	typename	Yes	Element type for the iterator
Map	typename	Yes	Thread mapping type (e.g., RakedThreadMap)
SmemLayout	typename	Yes	Shared memory layout with swizzle

Outputs

Name	Type	Description
Fragment	Array<T,...>[IterS][IterC]	Loaded KV tile data in registers
smem_	Pointer	Shared memory region written during Save/ClearSmem

Usage Examples

// Composed into architecture-specific iterators:
using GmemIterK = Sm80GmemIterator<half, ThreadMapKV, SmemLayoutK, 0>;
using GmemIterV = Sm80GmemIterator<half, ThreadMapKV, SmemLayoutV, 1>;
// With quantization:
using CombinedK = CombinedIterator<GmemIterK, Sm80GmemIterator<half, ThreadMapKVp, SmemLayoutKVp, 2>>;

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment