Implementation:InternLM Lmdeploy AttentionBlock

Knowledge Sources	InternLM_Lmdeploy
Domains	GPU_Kernels, Attention
Last Updated	2026-02-07 15:00 GMT

Overview

Defines the paged (blocked) KV cache memory layout, including block configuration, per-head data/parameter accessors, and byte-offset computation for quantized and full-precision cache storage.

Description

This header implements the paged KV cache block abstraction used by TurboMind. The block::Config struct fixes the head dimension at compile time (HeadDim) and carries the KV head count, the block length, and the quantization bit widths (t_bits, q_bits). The block::Layout struct computes byte offsets for K/V data and quantization parameters within a memory block, following an L(H2SDQ+H2S2T) layout scheme (Layer, Head, Sequence, Data/Quantization). The block::Head class is bound to a (layer_id, head_id) pair at construction; its typed accessors (k_data, v_data, k_param, v_param) then resolve a timestep to a pointer within a given block. Sub-byte types (uint4_t) are supported through SubBytePtr.
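The offset arithmetic described above can be illustrated with a minimal, host-only sketch. It is not the code in block.h: SketchLayout, its member names, and the exact ordering of regions (all K/V data of a layer first, then that layer's quantization parameters, with two parameter values per token) are assumptions chosen to match one plausible reading of the L(H2SDQ+H2S2T) label.

```cpp
#include <cassert>

// Hypothetical stand-in for block::Layout. Member names and the region
// ordering are assumptions made for illustration only.
struct SketchLayout {
    int head_num;   // KV heads per layer
    int block_len;  // tokens per block
    int head_dim;   // elements per head
    int q_bits;     // bits per stored K/V element (e.g. 8 or 4)
    int t_bits;     // bits per quantization parameter (0 if none are stored)

    // Bytes of K (or V) data for one token of one head.
    constexpr int token_data_size() const { return q_bits * head_dim / 8; }
    // Bytes of parameters for one token of one head (assumed: 2 values).
    constexpr int token_param_size() const { return t_bits * 2 / 8; }
    constexpr int head_data_size() const { return block_len * token_data_size(); }
    constexpr int head_param_size() const { return block_len * token_param_size(); }
    // Per layer: [head][K|V][token] data, followed by [head][K|V][token] params.
    constexpr int layer_data_size() const { return head_num * 2 * head_data_size(); }
    constexpr int layer_size() const { return layer_data_size() + head_num * 2 * head_param_size(); }

    constexpr int k_data(int layer, int head, int token) const {
        return layer * layer_size() + head * 2 * head_data_size() + token * token_data_size();
    }
    constexpr int v_data(int layer, int head, int token) const {
        return k_data(layer, head, token) + head_data_size();
    }
    constexpr int k_param(int layer, int head, int token) const {
        return layer * layer_size() + layer_data_size() + head * 2 * head_param_size()
               + token * token_param_size();
    }
    constexpr int v_param(int layer, int head, int token) const {
        return k_param(layer, head, token) + head_param_size();
    }
    constexpr int block_size(int layer_num) const { return layer_size() * layer_num; }
};

// Worked example: 8 KV heads, 64 tokens per block, head_dim 128,
// 8-bit data and 16-bit parameters.
inline void check_example() {
    constexpr SketchLayout l{8, 64, 128, 8, 16};
    assert(l.token_data_size() == 128);   // 8 bits * 128 elements / 8
    assert(l.v_data(0, 0, 0) == 8192);    // V follows the 64 K tokens of the head
    assert(l.k_param(0, 0, 0) == 131072); // params start after all data of the layer
    assert(l.block_size(2) == 270336);    // two layers of 135168 bytes each
}
```

Under these assumptions, the tokens of one head are contiguous in memory, so a kernel walking a single head within a block touches one dense run of bytes per K or V region.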

Usage

Used internally by BlockIterator and BlockIteratorFactory to navigate the paged KV cache during attention kernel execution.
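Navigating a paged cache generally means splitting a sequence-level timestep into a block index and an offset inside that block. The sketch below shows that split in isolation. BlockCoord, to_block_coord, and locate_block are hypothetical names, and the division/modulo mapping is an assumption about how a block iterator might resolve a timestep, not a transcription of BlockIterator.

```cpp
#include <cassert>

// Hypothetical coordinate pair: which block, and which token slot inside it.
struct BlockCoord {
    int block_id;
    int block_ti;
};

// Split a sequence-level timestep into (block index, in-block offset),
// assuming every block holds exactly block_len tokens.
inline BlockCoord to_block_coord(int ti, int block_len) {
    assert(ti >= 0 && block_len > 0);
    return BlockCoord{ti / block_len, ti % block_len};
}

// Pick the block that holds timestep ti from an array of block pointers and
// report the in-block offset through block_ti.
inline char* locate_block(char* const* block_ptrs, int ti, int block_len, int* block_ti) {
    const BlockCoord c = to_block_coord(ti, block_len);
    *block_ti = c.block_ti;
    return block_ptrs[c.block_id];
}
```

With block_len = 64, timestep 130 would land in the third block (index 2) at in-block offset 2; the returned pointer and offset are what an accessor such as k_data(block, ti) would then consume.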

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/kernels/attention/block.h
Lines: 1-237

Signature

namespace turbomind::block {

template<class T, class Tkv, int HeadDim>
struct Config {
    int head_num_;
    int block_len_;
    TM_HOST_DEVICE constexpr int t_bits() const;
    TM_HOST_DEVICE constexpr int q_bits() const;
    TM_HOST_DEVICE constexpr int head_dim() const;
    TM_HOST_DEVICE int head_num() const;
    TM_HOST_DEVICE constexpr int block_len() const;
};

template<class T, class Tkv, class Layout>
class Head {
public:
    TM_HOST_DEVICE Head(Layout layout, int layer_id, int head_id);
    TM_HOST_DEVICE auto k_data(char* block, int ti) const;
    TM_HOST_DEVICE auto v_data(char* block, int ti) const;
    TM_HOST_DEVICE T* k_param(char* block, int ti) const;
    TM_HOST_DEVICE T* v_param(char* block, int ti) const;
    template<class Func>
    TM_HOST_DEVICE auto with(char** block_ptrs, int ti, Func&& func) const;
};

template<class Config_>
struct Layout {
    TM_HOST_DEVICE int k_data(int layer, int head, int token) const;
    TM_HOST_DEVICE int v_data(int layer, int head, int token) const;
    TM_HOST_DEVICE int k_param(int layer, int head, int token) const;
    TM_HOST_DEVICE int v_param(int layer, int head, int token) const;
    TM_HOST_DEVICE int block_size(int layer_num) const;
};

} // namespace turbomind::block

Import

#include "src/turbomind/kernels/attention/block.h"

I/O Contract

Inputs

Name	Type	Required	Description
head_num_	int	Yes	Number of KV heads
block_len_	int	Yes	Number of tokens per cache block
layer_id	int	Yes	Transformer layer index
head_id	int	Yes	KV head index
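block	char*	Yes	Pointer to a single cache block (k_data, v_data, k_param, v_param)
ti	int	Yes	Timestep (token index) to resolve
block_ptrs	char**	Yes	Array of pointers to cache blocks (used by with)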

Outputs

Name	Type	Description
k_data / v_data	Tkv* or SubBytePtr<Tkv>	Typed pointer to key/value data within a block
k_param / v_param	T*	Pointer to quantization scale/zero-point parameters
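A plain Tkv* cannot address a 4-bit element, which is why the data accessors may return a wrapper type for sub-byte storage. The following sketch shows the general idea with a hypothetical NibblePtr class. It is not the SubBytePtr implementation, and the packing order (even index in the low nibble) is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 4-bit element accessor: two elements share one byte.
// Assumed packing: even element index -> low nibble, odd -> high nibble.
class NibblePtr {
public:
    explicit NibblePtr(std::uint8_t* base, int index = 0): base_(base), index_(index) {}

    // Advance by n elements (not bytes).
    NibblePtr operator+(int n) const { return NibblePtr(base_, index_ + n); }

    // Read the 4-bit value at the current element index.
    std::uint8_t load() const {
        const std::uint8_t byte = base_[index_ / 2];
        return static_cast<std::uint8_t>((index_ % 2 == 0) ? (byte & 0x0F) : (byte >> 4));
    }

    // Write the low 4 bits of value, leaving the neighbouring nibble intact.
    void store(std::uint8_t value) const {
        std::uint8_t& byte = base_[index_ / 2];
        if (index_ % 2 == 0) {
            byte = static_cast<std::uint8_t>((byte & 0xF0) | (value & 0x0F));
        }
        else {
            byte = static_cast<std::uint8_t>((byte & 0x0F) | ((value & 0x0F) << 4));
        }
    }

private:
    std::uint8_t* base_;
    int           index_;  // element index relative to base_
};
```

Keeping the element index separate from the byte pointer lets pointer arithmetic stay in units of elements, so offset code written for 8-bit or 16-bit caches can be reused unchanged for 4-bit storage.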

Usage Examples

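// Assumes an enclosing `using namespace turbomind;`. layer_id, kv_head_idx,
// block_ptr (char*) and timestep are supplied by the caller.
using Cfg = block::Config<half, uint8_t, 128>;  // T = half, Tkv = uint8_t, HeadDim = 128
block::Layout<Cfg> layout{Cfg{8, 64}};          // head_num_ = 8, block_len_ = 64
block::Head<half, uint8_t, decltype(layout)> head{layout, layer_id, kv_head_idx};
auto k_ptr = head.k_data(block_ptr, timestep);  // key data for this head at timestep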

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime
