Implementation:InternLM Lmdeploy AttentionBlock
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, Attention |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Defines the paged (blocked) KV cache memory layout, including block configuration, per-head data/parameter accessors, and byte-offset computation for quantized and full-precision cache storage.
Description
This header implements the paged KV cache block abstraction used by TurboMind. The block::Config struct captures the per-head dimension, the number of KV heads, the block length (tokens per block), and the bit widths reported by t_bits() and q_bits(). The block::Layout struct computes byte offsets for K/V data and quantization parameters within a memory block, following the L(H2SDQ+H2S2T) layout scheme (Layer, Head, Sequence, Data/Quantization). The block::Head class binds a layout to a (layer_id, head_id) pair and provides typed accessors (k_data, v_data, k_param, v_param) that resolve a timestep to a pointer within a block. Sub-byte types such as uint4_t are supported through SubBytePtr.
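To make the offset arithmetic concrete, the following host-only sketch computes byte offsets for a simplified paged layout. It is an illustration under stated assumptions, not the TurboMind implementation: `ToyLayout`, its field names, the two-parameters-per-token rule, and the ordering (all K/V data of a layer first, then all parameters) are hypothetical, and the real block::Layout may order or pad things differently.

```cpp
#include <cassert>

// Hypothetical stand-in for block::Layout; names and ordering are assumptions.
struct ToyLayout {
    int head_num;   // KV heads per layer
    int block_len;  // tokens per block
    int head_dim;   // elements per head
    int q_bits;     // bits per stored K/V element (e.g. 4, 8, 16)
    int t_bits;     // bits per quantization parameter, 0 if unquantized

    // Bytes of K (or V) data for one token of one head.
    int token_data_size() const { return q_bits * head_dim / 8; }
    // Bytes of parameters for one token of one head (assumed: scale + zero point).
    int token_param_size() const { return t_bits * 2 / 8; }
    int head_data_size() const { return block_len * token_data_size(); }
    int head_param_size() const { return block_len * token_param_size(); }
    // A layer holds K and V data for every head, then K and V parameters.
    int layer_data_size() const { return head_num * 2 * head_data_size(); }
    int layer_size() const { return layer_data_size() + head_num * 2 * head_param_size(); }

    int k_data(int layer, int head, int token) const
    {
        assert(token >= 0 && token < block_len);
        return layer * layer_size() + head * 2 * head_data_size() + token * token_data_size();
    }
    int v_data(int layer, int head, int token) const
    {
        // V data of a head follows that head's K data.
        return k_data(layer, head, token) + head_data_size();
    }
    int k_param(int layer, int head, int token) const
    {
        assert(token >= 0 && token < block_len);
        return layer * layer_size() + layer_data_size() + head * 2 * head_param_size()
               + token * token_param_size();
    }
    int v_param(int layer, int head, int token) const
    {
        return k_param(layer, head, token) + head_param_size();
    }
    // Total bytes of one block that spans `layer_num` layers.
    int block_size(int layer_num) const { return layer_num * layer_size(); }
};
```

In this toy scheme, 8-bit storage with a head dimension of 128 needs 128 bytes per token per head, and 4-bit storage halves that to 64 bytes.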
Usage
Used internally by BlockIterator and BlockIteratorFactory to navigate the paged KV cache during attention kernel execution.
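Navigating a paged cache generally means splitting a sequence-level timestep into a block index and an offset inside that block. The sketch below shows that split in isolation. `BlockCoord`, `locate`, and `block_for` are hypothetical names, and the snippet does not reproduce the actual logic of BlockIterator.

```cpp
#include <cassert>

// Hypothetical helper type; not part of block.h.
struct BlockCoord {
    int block_id;  // index into the array of block pointers
    int block_ti;  // token offset inside that block
};

// Split a sequence-level timestep into (block index, in-block offset).
inline BlockCoord locate(int ti, int block_len)
{
    assert(block_len > 0 && ti >= 0);
    return BlockCoord{ti / block_len, ti % block_len};
}

// Pick the block that holds timestep `ti` from an array of block pointers.
inline char* block_for(char** block_ptrs, int ti, int block_len)
{
    return block_ptrs[locate(ti, block_len).block_id];
}
```

With 64 tokens per block, timestep 130 falls in the third block (index 2) at in-block offset 2.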
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/attention/block.h
- Lines: 1-237
Signature
namespace turbomind::block {

template<class T, class Tkv, int HeadDim>
struct Config {
    int head_num_;
    int block_len_;

    TM_HOST_DEVICE constexpr int t_bits() const;
    TM_HOST_DEVICE constexpr int q_bits() const;
    TM_HOST_DEVICE constexpr int head_dim() const;
    TM_HOST_DEVICE int head_num() const;
    TM_HOST_DEVICE constexpr int block_len() const;
};

template<class T, class Tkv, class Layout>
class Head {
public:
    TM_HOST_DEVICE Head(Layout layout, int layer_id, int head_id);

    TM_HOST_DEVICE auto k_data(char* block, int ti) const;
    TM_HOST_DEVICE auto v_data(char* block, int ti) const;
    TM_HOST_DEVICE T* k_param(char* block, int ti) const;
    TM_HOST_DEVICE T* v_param(char* block, int ti) const;

    template<class Func>
    TM_HOST_DEVICE auto with(char** block_ptrs, int ti, Func&& func) const;
};

template<class Config_>
struct Layout {
    TM_HOST_DEVICE int k_data(int layer, int head, int token) const;
    TM_HOST_DEVICE int v_data(int layer, int head, int token) const;
    TM_HOST_DEVICE int k_param(int layer, int head, int token) const;
    TM_HOST_DEVICE int v_param(int layer, int head, int token) const;
    TM_HOST_DEVICE int block_size(int layer_num) const;
};

}  // namespace turbomind::block
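The t_bits() and q_bits() accessors report bit widths used for sizing the cache. One plausible reading, sketched below with standard C++ only, is that q_bits() is the width of `Tkv` and t_bits() is zero when the cache is stored unquantized (`T` and `Tkv` identical). That rule, along with the names `ToyConfig`, `bits_of`, and `token_data_bytes`, is an assumption for illustration; block.h holds the actual definitions.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical helper: width of a byte-aligned type in bits.
template<class U>
constexpr int bits_of()
{
    return static_cast<int>(sizeof(U) * 8);
}

// Hypothetical host-only stand-in for block::Config.
template<class T, class Tkv, int HeadDim>
struct ToyConfig {
    int head_num_;
    int block_len_;

    // Assumed rule: no parameters are stored when K/V keep the compute type.
    constexpr int t_bits() const { return std::is_same<T, Tkv>::value ? 0 : bits_of<T>(); }
    constexpr int q_bits() const { return bits_of<Tkv>(); }
    constexpr int head_dim() const { return HeadDim; }
    int head_num() const { return head_num_; }
    int block_len() const { return block_len_; }
};

// Bytes needed for one token of one head's K (or V) data.
template<class Cfg>
int token_data_bytes(const Cfg& cfg)
{
    assert(cfg.q_bits() * cfg.head_dim() % 8 == 0);
    return cfg.q_bits() * cfg.head_dim() / 8;
}
```

Here `std::uint16_t` can stand in for `half` on the host, since both are 16 bits wide.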
Import
#include "src/turbomind/kernels/attention/block.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| head_num_ | int | Yes | Number of KV heads |
| block_len_ | int | Yes | Number of tokens per cache block |
| layer_id | int | Yes | Transformer layer index |
| head_id | int | Yes | KV head index |
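| block | char* | Yes | Base pointer of a single cache block (k_data, v_data, k_param, v_param) |
| ti | int | Yes | Token (timestep) index to resolve |
| block_ptrs | char** | Yes | Array of pointers to cache blocks (used by with) |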
Outputs
| Name | Type | Description |
|---|---|---|
| k_data / v_data | Tkv* or SubBytePtr<Tkv> | Typed pointer to key/value data within a block |
| k_param / v_param | T* | Pointer to quantization scale/zero-point parameters |
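With 4-bit storage two elements share one byte, so a raw `Tkv*` cannot address a single element, which is why a pointer-like wrapper is returned for sub-byte types. The sketch below shows the general idea with a hypothetical `NibblePtr`. It is not TurboMind's SubBytePtr, and the low-nibble-first packing order is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 4-bit "pointer": a byte address plus element-index arithmetic.
struct NibblePtr {
    std::uint8_t* base;

    // Read element i; even indices live in the low nibble (assumed order).
    std::uint8_t load(int i) const
    {
        assert(i >= 0);
        const std::uint8_t byte = base[i / 2];
        return static_cast<std::uint8_t>((i % 2 == 0) ? (byte & 0x0F) : (byte >> 4));
    }

    // Write the low 4 bits of `value` to element i, leaving its neighbour intact.
    void store(int i, std::uint8_t value) const
    {
        assert(i >= 0);
        std::uint8_t& byte = base[i / 2];
        if (i % 2 == 0) {
            byte = static_cast<std::uint8_t>((byte & 0xF0) | (value & 0x0F));
        }
        else {
            byte = static_cast<std::uint8_t>((byte & 0x0F) | ((value & 0x0F) << 4));
        }
    }
};
```

Element index `i` maps to byte `i / 2`, so a head of 128 four-bit values occupies 64 bytes.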
Usage Examples
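using namespace turbomind;
// half-precision parameters, 8-bit K/V storage, head dimension 128
using Cfg = block::Config<half, uint8_t, 128>;
// 8 KV heads, 64 tokens per cache block
block::Layout<Cfg> layout{Cfg{8, 64}};
block::Head<half, uint8_t, decltype(layout)> head{layout, layer_id, kv_head_idx};
// block_ptr is the char* base of one cache block; timestep is the token index
auto k_ptr = head.k_data(block_ptr, timestep);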