Implementation:InternLM Lmdeploy KvCacheUtils
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, Attention |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Kernel launch declarations for writing K/V tensors into paged cache blocks (ProcessKV) and reading them back to linear memory (FlattenKV), with RoPE application and quantization support.
Description
This header declares two families of CUDA kernel wrappers: invokeProcessKV_v2 writes new K/V data into paged cache blocks, applying rotary positional embedding and optional quantization during the write; invokeFlattenKV_v2 reads KV data from paged blocks back into contiguous linear buffers (used for debugging or fallback paths). Convenience wrappers (invokeProcessKV_v2_, invokeFlattenKV_v2_) extract parameters directly from an AttentionParams struct. The get_cache_block_size utility computes the byte size of a single cache block given model configuration.
Usage
Called during the prefill phase to populate the paged KV cache before attention computation, or to export cache contents for linear-iterator-based attention or debugging.
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/attention/kv_cache_utils_v2.h
- Lines: 1-115
Signature
template<class T>
void invokeProcessKV_v2(
char** blocks, const T* k, const T* v,
const T* k_bias, const T* v_bias,
const int* cu_q_len, const int* cu_k_len,
const int* cu_block_num,
const RopeKernelParam& rope_param,
int64_t stride_b, int64_t stride_c, int64_t stride_h, int64_t stride_s,
int block_seq_len, int layer_id,
int cp_rank, cutlass::FastDivmod cp_size,
int max_q_len, int head_num, int head_dim,
int batch_size, int quant_policy,
cudaStream_t stream = {});
template<class T>
void invokeProcessKV_v2_(const AttentionParams<T>& params);
template<class T>
void invokeFlattenKV_v2(
T* k, T* v, char** blocks,
const int* cu_k_len, const int* cu_block_num,
const RopeKernelParam& rope_param,
int64_t stride_b, int64_t stride_c, int64_t stride_h, int64_t stride_s,
int block_seq_len, int layer_id,
int cp_rank, cutlass::FastDivmod cp_size,
int max_seq_len, int head_num, int head_dim,
int batch_size, int quant_policy,
cudaStream_t stream = {});
template<class T>
void invokeFlattenKV_v2_(const AttentionParams<T>& params, int sum_k_len);
size_t get_cache_block_size(
DataType dtype, DataType kvtype,
int layer_num, int head_num, int head_dim, int block_seq_len);
Import
#include "src/turbomind/kernels/attention/kv_cache_utils_v2.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| blocks | char** | Yes | Array of paged cache block pointers |
| k, v | const T* | Yes | Key and value tensors to write to cache |
| cu_q_len | const int* | Yes | Cumulative query lengths |
| cu_k_len | const int* | Yes | Cumulative key lengths |
| rope_param | RopeKernelParam | Yes | Rotary embedding parameters |
| quant_policy | int | Yes | Quantization policy (0, 4, 8) |
Outputs
| Name | Type | Description |
|---|---|---|
| blocks (modified) | char** | KV data written into paged cache blocks |
| k, v (FlattenKV) | T* | Linearized KV data read from blocks |
Usage Examples
// Write K/V to paged cache during prefill
invokeProcessKV_v2_(attention_params);
// Query cache block size for memory allocation
size_t block_bytes = get_cache_block_size(
DataType::TYPE_FP16, DataType::TYPE_INT8,
32, 8, 128, 64);