Implementation:InternLM Lmdeploy KvCacheUtils

Knowledge Sources	InternLM_Lmdeploy
Domains	GPU_Kernels, Attention
Last Updated	2026-02-07 15:00 GMT

Overview

Kernel launch declarations for writing K/V tensors into paged cache blocks (ProcessKV) and reading them back to linear memory (FlattenKV), with RoPE application and quantization support.

Description

This header declares two families of CUDA kernel wrappers: invokeProcessKV_v2 writes new K/V data into paged cache blocks, applying rotary positional embedding and optional quantization during the write; invokeFlattenKV_v2 reads KV data from paged blocks back into contiguous linear buffers (used for debugging or fallback paths). Convenience wrappers (invokeProcessKV_v2_, invokeFlattenKV_v2_) extract parameters directly from an AttentionParams struct. The get_cache_block_size utility computes the byte size of a single cache block given model configuration.

Usage

Called during the prefill phase to populate the paged KV cache before attention computation, or to export cache contents for linear-iterator-based attention or debugging.

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/kernels/attention/kv_cache_utils_v2.h
Lines: 1-115

Signature

template<class T>
void invokeProcessKV_v2(
    char** blocks, const T* k, const T* v,
    const T* k_bias, const T* v_bias,
    const int* cu_q_len, const int* cu_k_len,
    const int* cu_block_num,
    const RopeKernelParam& rope_param,
    int64_t stride_b, int64_t stride_c, int64_t stride_h, int64_t stride_s,
    int block_seq_len, int layer_id,
    int cp_rank, cutlass::FastDivmod cp_size,
    int max_q_len, int head_num, int head_dim,
    int batch_size, int quant_policy,
    cudaStream_t stream = {});

template<class T>
void invokeProcessKV_v2_(const AttentionParams<T>& params);

template<class T>
void invokeFlattenKV_v2(
    T* k, T* v, char** blocks,
    const int* cu_k_len, const int* cu_block_num,
    const RopeKernelParam& rope_param,
    int64_t stride_b, int64_t stride_c, int64_t stride_h, int64_t stride_s,
    int block_seq_len, int layer_id,
    int cp_rank, cutlass::FastDivmod cp_size,
    int max_seq_len, int head_num, int head_dim,
    int batch_size, int quant_policy,
    cudaStream_t stream = {});

template<class T>
void invokeFlattenKV_v2_(const AttentionParams<T>& params, int sum_k_len);

size_t get_cache_block_size(
    DataType dtype, DataType kvtype,
    int layer_num, int head_num, int head_dim, int block_seq_len);

Import

#include "src/turbomind/kernels/attention/kv_cache_utils_v2.h"

I/O Contract

Inputs

Name	Type	Required	Description
blocks	char**	Yes	Array of paged cache block pointers
k, v	const T*	Yes	Key and value tensors to write to cache
cu_q_len	const int*	Yes	Cumulative query lengths
cu_k_len	const int*	Yes	Cumulative key lengths
rope_param	RopeKernelParam	Yes	Rotary embedding parameters
quant_policy	int	Yes	Quantization policy (0, 4, 8)

Outputs

Name	Type	Description
blocks (modified)	char**	KV data written into paged cache blocks
k, v (FlattenKV)	T*	Linearized KV data read from blocks

Usage Examples

// Write K/V to paged cache during prefill
invokeProcessKV_v2_(attention_params);

// Query cache block size for memory allocation
size_t block_bytes = get_cache_block_size(
    DataType::TYPE_FP16, DataType::TYPE_INT8,
    32, 8, 128, 64);

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment