Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:InternLM Lmdeploy KvCacheUtils

From Leeroopedia


Knowledge Sources
Domains GPU_Kernels, Attention
Last Updated 2026-02-07 15:00 GMT

Overview

Kernel launch declarations for writing K/V tensors into paged cache blocks (ProcessKV) and reading them back to linear memory (FlattenKV), with RoPE application and quantization support.

Description

This header declares two families of CUDA kernel wrappers: invokeProcessKV_v2 writes new K/V data into paged cache blocks, applying rotary positional embedding and optional quantization during the write; invokeFlattenKV_v2 reads KV data from paged blocks back into contiguous linear buffers (used for debugging or fallback paths). Convenience wrappers (invokeProcessKV_v2_, invokeFlattenKV_v2_) extract parameters directly from an AttentionParams struct. The get_cache_block_size utility computes the byte size of a single cache block given model configuration.

Usage

Called during the prefill phase to populate the paged KV cache before attention computation, or to export cache contents for linear-iterator-based attention or debugging.

Code Reference

Source Location

Signature

template<class T>
void invokeProcessKV_v2(
    char** blocks, const T* k, const T* v,
    const T* k_bias, const T* v_bias,
    const int* cu_q_len, const int* cu_k_len,
    const int* cu_block_num,
    const RopeKernelParam& rope_param,
    int64_t stride_b, int64_t stride_c, int64_t stride_h, int64_t stride_s,
    int block_seq_len, int layer_id,
    int cp_rank, cutlass::FastDivmod cp_size,
    int max_q_len, int head_num, int head_dim,
    int batch_size, int quant_policy,
    cudaStream_t stream = {});

template<class T>
void invokeProcessKV_v2_(const AttentionParams<T>& params);

template<class T>
void invokeFlattenKV_v2(
    T* k, T* v, char** blocks,
    const int* cu_k_len, const int* cu_block_num,
    const RopeKernelParam& rope_param,
    int64_t stride_b, int64_t stride_c, int64_t stride_h, int64_t stride_s,
    int block_seq_len, int layer_id,
    int cp_rank, cutlass::FastDivmod cp_size,
    int max_seq_len, int head_num, int head_dim,
    int batch_size, int quant_policy,
    cudaStream_t stream = {});

template<class T>
void invokeFlattenKV_v2_(const AttentionParams<T>& params, int sum_k_len);

size_t get_cache_block_size(
    DataType dtype, DataType kvtype,
    int layer_num, int head_num, int head_dim, int block_seq_len);

Import

#include "src/turbomind/kernels/attention/kv_cache_utils_v2.h"

I/O Contract

Inputs

Name Type Required Description
blocks char** Yes Array of paged cache block pointers
k, v const T* Yes Key and value tensors to write to cache
cu_q_len const int* Yes Cumulative query lengths
cu_k_len const int* Yes Cumulative key lengths
rope_param RopeKernelParam Yes Rotary embedding parameters
quant_policy int Yes Quantization policy (0, 4, 8)

Outputs

Name Type Description
blocks (modified) char** KV data written into paged cache blocks
k, v (FlattenKV) T* Linearized KV data read from blocks

Usage Examples

// Write K/V to paged cache during prefill
invokeProcessKV_v2_(attention_params);

// Query cache block size for memory allocation
size_t block_bytes = get_cache_block_size(
    DataType::TYPE_FP16, DataType::TYPE_INT8,
    32, 8, 128, 64);

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment