Implementation:InternLM Lmdeploy AttentionParams

Knowledge Sources	InternLM_Lmdeploy
Domains	GPU_Kernels, Attention
Last Updated	2026-02-07 15:00 GMT

Overview

Defines the parameter structs passed to TurboMind attention kernels, including buffer pointers, sequence metadata, rotary embedding configuration, and split-K reduction state.

Description

This header declares AttentionParams<T>, the central parameter struct that carries all runtime information needed by attention kernels. It bundles token-level buffers (Q, K, V, output), optional biases, sequence cumulative lengths, rotary embedding parameters, quantization policy, context-parallel settings, and split-K partial output buffers. Two additional iterator parameter structs (LinearIteratorParams and BlockIteratorParams) encapsulate KV cache access modes. A factory helper CreateCacheIterFactory constructs cache iterator factories from these params.

Usage

Populate an AttentionParams<T> instance with pointers to device memory for Q/K/V tensors, cumulative sequence lengths, and model hyperparameters before passing it to invokeAttention or invokeDecoding.

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/kernels/attention/attention_params.h
Lines: 1-112

Signature

struct LinearIteratorParams {
    const void* kv_cache;
    int         stride_h;
    int         key_to_val;
};

struct BlockIteratorParams {
    char**     block_ptrs;
    const int* cu_block_nums;
    int        layer_id;
    int        block_len;
};

template<typename T>
struct AttentionParams {
    T*      out;
    T*      q;
    T*      k;
    T*      v;
    int64_t stride;

    T* q_bias;
    T* k_bias;
    T* v_bias;

    const int*   cu_q_len;
    const int*   cu_k_len;
    const bool*  finished;
    const float* rope_theta;

    LinearIteratorParams linear_iter_params;
    BlockIteratorParams  block_iter_params;

    int token_num;
    int batch_size;
    int max_q_len;
    int max_k_len;

    int   num_heads;
    int   num_kv_heads;
    int   size_per_head;
    float inv_sqrt_dh;
    int   window_size;

    RopeKernelParam rope_param;
    bool use_logn_attn;
    int  max_position_embeddings;
    int  quant_policy;

    int    max_split_k;
    int*   split_cnt;
    float* partial_O;
    float* partial_ML;

    int                 cp_rank;
    cutlass::FastDivmod cp_size;
    int                 arch;
    cudaStream_t        stream;
};

template<class CacheIterFactory, class SFINAE = void>
struct CreateCacheIterFactory {
    template<class Param>
    static CacheIterFactory apply(const Param& param);
};

Import

#include "src/turbomind/kernels/attention/attention_params.h"

I/O Contract

Inputs

Name	Type	Required	Description
q	T*	Yes	Query tensor, layout [B, qH+2kvH, D] or [B, kvH, D]
k	T*	Yes	Key tensor
v	T*	Yes	Value tensor
stride	int64_t	Yes	Stride between tokens in the QKV buffer
cu_q_len	const int*	Yes	Cumulative query lengths per batch
cu_k_len	const int*	Yes	Cumulative key lengths per batch
finished	const bool*	Yes	Per-sequence finished flags
rope_param	RopeKernelParam	Yes	Rotary embedding configuration
num_heads	int	Yes	Number of query attention heads
num_kv_heads	int	Yes	Number of KV attention heads
size_per_head	int	Yes	Head dimension
inv_sqrt_dh	float	Yes	1/sqrt(head_dim) scaling factor
quant_policy	int	Yes	KV cache quantization policy (0=none, 4=int4, 8=int8)

Outputs

Name	Type	Description
out	T*	Output attention result, [token_num, num_heads, head_dim]
partial_O	float*	Partial output for split-K reduction
partial_ML	float*	Partial max/log-sum-exp for split-K reduction
split_cnt	int*	Actual split counts per token (for reduction kernel)

Usage Examples

AttentionParams<half> params;
params.q = d_q;
params.k = d_k;
params.v = d_v;
params.out = d_out;
params.stride = (num_heads + 2 * num_kv_heads) * head_dim;
params.cu_q_len = d_cu_q_len;
params.cu_k_len = d_cu_k_len;
params.num_heads = 32;
params.num_kv_heads = 8;
params.size_per_head = 128;
params.inv_sqrt_dh = 1.0f / sqrtf(128.0f);
params.stream = stream;

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment