Implementation:InternLM Lmdeploy AttentionParams
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, Attention |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Defines the parameter structs passed to TurboMind attention kernels, including buffer pointers, sequence metadata, rotary embedding configuration, and split-K reduction state.
Description
This header declares AttentionParams<T>, the central parameter struct that carries all runtime information needed by attention kernels. It bundles token-level buffers (Q, K, V, output), optional biases, sequence cumulative lengths, rotary embedding parameters, quantization policy, context-parallel settings, and split-K partial output buffers. Two additional iterator parameter structs (LinearIteratorParams and BlockIteratorParams) encapsulate KV cache access modes. A factory helper CreateCacheIterFactory constructs cache iterator factories from these params.
Usage
Populate an AttentionParams<T> instance with pointers to device memory for Q/K/V tensors, cumulative sequence lengths, and model hyperparameters before passing it to invokeAttention or invokeDecoding.
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/attention/attention_params.h
- Lines: 1-112
Signature
struct LinearIteratorParams {
const void* kv_cache;
int stride_h;
int key_to_val;
};
struct BlockIteratorParams {
char** block_ptrs;
const int* cu_block_nums;
int layer_id;
int block_len;
};
template<typename T>
struct AttentionParams {
T* out;
T* q;
T* k;
T* v;
int64_t stride;
T* q_bias;
T* k_bias;
T* v_bias;
const int* cu_q_len;
const int* cu_k_len;
const bool* finished;
const float* rope_theta;
LinearIteratorParams linear_iter_params;
BlockIteratorParams block_iter_params;
int token_num;
int batch_size;
int max_q_len;
int max_k_len;
int num_heads;
int num_kv_heads;
int size_per_head;
float inv_sqrt_dh;
int window_size;
RopeKernelParam rope_param;
bool use_logn_attn;
int max_position_embeddings;
int quant_policy;
int max_split_k;
int* split_cnt;
float* partial_O;
float* partial_ML;
int cp_rank;
cutlass::FastDivmod cp_size;
int arch;
cudaStream_t stream;
};
template<class CacheIterFactory, class SFINAE = void>
struct CreateCacheIterFactory {
template<class Param>
static CacheIterFactory apply(const Param& param);
};
Import
#include "src/turbomind/kernels/attention/attention_params.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| q | T* | Yes | Query tensor, layout [B, qH+2kvH, D] or [B, kvH, D] |
| k | T* | Yes | Key tensor |
| v | T* | Yes | Value tensor |
| stride | int64_t | Yes | Stride between tokens in the QKV buffer |
| cu_q_len | const int* | Yes | Cumulative query lengths per batch |
| cu_k_len | const int* | Yes | Cumulative key lengths per batch |
| finished | const bool* | Yes | Per-sequence finished flags |
| rope_param | RopeKernelParam | Yes | Rotary embedding configuration |
| num_heads | int | Yes | Number of query attention heads |
| num_kv_heads | int | Yes | Number of KV attention heads |
| size_per_head | int | Yes | Head dimension |
| inv_sqrt_dh | float | Yes | 1/sqrt(head_dim) scaling factor |
| quant_policy | int | Yes | KV cache quantization policy (0=none, 4=int4, 8=int8) |
Outputs
| Name | Type | Description |
|---|---|---|
| out | T* | Output attention result, [token_num, num_heads, head_dim] |
| partial_O | float* | Partial output for split-K reduction |
| partial_ML | float* | Partial max/log-sum-exp for split-K reduction |
| split_cnt | int* | Actual split counts per token (for reduction kernel) |
Usage Examples
AttentionParams<half> params;
params.q = d_q;
params.k = d_k;
params.v = d_v;
params.out = d_out;
params.stride = (num_heads + 2 * num_kv_heads) * head_dim;
params.cu_q_len = d_cu_q_len;
params.cu_k_len = d_cu_k_len;
params.num_heads = 32;
params.num_kv_heads = 8;
params.size_per_head = 128;
params.inv_sqrt_dh = 1.0f / sqrtf(128.0f);
params.stream = stream;