Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:InternLM Lmdeploy AttentionParams

From Leeroopedia


Knowledge Sources
Domains GPU_Kernels, Attention
Last Updated 2026-02-07 15:00 GMT

Overview

Defines the parameter structs passed to TurboMind attention kernels, including buffer pointers, sequence metadata, rotary embedding configuration, and split-K reduction state.

Description

This header declares AttentionParams<T>, the central parameter struct that carries all runtime information needed by attention kernels. It bundles token-level buffers (Q, K, V, output), optional biases, sequence cumulative lengths, rotary embedding parameters, quantization policy, context-parallel settings, and split-K partial output buffers. Two additional iterator parameter structs (LinearIteratorParams and BlockIteratorParams) encapsulate KV cache access modes. A factory helper CreateCacheIterFactory constructs cache iterator factories from these params.

Usage

Populate an AttentionParams<T> instance with pointers to device memory for Q/K/V tensors, cumulative sequence lengths, and model hyperparameters before passing it to invokeAttention or invokeDecoding.

Code Reference

Source Location

Signature

struct LinearIteratorParams {
    const void* kv_cache;
    int         stride_h;
    int         key_to_val;
};

struct BlockIteratorParams {
    char**     block_ptrs;
    const int* cu_block_nums;
    int        layer_id;
    int        block_len;
};

template<typename T>
struct AttentionParams {
    T*      out;
    T*      q;
    T*      k;
    T*      v;
    int64_t stride;

    T* q_bias;
    T* k_bias;
    T* v_bias;

    const int*   cu_q_len;
    const int*   cu_k_len;
    const bool*  finished;
    const float* rope_theta;

    LinearIteratorParams linear_iter_params;
    BlockIteratorParams  block_iter_params;

    int token_num;
    int batch_size;
    int max_q_len;
    int max_k_len;

    int   num_heads;
    int   num_kv_heads;
    int   size_per_head;
    float inv_sqrt_dh;
    int   window_size;

    RopeKernelParam rope_param;
    bool use_logn_attn;
    int  max_position_embeddings;
    int  quant_policy;

    int    max_split_k;
    int*   split_cnt;
    float* partial_O;
    float* partial_ML;

    int                 cp_rank;
    cutlass::FastDivmod cp_size;
    int                 arch;
    cudaStream_t        stream;
};

template<class CacheIterFactory, class SFINAE = void>
struct CreateCacheIterFactory {
    template<class Param>
    static CacheIterFactory apply(const Param& param);
};

Import

#include "src/turbomind/kernels/attention/attention_params.h"

I/O Contract

Inputs

Name Type Required Description
q T* Yes Query tensor, layout [B, qH+2kvH, D] or [B, kvH, D]
k T* Yes Key tensor
v T* Yes Value tensor
stride int64_t Yes Stride between tokens in the QKV buffer
cu_q_len const int* Yes Cumulative query lengths per batch
cu_k_len const int* Yes Cumulative key lengths per batch
finished const bool* Yes Per-sequence finished flags
rope_param RopeKernelParam Yes Rotary embedding configuration
num_heads int Yes Number of query attention heads
num_kv_heads int Yes Number of KV attention heads
size_per_head int Yes Head dimension
inv_sqrt_dh float Yes 1/sqrt(head_dim) scaling factor
quant_policy int Yes KV cache quantization policy (0=none, 4=int4, 8=int8)

Outputs

Name Type Description
out T* Output attention result, [token_num, num_heads, head_dim]
partial_O float* Partial output for split-K reduction
partial_ML float* Partial max/log-sum-exp for split-K reduction
split_cnt int* Actual split counts per token (for reduction kernel)

Usage Examples

AttentionParams<half> params;
params.q = d_q;
params.k = d_k;
params.v = d_v;
params.out = d_out;
params.stride = (num_heads + 2 * num_kv_heads) * head_dim;
params.cu_q_len = d_cu_q_len;
params.cu_k_len = d_cu_k_len;
params.num_heads = 32;
params.num_kv_heads = 8;
params.size_per_head = 128;
params.inv_sqrt_dh = 1.0f / sqrtf(128.0f);
params.stream = stream;

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment