Implementation:InternLM Lmdeploy AttentionReference

Knowledge Sources	InternLM_Lmdeploy
Domains	GPU_Kernels, Attention
Last Updated	2026-02-07 15:00 GMT

Overview

A cuBLAS-based reference attention implementation for validation and debugging. It computes Q*K^T and softmax(Q*K^T)*V as separate, unfused steps and exposes the intermediate results.

Description

The Reference<T> class implements multi-head attention using cuBLAS GEMM operations rather than fused kernels. It is used for correctness verification of the fused attention implementations. The Reshape method allocates internal buffers (mask, QK scores, probabilities, separated Q/K/V) based on problem dimensions. The Execute method runs the full attention pipeline: applies rotary embedding to K via invokeApplyRotaryEmbedding, performs batched Q*K^T GEMM, applies causal masking and softmax, then computes P*V GEMM. Intermediate results (QK scores, probabilities, mask) are accessible via qk(), pr(), and mask() getters for comparison with fused kernel outputs.
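To make the unfused pipeline concrete, here is a minimal CPU sketch of the same three steps (scores, masked softmax, weighted sum of values) for a single head. It is not the lmdeploy code: the name naive_attention, the 1/sqrt(d) scaling, and the assumption that the queries are the last q_len positions of the sequence are all illustrative choices, and rotary embedding, the sliding window, and sinks are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical CPU sketch of unfused single-head attention; not the lmdeploy code.
// q is [q_len, d]; k and v are [k_len, d]; all row-major. Returns [q_len, d].
// Assumption: the queries are the last q_len positions of the sequence, so
// query i may attend to keys 0 .. (k_len - q_len + i).
std::vector<float> naive_attention(const std::vector<float>& q,
                                   const std::vector<float>& k,
                                   const std::vector<float>& v,
                                   std::size_t q_len, std::size_t k_len, std::size_t d)
{
    assert(q.size() == q_len * d);
    assert(k.size() == k_len * d && v.size() == k_len * d);
    assert(q_len >= 1 && q_len <= k_len && d >= 1);

    const float scale = 1.0f / std::sqrt(static_cast<float>(d));  // assumed scaling
    std::vector<float> out(q_len * d, 0.0f);
    std::vector<float> p(k_len, 0.0f);  // one row of scores, then probabilities

    for (std::size_t i = 0; i < q_len; ++i) {
        const std::size_t last = k_len - q_len + i;  // causal mask: last visible key

        // Step 1: one row of Q*K^T, scaled.
        float row_max = -std::numeric_limits<float>::infinity();
        for (std::size_t j = 0; j <= last; ++j) {
            float dot = 0.0f;
            for (std::size_t c = 0; c < d; ++c) {
                dot += q[i * d + c] * k[j * d + c];
            }
            p[j]    = dot * scale;
            row_max = std::max(row_max, p[j]);
        }

        // Step 2: numerically stable softmax over the visible keys only.
        float sum = 0.0f;
        for (std::size_t j = 0; j <= last; ++j) {
            p[j] = std::exp(p[j] - row_max);
            sum += p[j];
        }

        // Step 3: one row of P*V.
        for (std::size_t j = 0; j <= last; ++j) {
            const float w = p[j] / sum;
            for (std::size_t c = 0; c < d; ++c) {
                out[i * d + c] += w * v[j * d + c];
            }
        }
    }
    return out;
}
```

Keeping the score row and the probability row as separate, inspectable values mirrors why an unfused reference is useful: each stage can be checked on its own before the final output is compared.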

Usage

Used in testing and debugging to generate ground-truth attention outputs that can be compared against the fused kernel results. Not used in production inference.
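A comparison against ground truth might look like the sketch below, which works on host-side copies of the two outputs converted to float. The names DiffStats and compare_outputs and the tolerance scheme are assumptions for illustration, not part of lmdeploy.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical summary of an element-wise comparison.
struct DiffStats {
    float       max_abs_diff = 0.0f;  // largest |ref - fused|
    std::size_t worst_index  = 0;     // index of that largest difference
    std::size_t mismatches   = 0;     // elements outside the tolerance
};

// Compares host-side copies of the reference and fused outputs.
// An element matches when |ref - fused| <= atol + rtol * |ref|.
DiffStats compare_outputs(const std::vector<float>& ref,
                          const std::vector<float>& fused,
                          float atol, float rtol)
{
    assert(ref.size() == fused.size());
    DiffStats stats;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const float diff = std::fabs(ref[i] - fused[i]);
        if (diff > stats.max_abs_diff) {
            stats.max_abs_diff = diff;
            stats.worst_index  = i;
        }
        // Written as !(diff <= tol) so that a NaN difference counts as a mismatch.
        if (!(diff <= atol + rtol * std::fabs(ref[i]))) {
            ++stats.mismatches;
        }
    }
    return stats;
}
```

Half-precision kernels rarely match bit for bit, so the tolerances have to be chosen for the data type and sequence length under test.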

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/kernels/attention/reference.h
Lines: 1-82

Signature

namespace turbomind {

template<class T>
void invokeApplyRotaryEmbedding(
    T* k_cache, int max_k_len, int head_num, int head_dim,
    float rope_base, int rope_dim, int batch_size,
    cudaStream_t stream = {});

template<class T>
class Reference {
public:
    explicit Reference(cudaStream_t stream);

    void Reshape(size_t max_q_len, size_t max_k_len,
                 size_t head_num, size_t head_dim,
                 size_t kv_head_num, size_t batch_size,
                 int window_size);

    void Execute(T* output, T* k_cache, T* v_cache,
                 const T* qkv, const T* qkv_bias,
                 const T* sinks,
                 float rope_base, int rope_dim);

    const float* qk() const;
    const T* pr() const;
    const T* mask() const;

private:
    cudaStream_t    stream_;
    cublasHandle_t  cublas_;
    // Internal buffers for mask, QK scores, probabilities, etc.
};

} // namespace turbomind

Import

#include "src/turbomind/kernels/attention/reference.h"

I/O Contract

Inputs

Name	Type	Required	Description
output	T*	Yes	Output buffer for attention result
k_cache	T*	Yes	Key cache (linear layout)
v_cache	T*	Yes	Value cache (linear layout)
qkv	const T*	Yes	Packed QKV input tensor
qkv_bias	const T*	No	Optional QKV bias
sinks	const T*	Unspecified	Attention sinks buffer (Execute parameter)
rope_base	float	Yes	RoPE base frequency
rope_dim	int	Yes	RoPE dimension
window_size	int	Yes	Sliding window size (passed to Reshape, not Execute)
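For readers unfamiliar with how rope_base and rope_dim are typically used, the sketch below applies the standard rotary embedding to one head vector on the CPU. It is an illustration only: the name apply_rope_sketch and the pairing of adjacent elements (2i, 2i+1) are assumptions, and the element layout used by invokeApplyRotaryEmbedding is not described on this page.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU sketch of rotary position embedding for one head vector.
// Rotates the first rope_dim elements of x in adjacent pairs (assumed layout);
// elements at index >= rope_dim are left unchanged.
void apply_rope_sketch(std::vector<float>& x, int pos, float rope_base, int rope_dim)
{
    assert(rope_dim > 0 && rope_dim % 2 == 0);
    assert(static_cast<std::size_t>(rope_dim) <= x.size());

    for (int i = 0; i < rope_dim; i += 2) {
        // Frequency for this pair: rope_base^(-i / rope_dim), with i = 0, 2, 4, ...
        const float inv_freq =
            std::pow(rope_base, -static_cast<float>(i) / static_cast<float>(rope_dim));
        const float angle = static_cast<float>(pos) * inv_freq;
        const float c     = std::cos(angle);
        const float s     = std::sin(angle);

        // 2-D rotation of the pair (x[i], x[i + 1]) by the angle above.
        const float x0 = x[i];
        const float x1 = x[i + 1];
        x[i]     = x0 * c - x1 * s;
        x[i + 1] = x0 * s + x1 * c;
    }
}
```

Because each pair is only rotated, the vector's norm is preserved, and position 0 leaves the vector unchanged.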

Outputs

Name	Type	Description
output	T*	Computed attention output
qk()	const float*	QK attention scores (for debugging)
pr()	const T*	Softmax probabilities (for debugging)
mask()	const T*	Causal attention mask (for debugging)

Usage Examples

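// Sketch: stream, the device buffers, and the dimension variables are assumed to exist.
turbomind::Reference<half> ref(stream);
// Reshape allocates the internal buffers, so call it before Execute.
ref.Reshape(max_q_len, max_k_len, num_heads, head_dim, num_kv_heads, batch_size, window_size);
ref.Execute(output, k_cache, v_cache, qkv, qkv_bias, sinks, rope_base, rope_dim);
// Compare output, ref.qk(), ref.pr(), and ref.mask() with the fused kernel's results.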

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime
