Implementation:InternLM Lmdeploy AttentionReference
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, Attention |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
cuBLAS-based reference attention implementation for validation and debugging. It computes Q*K^T and softmax(Q*K^T)*V as separate, unfused steps and exposes the intermediate results.
Description
The Reference<T> class template implements multi-head attention with cuBLAS GEMM operations instead of fused kernels, and serves to verify the correctness of the fused attention implementations. The Reshape method allocates the internal buffers (mask, QK scores, probabilities, and separated Q/K/V) according to the problem dimensions. The Execute method then runs the full attention pipeline: it applies rotary embedding to K via invokeApplyRotaryEmbedding, performs the batched Q*K^T GEMM, applies causal masking and softmax, and computes the P*V GEMM. The intermediate results (QK scores, probabilities, and mask) are available through the qk(), pr(), and mask() getters for comparison with the fused kernel's outputs.
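The unfused pipeline described above can be illustrated on the host for a single head. The following is a minimal sketch, not the code in reference.h: the function name `naive_attention`, the row-major layouts, the 1/sqrt(head_dim) score scaling, and the alignment of queries to the end of the key sequence are all assumptions made for illustration. It omits rotary embedding, bias, sinks, and windowing.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical host-side helper (not part of reference.h).
// q: [q_len, head_dim], k and v: [k_len, head_dim], all row-major.
// Returns the attention output with shape [q_len, head_dim].
std::vector<float> naive_attention(const std::vector<float>& q,
                                   const std::vector<float>& k,
                                   const std::vector<float>& v,
                                   std::size_t q_len,
                                   std::size_t k_len,
                                   std::size_t head_dim)
{
    assert(q.size() == q_len * head_dim);
    assert(k.size() == k_len * head_dim && v.size() == k_len * head_dim);
    assert(q_len <= k_len);

    // Assumption: standard scaled dot-product attention.
    const float scale = 1.0f / std::sqrt(static_cast<float>(head_dim));
    // Assumption: the queries are the last q_len positions of the key sequence.
    const std::size_t offset = k_len - q_len;

    std::vector<float> out(q_len * head_dim, 0.0f);
    std::vector<float> scores(k_len);

    for (std::size_t i = 0; i < q_len; ++i) {
        const std::size_t last = i + offset;  // causal mask: keys 0..last are visible

        // Step 1: scaled Q*K^T scores for the visible keys.
        float max_score = -std::numeric_limits<float>::infinity();
        for (std::size_t j = 0; j <= last; ++j) {
            float dot = 0.0f;
            for (std::size_t d = 0; d < head_dim; ++d) {
                dot += q[i * head_dim + d] * k[j * head_dim + d];
            }
            scores[j] = dot * scale;
            max_score = std::max(max_score, scores[j]);
        }

        // Step 2: numerically stable softmax over the visible keys.
        float sum = 0.0f;
        for (std::size_t j = 0; j <= last; ++j) {
            scores[j] = std::exp(scores[j] - max_score);
            sum += scores[j];
        }

        // Step 3: P*V, accumulating the probability-weighted value rows.
        for (std::size_t j = 0; j <= last; ++j) {
            const float p = scores[j] / sum;
            for (std::size_t d = 0; d < head_dim; ++d) {
                out[i * head_dim + d] += p * v[j * head_dim + d];
            }
        }
    }
    return out;
}
```

Keeping the three steps separate mirrors the reason for an unfused reference: each intermediate (scores, probabilities) can be inspected on its own when a fused kernel disagrees.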
Usage
Reference<T> is used in tests and debugging to generate ground-truth attention outputs for comparison against the fused kernel results. It is not used in production inference.
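Comparing a reference output with a fused kernel output usually comes down to an element-wise tolerance check. The sketch below shows one way to do that on the host; `max_abs_diff` and `all_close` are hypothetical helper names, not functions from lmdeploy, and the sketch assumes both buffers have already been copied from device memory and converted to float.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: largest absolute element-wise difference.
float max_abs_diff(const std::vector<float>& ref, const std::vector<float>& test)
{
    assert(ref.size() == test.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        worst = std::max(worst, std::fabs(ref[i] - test[i]));
    }
    return worst;
}

// Hypothetical helper: true if every element satisfies
// |ref - test| <= atol + rtol * |ref|.
bool all_close(const std::vector<float>& ref,
               const std::vector<float>& test,
               float atol,
               float rtol)
{
    assert(ref.size() == test.size());
    for (std::size_t i = 0; i < ref.size(); ++i) {
        if (std::fabs(ref[i] - test[i]) > atol + rtol * std::fabs(ref[i])) {
            return false;
        }
    }
    return true;
}
```

Suitable tolerances depend on the data type: half-precision outputs generally need looser bounds than single-precision ones.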
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/attention/reference.h
- Lines: 1-82
Signature
namespace turbomind {

template<class T>
void invokeApplyRotaryEmbedding(
    T* k_cache, int max_k_len, int head_num, int head_dim,
    float rope_base, int rope_dim, int batch_size,
    cudaStream_t stream = {});

template<class T>
class Reference {
public:
    explicit Reference(cudaStream_t stream);

    void Reshape(size_t max_q_len, size_t max_k_len,
                 size_t head_num, size_t head_dim,
                 size_t kv_head_num, size_t batch_size,
                 int window_size);

    void Execute(T* output, T* k_cache, T* v_cache,
                 const T* qkv, const T* qkv_bias,
                 const T* sinks,
                 float rope_base, int rope_dim);

    const float* qk() const;
    const T* pr() const;
    const T* mask() const;

private:
    cudaStream_t stream_;
    cublasHandle_t cublas_;
    // Internal buffers for mask, QK scores, probabilities, etc.
};

}  // namespace turbomind
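To make the rope_base and rope_dim parameters concrete, here is a minimal host-side sketch of rotary embedding for one head vector. It is not the CUDA kernel behind invokeApplyRotaryEmbedding: the name `apply_rope`, the pairing of adjacent elements (2i, 2i+1), and the frequency formula rope_base^(-2i/rope_dim) are assumptions based on the common RoPE formulation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side sketch (not part of reference.h).
// Rotates the first rope_dim elements of one head vector x (length head_dim)
// for a token at the given position. Elements past rope_dim are left untouched.
// Assumption: adjacent elements (2i, 2i+1) form a rotation pair.
void apply_rope(std::vector<float>& x,
                std::size_t head_dim,
                int rope_dim,
                float rope_base,
                std::size_t position)
{
    assert(x.size() == head_dim);
    assert(rope_dim > 0 && rope_dim % 2 == 0);
    assert(static_cast<std::size_t>(rope_dim) <= head_dim);

    for (int i = 0; i < rope_dim / 2; ++i) {
        // Assumed frequency schedule: rope_base^(-2i / rope_dim).
        const float inv_freq =
            std::pow(rope_base, -2.0f * static_cast<float>(i) / static_cast<float>(rope_dim));
        const float angle = static_cast<float>(position) * inv_freq;
        const float c = std::cos(angle);
        const float s = std::sin(angle);

        // Standard 2D rotation of the pair by the position-dependent angle.
        const float x0 = x[2 * i];
        const float x1 = x[2 * i + 1];
        x[2 * i] = x0 * c - x1 * s;
        x[2 * i + 1] = x0 * s + x1 * c;
    }
}
```

A token at position 0 is rotated by angle 0 and therefore left unchanged, which is a convenient sanity check for any RoPE implementation.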
Import
#include "src/turbomind/kernels/attention/reference.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| output | T* | Yes | Output buffer for attention result |
| k_cache | T* | Yes | Key cache (linear layout) |
| v_cache | T* | Yes | Value cache (linear layout) |
| qkv | const T* | Yes | Packed QKV input tensor |
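| qkv_bias | const T* | No | Optional QKV bias |
| sinks | const T* | Not stated | Attention sink values (parameter of Execute) |
| rope_base | float | Yes | RoPE base frequency |
| rope_dim | int | Yes | RoPE dimension |
| window_size | int | Yes | Sliding window size (passed to Reshape) |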
Outputs
| Name | Type | Description |
|---|---|---|
| output | T* | Computed attention output |
| qk() | const float* | QK attention scores (for debugging) |
| pr() | const T* | Softmax probabilities (for debugging) |
| mask() | const T* | Causal attention mask (for debugging) |
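The mask is described as causal, and Reshape takes a window_size, so one plausible construction combines both constraints. The sketch below is an illustration only: the helper name `build_causal_window_mask`, the 0/1 encoding, and the window rule (a query at absolute position p sees keys j with p - window_size < j <= p) are assumptions, not the layout or semantics used by the buffer behind mask().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of reference.h).
// Returns a row-major [q_len, k_len] mask where 1 marks a visible key.
// Assumption: the queries are the last q_len positions of the key sequence.
// Pass window_size >= k_len to get a plain causal mask.
std::vector<std::uint8_t> build_causal_window_mask(std::size_t q_len,
                                                   std::size_t k_len,
                                                   int window_size)
{
    assert(q_len <= k_len);
    assert(window_size > 0);

    const std::size_t offset = k_len - q_len;
    const std::size_t window = static_cast<std::size_t>(window_size);
    std::vector<std::uint8_t> mask(q_len * k_len, 0);

    for (std::size_t i = 0; i < q_len; ++i) {
        const std::size_t pos = i + offset;  // absolute position of query i
        // Causal constraint: only keys at or before pos are considered.
        for (std::size_t j = 0; j <= pos; ++j) {
            // Window constraint: key must be within the last `window` positions.
            if (pos - j < window) {
                mask[i * k_len + j] = 1;
            }
        }
    }
    return mask;
}
```

For example, with q_len = 2, k_len = 4, and window_size = 2, the first query (position 2) sees keys 1 and 2, and the second query (position 3) sees keys 2 and 3.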
Usage Examples
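// Construct the reference on an existing CUDA stream.
Reference<half> ref(stream);
// Size the internal buffers for this problem before calling Execute.
ref.Reshape(max_q_len, max_k_len, num_heads, head_dim, num_kv_heads, batch_size, window_size);
// Run the unfused pipeline; the attention result is written to output.
ref.Execute(output, k_cache, v_cache, qkv, qkv_bias, sinks, rope_base, rope_dim);
// Compare ref.qk() with the fused kernel's QK scores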