Implementation:InternLM Lmdeploy AttentionReference

From Leeroopedia


Knowledge Sources
Domains GPU_Kernels, Attention
Last Updated 2026-02-07 15:00 GMT

Overview

cuBLAS-based reference attention implementation for validation and debugging. It computes Q*K^T and softmax(Q*K^T)*V as separate, unfused steps and exposes the intermediate results.

Description

The Reference<T> class implements multi-head attention using cuBLAS GEMM operations rather than fused kernels. It is used for correctness verification of the fused attention implementations. The Reshape method allocates internal buffers (mask, QK scores, probabilities, separated Q/K/V) based on problem dimensions. The Execute method runs the full attention pipeline: applies rotary embedding to K via invokeApplyRotaryEmbedding, performs batched Q*K^T GEMM, applies causal masking and softmax, then computes P*V GEMM. Intermediate results (QK scores, probabilities, mask) are accessible via qk(), pr(), and mask() getters for comparison with fused kernel outputs.
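The pipeline described above (scores, causal mask, softmax, weighted sum over V) can be illustrated with a small CPU sketch for a single head. This is not the cuBLAS code from the class: `AttentionResult` and `reference_attention` are hypothetical names, the 1/sqrt(head_dim) scaling and the alignment of queries to the end of the key sequence are assumptions, and RoPE, bias, sinks, and the sliding window are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical container mirroring the idea of qk() / pr() / output.
struct AttentionResult {
    std::vector<float> qk;   // scaled scores, -inf where masked  [q_len, k_len]
    std::vector<float> pr;   // softmax probabilities             [q_len, k_len]
    std::vector<float> out;  // attention output                  [q_len, head_dim]
};

// Unfused attention for ONE head on the CPU (illustrative only).
// q is [q_len, head_dim]; k and v are [k_len, head_dim]; all row-major.
inline AttentionResult reference_attention(const std::vector<float>& q,
                                           const std::vector<float>& k,
                                           const std::vector<float>& v,
                                           std::size_t q_len,
                                           std::size_t k_len,
                                           std::size_t head_dim)
{
    assert(q.size() == q_len * head_dim);
    assert(k.size() == k_len * head_dim);
    assert(v.size() == k_len * head_dim);
    assert(q_len >= 1 && q_len <= k_len);

    // Assumption: standard 1/sqrt(d) scaling; queries sit at the end of the keys.
    const float       scale   = 1.0f / std::sqrt(static_cast<float>(head_dim));
    const float       neg_inf = -std::numeric_limits<float>::infinity();
    const std::size_t offset  = k_len - q_len;

    AttentionResult r;
    r.qk.assign(q_len * k_len, neg_inf);
    r.pr.assign(q_len * k_len, 0.0f);
    r.out.assign(q_len * head_dim, 0.0f);

    for (std::size_t i = 0; i < q_len; ++i) {
        const std::size_t last = i + offset;  // last key visible to query i

        // Step 1: scores S = scale * q_i . k_j for the causally visible keys.
        float row_max = neg_inf;
        for (std::size_t j = 0; j <= last; ++j) {
            float s = 0.0f;
            for (std::size_t d = 0; d < head_dim; ++d) {
                s += q[i * head_dim + d] * k[j * head_dim + d];
            }
            r.qk[i * k_len + j] = s * scale;
            row_max             = std::max(row_max, s * scale);
        }

        // Step 2: numerically stable softmax over the visible keys.
        float denom = 0.0f;
        for (std::size_t j = 0; j <= last; ++j) {
            const float e       = std::exp(r.qk[i * k_len + j] - row_max);
            r.pr[i * k_len + j] = e;
            denom += e;
        }

        // Step 3: normalize and accumulate O = P * V.
        for (std::size_t j = 0; j <= last; ++j) {
            r.pr[i * k_len + j] /= denom;
            for (std::size_t d = 0; d < head_dim; ++d) {
                r.out[i * head_dim + d] += r.pr[i * k_len + j] * v[j * head_dim + d];
            }
        }
    }
    return r;
}
```

Keeping the scores and probabilities as separate arrays is what makes this style useful for debugging: a mismatch can be localized to the score, softmax, or P*V stage. Whether the real `qk()` buffer holds scores before or after scaling and masking is not stated on this page.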

Usage

The class is used in tests and debugging sessions to generate ground-truth attention outputs for comparison against fused-kernel results. It is not used in production inference.
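Comparing a reference result with a fused-kernel result usually means a tolerance check rather than exact equality, since the two paths accumulate in a different order. A minimal sketch follows, assuming both buffers have already been copied to the host and converted to `float`; `max_abs_diff`, `all_close`, and the tolerance values in the usage note are hypothetical and not part of the library.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest element-wise absolute difference between two host buffers.
inline float max_abs_diff(const std::vector<float>& ref, const std::vector<float>& test)
{
    assert(ref.size() == test.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        worst = std::max(worst, std::fabs(test[i] - ref[i]));
    }
    return worst;
}

// True when every element satisfies |test - ref| <= atol + rtol * |ref|.
// Written as !(diff <= tol) so that a NaN on either side counts as a mismatch.
inline bool all_close(const std::vector<float>& ref,
                      const std::vector<float>& test,
                      float                     atol,
                      float                     rtol)
{
    assert(ref.size() == test.size());
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const float diff = std::fabs(test[i] - ref[i]);
        const float tol  = atol + rtol * std::fabs(ref[i]);
        if (!(diff <= tol)) {
            return false;
        }
    }
    return true;
}
```

A call such as `all_close(ref_out, fused_out, 1e-3f, 1e-2f)` would then serve as the pass/fail condition; suitable tolerances depend on the data type and sequence length and have to be chosen per test.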

Code Reference

Source Location

Signature

namespace turbomind {

template<class T>
void invokeApplyRotaryEmbedding(
    T* k_cache, int max_k_len, int head_num, int head_dim,
    float rope_base, int rope_dim, int batch_size,
    cudaStream_t stream = {});

template<class T>
class Reference {
public:
    explicit Reference(cudaStream_t stream);

    void Reshape(size_t max_q_len, size_t max_k_len,
                 size_t head_num, size_t head_dim,
                 size_t kv_head_num, size_t batch_size,
                 int window_size);

    void Execute(T* output, T* k_cache, T* v_cache,
                 const T* qkv, const T* qkv_bias,
                 const T* sinks,
                 float rope_base, int rope_dim);

    const float* qk() const;
    const T* pr() const;
    const T* mask() const;

private:
    cudaStream_t    stream_;
    cublasHandle_t  cublas_;
    // Internal buffers for mask, QK scores, probabilities, etc.
};

} // namespace turbomind
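To make the `rope_base` and `rope_dim` parameters concrete, the sketch below applies a rotary embedding to one head vector on the CPU. It is an illustration under assumptions, not the `invokeApplyRotaryEmbedding` kernel: `apply_rope` is a hypothetical name, and the pairing of consecutive elements (x[2i], x[2i+1]) is one common RoPE layout that may differ from the one used in the library.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Rotates the first rope_dim elements of x in place for token position pos.
// Assumed layout: consecutive pairs (x[i], x[i+1]) share one rotation angle.
inline void apply_rope(std::vector<float>& x, int pos, float rope_base, int rope_dim)
{
    assert(rope_dim > 0 && rope_dim % 2 == 0);
    assert(static_cast<std::size_t>(rope_dim) <= x.size());

    for (int i = 0; i < rope_dim; i += 2) {
        // Frequency for this pair: rope_base^(-i / rope_dim).
        const float inv_freq = std::pow(rope_base, -static_cast<float>(i) / static_cast<float>(rope_dim));
        const float theta    = static_cast<float>(pos) * inv_freq;
        const float c        = std::cos(theta);
        const float s        = std::sin(theta);

        // 2-D rotation of the pair by angle theta.
        const float x0 = x[i];
        const float x1 = x[i + 1];
        x[i]     = x0 * c - x1 * s;
        x[i + 1] = x0 * s + x1 * c;
    }
}
```

Because each pair is only rotated, the vector's norm is unchanged, which gives a cheap sanity check when validating a GPU implementation against a CPU one.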

Import

#include "src/turbomind/kernels/attention/reference.h"

I/O Contract

Inputs

Name Type Required Description
output T* Yes Output buffer for attention result
k_cache T* Yes Key cache (linear layout)
v_cache T* Yes Value cache (linear layout)
qkv const T* Yes Packed QKV input tensor
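qkv_bias const T* No Optional QKV bias
sinks const T* (not stated) Listed in the Execute signature; not described on this page
rope_base float Yes RoPE base frequency
rope_dim int Yes RoPE dimension
window_size int Yes Sliding window size (passed to Reshape, not Execute)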

Outputs

Name Type Description
output T* Computed attention output
qk() const float* QK attention scores (for debugging)
pr() const T* Softmax probabilities (for debugging)
mask() const T* Causal attention mask (for debugging)
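As an illustration of what a causal mask with a sliding window can look like, here is a small CPU sketch. The semantics are assumptions: `build_mask` is a hypothetical helper, a non-positive `window_size` is taken to mean "no window", queries are aligned to the end of the key sequence, and the mask is stored as 0/1 bytes. The encoding of the buffer returned by `mask()` (for example 0/1 versus 0/-inf in type `T`) is not stated on this page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds a [q_len, k_len] row-major mask: 1 = key visible to query, 0 = masked.
// Assumption: window_size <= 0 disables the sliding window.
inline std::vector<std::uint8_t> build_mask(std::size_t q_len, std::size_t k_len, int window_size)
{
    assert(q_len <= k_len);
    const long long offset = static_cast<long long>(k_len - q_len);

    std::vector<std::uint8_t> mask(q_len * k_len, 0);
    for (std::size_t i = 0; i < q_len; ++i) {
        const long long pos = static_cast<long long>(i) + offset;  // absolute position of query i
        for (std::size_t j = 0; j < k_len; ++j) {
            const long long key       = static_cast<long long>(j);
            const bool      causal    = key <= pos;
            const bool      in_window = window_size <= 0 || key > pos - window_size;
            mask[i * k_len + j]       = (causal && in_window) ? 1 : 0;
        }
    }
    return mask;
}
```

With `q_len = k_len = 3`, no window yields the lower-triangular pattern, while `window_size = 2` additionally hides the oldest key from the last query.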

Usage Examples

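// Reference is declared in namespace turbomind.
turbomind::Reference<half> ref(stream);
// Reshape sizes the internal buffers (mask, QK scores, probabilities) for these dimensions.
ref.Reshape(max_q_len, max_k_len, num_heads, head_dim, num_kv_heads, batch_size, window_size);
// Execute applies RoPE to K, then runs Q*K^T, masking + softmax, and P*V.
ref.Execute(output, k_cache, v_cache, qkv, qkv_bias, sinks, rope_base, rope_dim);
// Compare ref.qk() with the fused kernel's QK scores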
