Implementation:Vllm project Vllm CPU Attn NEON

From Leeroopedia


Knowledge Sources
Domains: Attention, CPU_Inference, SIMD, ARM
Last Updated: 2026-02-08 00:00 GMT

Overview

Implements ARM NEON-optimized attention kernels that operate on float32x4_t SIMD vectors, built around FMA (fused multiply-add) micro-kernels for efficient attention on ARM CPUs.

Description

This header provides the gemm_micro_neon_fmla_Mx8_Ku4 template micro-kernel, which computes an M x 8 GEMM tile with the K loop unrolled by 4, using NEON FMLA instructions emitted through the vfmaq_laneq_f32 intrinsic. M ranges from 1 to 8 and is a template parameter; the per-row code is generated at compile time by macro-based row expansion. The load_row8_B_as_f32 template functions load eight KV cache elements stored as float, half, or BFloat16 and widen them into two float32x4_t registers. The file also provides TileGemmNeonFMLA for tile-level GEMM orchestration and AttentionImpl<ISA::NEON, ...> specializations for the float and BFloat16 scalar types. When ARM BF16 hardware support is detected, the header conditionally includes the BFMMLA backend.
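To make the micro-kernel structure concrete, here is a minimal sketch of an M x 8 tile update with K unrolled by 4. It is not the header's code: the name sketch_gemm_Mx8_Ku4 is hypothetical, B is restricted to float, K is assumed to be a multiple of 4, and a plain loop over rows stands in for the macro-based row expansion. On non-AArch64 builds it falls back to a scalar loop that computes the same result.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#define SKETCH_HAS_NEON 1
#else
#define SKETCH_HAS_NEON 0
#endif

// Hypothetical stand-in (float B only): C[M x 8] = A[M x K] * B[K x 8],
// or C += A * B when accumulate is true. Assumes K % 4 == 0.
template <int M>
void sketch_gemm_Mx8_Ku4(const float* A, const float* B, float* C,
                         int64_t lda, int64_t ldb, int64_t ldc,
                         int32_t K, bool accumulate) {
  static_assert(M >= 1 && M <= 8, "M must be in [1, 8]");
  assert(K % 4 == 0);
#if SKETCH_HAS_NEON
  // Two 4-lane accumulators per row cover the 8 output columns.
  float32x4_t acc0[M], acc1[M];
  for (int m = 0; m < M; ++m) {
    acc0[m] = accumulate ? vld1q_f32(C + m * ldc) : vdupq_n_f32(0.0f);
    acc1[m] = accumulate ? vld1q_f32(C + m * ldc + 4) : vdupq_n_f32(0.0f);
  }
  for (int32_t k = 0; k < K; k += 4) {
    // Load four consecutive rows of B, 8 floats each.
    float32x4_t b0[4], b1[4];
    for (int u = 0; u < 4; ++u) {
      b0[u] = vld1q_f32(B + (k + u) * ldb);
      b1[u] = vld1q_f32(B + (k + u) * ldb + 4);
    }
    for (int m = 0; m < M; ++m) {
      // a holds A[m][k..k+3]; each lane scales one row of B.
      const float32x4_t a = vld1q_f32(A + m * lda + k);
      acc0[m] = vfmaq_laneq_f32(acc0[m], b0[0], a, 0);
      acc1[m] = vfmaq_laneq_f32(acc1[m], b1[0], a, 0);
      acc0[m] = vfmaq_laneq_f32(acc0[m], b0[1], a, 1);
      acc1[m] = vfmaq_laneq_f32(acc1[m], b1[1], a, 1);
      acc0[m] = vfmaq_laneq_f32(acc0[m], b0[2], a, 2);
      acc1[m] = vfmaq_laneq_f32(acc1[m], b1[2], a, 2);
      acc0[m] = vfmaq_laneq_f32(acc0[m], b0[3], a, 3);
      acc1[m] = vfmaq_laneq_f32(acc1[m], b1[3], a, 3);
    }
  }
  for (int m = 0; m < M; ++m) {
    vst1q_f32(C + m * ldc, acc0[m]);
    vst1q_f32(C + m * ldc + 4, acc1[m]);
  }
#else
  // Portable fallback with identical semantics.
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < 8; ++n) {
      float acc = accumulate ? C[m * ldc + n] : 0.0f;
      for (int32_t k = 0; k < K; ++k) acc += A[m * lda + k] * B[k * ldb + n];
      C[m * ldc + n] = acc;
    }
  }
#endif
}
```

Keeping all M x 8 partial sums in registers for the whole K loop means C is read at most once and written once per call, which is the usual motivation for this kind of register-blocked micro-kernel.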

Usage

This header is compiled on ARM platforms (e.g., AWS Graviton, Ampere Altra) for CPU-based inference. It is selected when the runtime ISA detection identifies NEON as the available instruction set and serves as the default ARM attention backend.
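As a rough illustration of how a header can gate code paths on the target architecture, the sketch below uses the standard ACLE predefined macros __ARM_NEON and __ARM_FEATURE_BF16. This is compile-time gating only; it does not show the runtime ISA detection described above, and the helper name sketch_compiled_backend is hypothetical rather than part of the header.

```cpp
#include <string>

// Hypothetical helper: reports which path this translation unit was built for.
inline std::string sketch_compiled_backend() {
#if defined(__aarch64__) && defined(__ARM_FEATURE_BF16)
  return "neon+bf16";  // BF16 instructions are available to the compiler
#elif defined(__aarch64__) && defined(__ARM_NEON)
  return "neon";       // plain NEON FMLA path
#else
  return "scalar";     // not an AArch64 NEON build
#endif
}
```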

Code Reference

Source Location

Signature

namespace cpu_attention {

// Load 8 elements from KV cache as float32 NEON vectors
template <typename kv_cache_t>
FORCE_INLINE void load_row8_B_as_f32(const kv_cache_t* p,
                                     float32x4_t& b0, float32x4_t& b1);

// Micro-kernel: Mx8 GEMM with K unrolled by 4 using NEON FMLA
template <int32_t M, typename kv_cache_t>
FORCE_INLINE void gemm_micro_neon_fmla_Mx8_Ku4(
    const float* __restrict A,       // [M x K]
    const kv_cache_t* __restrict B,  // [K x 8]
    float* __restrict C,             // [M x 8]
    int64_t lda, int64_t ldb, int64_t ldc,
    int32_t K, bool accumulate);

// Tile GEMM wrapper for attention phases
template <typename kv_cache_t>
class TileGemmNeonFMLA { ... };

// NEON attention implementations
template <int64_t head_dim>
class AttentionImpl<ISA::NEON, float, head_dim> { ... };
template <int64_t head_dim>
class AttentionImpl<ISA::NEON, c10::BFloat16, head_dim> { ... };

} // namespace cpu_attention

Import

#include "cpu_attn_neon.hpp"

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| A | const float* | Yes | Pointer to A matrix (Q heads for QK phase, softmax scores for PV phase); row-major [M x K] |
| B | const kv_cache_t* | Yes | Pointer to B matrix (K cache for QK, V cache for PV); supports float, c10::Half, and c10::BFloat16 |
| C | float* | Yes | Pointer to output matrix; row-major [M x 8] |
| M | int32_t (template) | Yes | Number of rows (1 to 8); determined at compile time |
| K | int32_t | Yes | Reduction dimension (head_dim for QK, seq_len tile for PV) |
| lda, ldb, ldc | int64_t | Yes | Leading dimensions of A, B, and C |
| accumulate | bool | Yes | Whether to accumulate into C or zero-initialize before computation |
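Because B may hold BFloat16 data while the arithmetic runs in float32, each row of B has to be widened on load. The sketch below is a scalar model of that step, not the header's load_row8_B_as_f32: it takes raw 16-bit patterns instead of c10::BFloat16 and writes to a plain array instead of two float32x4_t registers, and both function names are hypothetical. It relies only on the fact that a BFloat16 value is the upper 16 bits of an IEEE-754 float32.

```cpp
#include <cstdint>
#include <cstring>

// Widen one BFloat16 bit pattern to float32 by shifting it into the high half.
inline float sketch_bf16_bits_to_f32(uint16_t bits) {
  const uint32_t wide = static_cast<uint32_t>(bits) << 16;
  float out;
  std::memcpy(&out, &wide, sizeof(out));  // bit cast without aliasing issues
  return out;
}

// Hypothetical scalar analogue of loading one 8-element row of B as float32.
inline void sketch_load_row8_bf16_as_f32(const uint16_t* p, float out[8]) {
  for (int i = 0; i < 8; ++i) out[i] = sketch_bf16_bits_to_f32(p[i]);
}
```

The widening is exact, since every BFloat16 value is representable in float32, so the only precision loss is whatever occurred when the cache was written.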

Outputs

| Name | Type | Description |
|---|---|---|
| C | float* | Updated output matrix with GEMM results (attention scores or weighted values) |

Usage Examples

#include "cpu_attn_neon.hpp"

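// Perform a 4x8 QK GEMM with a BFloat16 KV cache.
// The kernel lives in namespace cpu_attention, so the call is qualified.
cpu_attention::gemm_micro_neon_fmla_Mx8_Ku4<4, c10::BFloat16>(
    q_heads,       // A: [4 x head_dim] float
    k_cache_block, // B: [head_dim x 8] BFloat16
    logits_out,    // C: [4 x 8] float
    head_dim,      // lda: row stride of A
    8,             // ldb: row stride of B
    8,             // ldc: row stride of C
    head_dim,      // K: reduction dimension
    false          // zero-init C, don't accumulate
);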

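// Perform a 4x8 PV GEMM, accumulating over sequence blocks.
cpu_attention::gemm_micro_neon_fmla_Mx8_Ku4<4, c10::BFloat16>(
    softmax_scores, // A: [4 x seq_tile] float
    v_cache_block,  // B: [seq_tile x 8] BFloat16
    output,         // C: [4 x 8] float tile inside rows of width head_dim
    seq_tile,       // lda: row stride of A
    8,              // ldb: row stride of B
    head_dim,       // ldc: row stride of C (the full output row width)
    seq_tile,       // K: reduction dimension
    true            // accumulate into the existing contents of C
);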
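The PV call above writes an 8-column tile into a wider output row (ldc = head_dim) and adds to what is already there. The following sketch shows how such calls might be composed over sequence tiles and column tiles. It rests on several assumptions: sketch_ref_Mx8 is a hypothetical scalar stand-in with the same argument order as the micro-kernel, V is a plain row-major [seq_len x head_dim] float array (so ldb = head_dim, unlike the packed block with ldb = 8 above), and any softmax rescaling between tiles is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical scalar stand-in for the Mx8 micro-kernel (float B only).
template <int M>
void sketch_ref_Mx8(const float* A, const float* B, float* C,
                    int64_t lda, int64_t ldb, int64_t ldc,
                    int32_t K, bool accumulate) {
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < 8; ++n) {
      float acc = accumulate ? C[m * ldc + n] : 0.0f;
      for (int32_t k = 0; k < K; ++k) acc += A[m * lda + k] * B[k * ldb + n];
      C[m * ldc + n] = acc;
    }
  }
}

// out[M x head_dim] = P[M x seq_len] * V[seq_len x head_dim], computed one
// sequence tile and one 8-column output tile at a time.
template <int M>
void sketch_pv_tiled(const float* P, const float* V, float* out,
                     int32_t seq_len, int32_t seq_tile, int32_t head_dim) {
  assert(seq_len % seq_tile == 0 && head_dim % 8 == 0);
  for (int32_t s = 0; s < seq_len; s += seq_tile) {
    for (int32_t n = 0; n < head_dim; n += 8) {
      sketch_ref_Mx8<M>(P + s,                                    // A tile
                        V + static_cast<int64_t>(s) * head_dim + n,  // B tile
                        out + n,                                  // C tile
                        seq_len,   // lda: full row width of P
                        head_dim,  // ldb: full row width of V
                        head_dim,  // ldc: full row width of out
                        seq_tile,  // K: rows of V in this tile
                        s != 0);   // zero-init on the first tile only
    }
  }
}
```

Passing accumulate = false for the first sequence tile avoids a separate pass to clear the output buffer.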
