Implementation:Vllm project Vllm CPU Attn NEON
| Knowledge Sources | |
|---|---|
| Domains | Attention, CPU_Inference, SIMD, ARM |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Implements ARM NEON-optimized attention kernels built on float32x4_t vectors and FMA (fused multiply-add) micro-kernels for efficient attention on ARM CPUs.
Description
This header provides the gemm_micro_neon_fmla_Mx8_Ku4 template micro-kernel, which performs an Mx8 GEMM with K unrolled by 4 using NEON FMLA instructions (emitted through the vfmaq_laneq_f32 intrinsic). It supports M from 1 to 8, with the per-row code generated at compile time through macro-based row expansion. The load_row8_B_as_f32 template functions load one row of 8 KV cache elements stored as float, half, or BFloat16 into two float32x4_t registers. The file also provides TileGemmNeonFMLA for tile-level GEMM orchestration and AttentionImpl<ISA::NEON, ...> specializations for both float and BFloat16 scalar types. When ARM BF16 hardware support is detected, it conditionally includes the BFMMLA backend.
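The BFloat16 load path can be understood from the fact that a bfloat16 value holds the upper 16 bits of an IEEE-754 float32, so widening amounts to a 16-bit left shift. The portable sketch below illustrates this for one row of 8 values using raw uint16_t bit patterns. The names bf16_bits_to_f32 and load_row8_bf16_as_f32_ref are hypothetical, and the scalar loop is only a stand-in: the real header works on c10 types with NEON intrinsics and may be structured differently.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: widen one bfloat16 bit pattern to float32.
// bfloat16 is the top half of a float32, so shift it into the high 16 bits.
inline float bf16_bits_to_f32(std::uint16_t bits) {
  std::uint32_t wide = static_cast<std::uint32_t>(bits) << 16;
  float out;
  std::memcpy(&out, &wide, sizeof(out));  // bit-cast without aliasing issues
  return out;
}

// Hypothetical scalar stand-in for loading one row of 8 B elements as float32.
// A NEON version would return the same values in two float32x4_t registers.
inline std::array<float, 8> load_row8_bf16_as_f32_ref(const std::uint16_t* p) {
  assert(p != nullptr);
  std::array<float, 8> row{};
  for (int j = 0; j < 8; ++j) {
    row[j] = bf16_bits_to_f32(p[j]);
  }
  return row;
}
```

Because the conversion is a pure bit shift, it is exact: every bfloat16 value is representable in float32, which is why the micro-kernel can keep all arithmetic in float32.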
Usage
This header is compiled on ARM platforms (e.g., AWS Graviton, Ampere Altra) for CPU-based inference. It is selected when the runtime ISA detection identifies NEON as the available instruction set and serves as the default ARM attention backend.
Code Reference
Source Location
- Repository: vllm
- File: csrc/cpu/cpu_attn_neon.hpp
- Lines: 1-401
Signature
namespace cpu_attention {
// Load 8 elements from KV cache as float32 NEON vectors
template <typename kv_cache_t>
FORCE_INLINE void load_row8_B_as_f32(const kv_cache_t* p,
float32x4_t& b0, float32x4_t& b1);
// Micro-kernel: Mx8 GEMM with K unrolled by 4 using NEON FMLA
template <int32_t M, typename kv_cache_t>
FORCE_INLINE void gemm_micro_neon_fmla_Mx8_Ku4(
const float* __restrict A, // [M x K]
const kv_cache_t* __restrict B, // [K x 8]
float* __restrict C, // [M x 8]
int64_t lda, int64_t ldb, int64_t ldc,
int32_t K, bool accumulate);
// Tile GEMM wrapper for attention phases
template <typename kv_cache_t>
class TileGemmNeonFMLA { ... };
// NEON attention implementations
template <int64_t head_dim>
class AttentionImpl<ISA::NEON, float, head_dim> { ... };
template <int64_t head_dim>
class AttentionImpl<ISA::NEON, c10::BFloat16, head_dim> { ... };
} // namespace cpu_attention
Import
#include "cpu_attn_neon.hpp"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| A | const float* | Yes | Pointer to A matrix (Q heads for QK phase, softmax scores for PV phase); row-major [M x K] |
| B | const kv_cache_t* | Yes | Pointer to B matrix (K cache for QK, V cache for PV); supports float, c10::Half, and c10::BFloat16 |
| C | float* | Yes | Pointer to output matrix; row-major [M x 8] |
| M | int32_t (template) | Yes | Number of rows (1 to 8); determined at compile time |
| K | int32_t | Yes | Reduction dimension (head_dim for QK, seq_len tile for PV) |
| lda, ldb, ldc | int64_t | Yes | Leading dimensions of A, B, and C |
| accumulate | bool | Yes | Whether to accumulate into C or zero-initialize before computation |
Outputs
| Name | Type | Description |
|---|---|---|
| C | float* | Updated output matrix with GEMM results (attention scores or weighted values) |
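The contract in the tables above can be summarized by a plain scalar reference. The sketch below is not the NEON kernel: gemm_ref_Mx8 is a hypothetical name, B is taken as float for simplicity, and B is assumed to be row-major [K x 8] with stride ldb, which matches the shape comments in the signature but is not confirmed for the real packed cache layout.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar reference for the Mx8 micro-kernel contract.
// Computes C[m][n] = (accumulate ? C[m][n] : 0) + sum_k A[m][k] * B[k][n].
template <int32_t M>
void gemm_ref_Mx8(const float* A, const float* B, float* C,
                  int64_t lda, int64_t ldb, int64_t ldc,
                  int32_t K, bool accumulate) {
  static_assert(M >= 1 && M <= 8, "M must be in [1, 8]");
  assert(K >= 0);
  for (int32_t m = 0; m < M; ++m) {
    for (int32_t n = 0; n < 8; ++n) {
      // Start from the existing value only when accumulating.
      float acc = accumulate ? C[m * ldc + n] : 0.0f;
      for (int32_t k = 0; k < K; ++k) {
        acc += A[m * lda + k] * B[k * ldb + n];
      }
      C[m * ldc + n] = acc;
    }
  }
}
```

A reference like this is mainly useful as a test oracle: an optimized kernel can be compared against it within a floating-point tolerance, since FMA and a different summation order may change the rounding of the last bits.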
Usage Examples
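#include "cpu_attn_neon.hpp"
// The kernel lives in namespace cpu_attention (see Signature)
using cpu_attention::gemm_micro_neon_fmla_Mx8_Ku4;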
// Perform 4x8 QK GEMM with BFloat16 KV cache
gemm_micro_neon_fmla_Mx8_Ku4<4, c10::BFloat16>(
q_heads, // [4 x head_dim] float
k_cache_block, // [head_dim x 8] BFloat16
logits_out, // [4 x 8] float
head_dim, // lda
8, // ldb
8, // ldc
head_dim, // K
false // zero-init, don't accumulate
);
// Perform 4x8 PV GEMM accumulating over sequence blocks
gemm_micro_neon_fmla_Mx8_Ku4<4, c10::BFloat16>(
softmax_scores, // [4 x seq_tile] float
v_cache_block, // [seq_tile x 8] BFloat16
output, // [4 x 8] float
seq_tile, // lda
8, // ldb
head_dim, // ldc
seq_tile, // K
true // accumulate
);
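The PV call above depends on the accumulate flag to add each sequence tile's contribution into the same output rows. The scalar sketch below illustrates that pattern together with a K loop unrolled by 4. Both mx8_ref and pv_over_tiles are hypothetical names, M is a runtime argument here for brevity, and the remainder loop for K values that are not a multiple of 4 is an assumption of this sketch rather than a documented property of the real kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical scalar stand-in for one Mx8 kernel call.
inline void mx8_ref(const float* A, const float* B, float* C, int M,
                    int64_t lda, int64_t ldb, int64_t ldc, int32_t K,
                    bool accumulate) {
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < 8; ++n) {
      float acc = accumulate ? C[m * ldc + n] : 0.0f;
      int32_t k = 0;
      for (; k + 4 <= K; k += 4) {  // main loop: K unrolled by 4
        acc += A[m * lda + k + 0] * B[(k + 0) * ldb + n];
        acc += A[m * lda + k + 1] * B[(k + 1) * ldb + n];
        acc += A[m * lda + k + 2] * B[(k + 2) * ldb + n];
        acc += A[m * lda + k + 3] * B[(k + 3) * ldb + n];
      }
      for (; k < K; ++k) {  // remainder (assumed, see lead-in)
        acc += A[m * lda + k] * B[k * ldb + n];
      }
      C[m * ldc + n] = acc;
    }
  }
}

// Hypothetical driver: P is [M x seq_len], V is [seq_len x 8], O is [M x 8].
// The first tile overwrites O; every later tile accumulates into it.
inline void pv_over_tiles(const float* P, const float* V, float* O, int M,
                          int64_t seq_len, int64_t tile) {
  assert(tile > 0);
  for (int64_t s = 0; s < seq_len; s += tile) {
    const int32_t k = static_cast<int32_t>(std::min(tile, seq_len - s));
    mx8_ref(P + s, V + s * 8, O, M, seq_len, 8, 8, k, /*accumulate=*/s != 0);
  }
}
```

Passing accumulate=false only for the first tile avoids a separate pass to zero the output buffer, which is one plausible reason for exposing the flag on the kernel itself.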