Implementation:Vllm project Vllm CPU Attn NEON

Knowledge Sources	vllm
Domains	Attention, CPU_Inference, SIMD, ARM
Last Updated	2026-02-08 00:00 GMT

Overview

Implements ARM NEON-optimized attention kernels built on float32x4_t SIMD vectors and FMA (fused multiply-add) micro-kernels for efficient attention on ARM CPUs.

Description

This header provides the gemm_micro_neon_fmla_Mx8_Ku4 template micro-kernel, which performs an Mx8 GEMM with K unrolled by 4 using the NEON FMLA instruction (through the vfmaq_laneq_f32 intrinsic). M ranges from 1 to 8, and the code for each value is generated at compile time via macro-based row expansion. The load_row8_B_as_f32 template functions load KV cache data stored as float, half, or BFloat16 into float32x4_t registers. The file also provides TileGemmNeonFMLA for tile-level GEMM orchestration and AttentionImpl<ISA::NEON, ...> specializations for the float and BFloat16 scalar types. When ARM BF16 hardware support is detected, the header conditionally includes the BFMMLA backend.
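The macro-based row expansion can be pictured with a small portable sketch. This is not the header's code: `FOR_EACH_ROW`, `ROW_FMA`, and `fma_rows` are hypothetical names, and plain loops stand in for the NEON intrinsics. Each macro expansion emits the statements for one row, guarded by a condition on the template parameter M that the compiler can resolve at compile time, so rows beyond M generate no work.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: expands STMT once per possible row index 0..7.
#define FOR_EACH_ROW(STMT) \
  STMT(0) STMT(1) STMT(2) STMT(3) STMT(4) STMT(5) STMT(6) STMT(7)

// One rank-1 update step: acc[r][0..7] += a[r] * b_row[0..7] for r < M.
// In a NEON kernel the inner loop would be two 4-lane FMAs per row.
template <int32_t M>
void fma_rows(const float* a, const float* b_row, float (&acc)[8][8]) {
  static_assert(M >= 1 && M <= 8, "M must be in [1, 8]");
  assert(a != nullptr && b_row != nullptr);
// M is a template constant, so `M > r` is known at compile time and
// an optimizing compiler can drop the bodies for rows >= M.
#define ROW_FMA(r)                                            \
  if (M > r) {                                                \
    for (int n = 0; n < 8; ++n) acc[r][n] += a[r] * b_row[n]; \
  }
  FOR_EACH_ROW(ROW_FMA)
#undef ROW_FMA
}
```

Instantiating `fma_rows<3>` touches only rows 0 to 2 of the accumulator; the remaining rows are left unchanged.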

Usage

This header is compiled on ARM platforms (e.g., AWS Graviton, Ampere Altra) for CPU-based inference. It is selected when the runtime ISA detection identifies NEON as the available instruction set and serves as the default ARM attention backend.

Code Reference

Source Location

Repository: vllm
File: csrc/cpu/cpu_attn_neon.hpp
Lines: 1-401

Signature

namespace cpu_attention {

// Load 8 elements from KV cache as float32 NEON vectors
template <typename kv_cache_t>
FORCE_INLINE void load_row8_B_as_f32(const kv_cache_t* p,
                                     float32x4_t& b0, float32x4_t& b1);

// Micro-kernel: Mx8 GEMM with K unrolled by 4 using NEON FMLA
template <int32_t M, typename kv_cache_t>
FORCE_INLINE void gemm_micro_neon_fmla_Mx8_Ku4(
    const float* __restrict A,       // [M x K]
    const kv_cache_t* __restrict B,  // [K x 8]
    float* __restrict C,             // [M x 8]
    int64_t lda, int64_t ldb, int64_t ldc,
    int32_t K, bool accumulate);

// Tile GEMM wrapper for attention phases
template <typename kv_cache_t>
class TileGemmNeonFMLA { ... };

// NEON attention implementations
template <int64_t head_dim>
class AttentionImpl<ISA::NEON, float, head_dim> { ... };
template <int64_t head_dim>
class AttentionImpl<ISA::NEON, c10::BFloat16, head_dim> { ... };

} // namespace cpu_attention
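
For the BFloat16 case, widening to float32 is a pure bit operation, because a bfloat16 value is the upper 16 bits of the corresponding IEEE 754 binary32 value. A scalar sketch of that conversion for one row of 8 elements follows. `bf16_bits_to_f32` and `load_row8_bf16_ref` are hypothetical names, raw `uint16_t` bit patterns stand in for `c10::BFloat16`, and the two `float32x4_t` outputs of load_row8_B_as_f32 are modelled as a plain array of 8 floats. How the header performs this step with NEON is not shown on this page.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: reinterpret 16 bfloat16 bits as the high half
// of a binary32 value (the low 16 mantissa bits become zero).
inline float bf16_bits_to_f32(uint16_t bits) {
  const uint32_t widened = static_cast<uint32_t>(bits) << 16;
  float f;
  std::memcpy(&f, &widened, sizeof(f));  // bit cast without aliasing issues
  return f;
}

// Hypothetical scalar stand-in for loading one row of 8 B elements as f32.
inline void load_row8_bf16_ref(const uint16_t* p, float out[8]) {
  assert(p != nullptr && out != nullptr);
  for (int i = 0; i < 8; ++i) out[i] = bf16_bits_to_f32(p[i]);
}
```

Because the conversion is exact, a BFloat16 KV cache loses no further precision when it is widened; all arithmetic in the micro-kernel then happens in float32.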

Import

#include "cpu_attn_neon.hpp"

I/O Contract

Inputs

Name	Type	Required	Description
A	`const float*`	Yes	Pointer to A matrix (Q heads for QK phase, softmax scores for PV phase); row-major [M x K]
B	`const kv_cache_t*`	Yes	Pointer to B matrix (K cache for QK, V cache for PV); supports float, c10::Half, and c10::BFloat16
C	`float*`	Yes	Pointer to output matrix; row-major [M x 8]
M	`int32_t` (template)	Yes	Number of rows (1 to 8); determined at compile time
K	`int32_t`	Yes	Reduction dimension (head_dim for QK, seq_len tile for PV)
lda, ldb, ldc	`int64_t`	Yes	Leading dimensions of A, B, and C
accumulate	`bool`	Yes	If true, add the GEMM result to the existing contents of C; if false, start from zero and overwrite C

Outputs

Name	Type	Description
C	`float*`	Updated output matrix with GEMM results (attention scores or weighted values)
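
Read together, the inputs and output describe the operation C = A x B (or C += A x B when accumulate is true) on an M x 8 tile. A portable scalar reference of that contract, which could be used to check a vectorized kernel, might look as follows. `gemm_ref_Mx8` is a hypothetical name, B is taken as float only, and the 4-way K unrolling with a scalar tail is an illustrative choice; whether the real kernel requires K to be a multiple of 4 is not stated on this page.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar reference for the M x 8 micro-kernel contract.
template <int32_t M>
void gemm_ref_Mx8(const float* A, const float* B, float* C,
                  int64_t lda, int64_t ldb, int64_t ldc,
                  int32_t K, bool accumulate) {
  static_assert(M >= 1 && M <= 8, "M must be in [1, 8]");
  assert(K >= 0);

  // Accumulators start from C (accumulate) or from zero (overwrite).
  float acc[M][8];
  for (int32_t m = 0; m < M; ++m)
    for (int32_t n = 0; n < 8; ++n)
      acc[m][n] = accumulate ? C[m * ldc + n] : 0.0f;

  // Main loop: consume K in groups of 4, mirroring the "Ku4" unrolling.
  int32_t k = 0;
  for (; k + 4 <= K; k += 4) {
    for (int32_t u = 0; u < 4; ++u) {
      const float* b_row = B + (k + u) * ldb;
      for (int32_t m = 0; m < M; ++m) {
        const float a = A[m * lda + k + u];
        for (int32_t n = 0; n < 8; ++n) acc[m][n] += a * b_row[n];
      }
    }
  }
  // Tail: remaining K % 4 steps, one at a time.
  for (; k < K; ++k) {
    const float* b_row = B + k * ldb;
    for (int32_t m = 0; m < M; ++m) {
      const float a = A[m * lda + k];
      for (int32_t n = 0; n < 8; ++n) acc[m][n] += a * b_row[n];
    }
  }

  for (int32_t m = 0; m < M; ++m)
    for (int32_t n = 0; n < 8; ++n) C[m * ldc + n] = acc[m][n];
}
```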

Usage Examples

#include "cpu_attn_neon.hpp"

using namespace cpu_attention;  // the kernels are declared in this namespace

// Perform 4x8 QK GEMM with BFloat16 KV cache
gemm_micro_neon_fmla_Mx8_Ku4<4, c10::BFloat16>(
    q_heads,       // [4 x head_dim] float
    k_cache_block, // [head_dim x 8] BFloat16
    logits_out,    // [4 x 8] float
    head_dim,      // lda
    8,             // ldb
    8,             // ldc
    head_dim,      // K
    false          // zero-init, don't accumulate
);

// Perform 4x8 PV GEMM accumulating over sequence blocks
gemm_micro_neon_fmla_Mx8_Ku4<4, c10::BFloat16>(
    softmax_scores, // [4 x seq_tile] float
    v_cache_block,  // [seq_tile x 8] BFloat16
    output,         // [4 x 8] float tile within rows of head_dim floats
    seq_tile,       // lda
    8,              // ldb
    head_dim,       // ldc: consecutive output rows are head_dim apart
    seq_tile,       // K
    true            // accumulate
);
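
The PV call writes a 4 x 8 tile into an output whose rows are head_dim floats apart, which is why ldc is head_dim rather than 8. A minimal sketch of how a caller might sweep such tiles across the full head dimension is given below. It rests on several assumptions: `micro_ref` and `pv_tiles_ref` are hypothetical scalar stand-ins, V is float and packed as consecutive [seq_tile x 8] panels (matching ldb = 8 in the example), and head_dim is a multiple of 8. The real cache layout and the internals of TileGemmNeonFMLA are not shown on this page.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar stand-in for the M x 8 micro-kernel (float B only).
template <int32_t M>
void micro_ref(const float* A, const float* B, float* C,
               int64_t lda, int64_t ldb, int64_t ldc,
               int32_t K, bool accumulate) {
  for (int32_t m = 0; m < M; ++m) {
    for (int32_t n = 0; n < 8; ++n) {
      float acc = accumulate ? C[m * ldc + n] : 0.0f;
      for (int32_t k = 0; k < K; ++k) acc += A[m * lda + k] * B[k * ldb + n];
      C[m * ldc + n] = acc;
    }
  }
}

// Sweep 8-column tiles across head_dim for the PV phase.
// P:        [M x seq_tile] softmax scores, row-major
// V_panels: head_dim / 8 panels, each [seq_tile x 8] (assumed packing)
// out:      [M x head_dim], row-major
template <int32_t M>
void pv_tiles_ref(const float* P, const float* V_panels, float* out,
                  int32_t seq_tile, int32_t head_dim, bool accumulate) {
  assert(head_dim % 8 == 0);
  for (int32_t n0 = 0; n0 < head_dim; n0 += 8) {
    const float* panel =
        V_panels + static_cast<int64_t>(n0 / 8) * seq_tile * 8;
    // out + n0 selects the tile's first column; ldc = head_dim steps a row.
    micro_ref<M>(P, panel, out + n0, seq_tile, 8, head_dim, seq_tile,
                 accumulate);
  }
}
```

Calling `pv_tiles_ref` with accumulate set to false for the first sequence block and true for later blocks mirrors the accumulation over sequence blocks in the example above.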

Related Pages

Environment:Vllm_project_Vllm_CUDA_GPU_Runtime
