Implementation:Vllm project Vllm CPU Types ARM
| Knowledge Sources | Details |
|---|---|
| Domains | CPU_Inference, SIMD, ARM |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Defines ARM NEON-based vector types and SIMD operations wrapping PyTorch's Vectorized<T> for portable high-performance computation on ARM CPUs.
Description
This header provides a comprehensive set of vectorized data types (FP32Vec8, FP32Vec16, FP16Vec8, FP16Vec16, BF16Vec8, BF16Vec16, INT8Vec16, INT32Vec16) built on top of ARM NEON intrinsics via PyTorch's at::vec::Vectorized abstraction. The NxVectorizedTVecReg template composes multiple NEON register-width vectors into wider logical vectors, supporting load/store, arithmetic, type conversions, and partial element operations. The implementation uses compile-time loop unrolling for optimal codegen.
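The composition pattern described above can be sketched in plain, portable C++. Here a fixed 4-lane array stands in for `at::vec::Vectorized<float>` (one NEON register width); the names `Vec4f` and `NxVec` are illustrative, not the header's actual identifiers:

```cpp
#include <array>
#include <cstring>

// Stand-in for at::vec::Vectorized<float>: one NEON-register-width
// (4-lane) chunk of floats.
struct Vec4f {
  std::array<float, 4> lane{};
  static constexpr int size() { return 4; }
};

// Composes N register-width vectors into one wider logical vector,
// mirroring the NxVectorizedTVecReg pattern: FP32Vec16 would be N = 4.
template <int N>
struct NxVec {
  Vec4f val[N];
  static constexpr int size() { return N * Vec4f::size(); }

  void load(const float* ptr) {
    for (int i = 0; i < N; ++i)  // trip count known at compile time,
      std::memcpy(val[i].lane.data(), ptr + i * Vec4f::size(),
                  Vec4f::size() * sizeof(float));  // so compilers unroll it
  }

  void save(float* ptr) const {
    for (int i = 0; i < N; ++i)
      std::memcpy(ptr + i * Vec4f::size(), val[i].lane.data(),
                  Vec4f::size() * sizeof(float));
  }
};
```

Because `N` is a template parameter, the per-register loop has a compile-time trip count, which is what enables the loop unrolling the description mentions.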
Usage
This header is conditionally included when compiling the vLLM CPU backend on ARM platforms (e.g., Apple Silicon, AWS Graviton). It provides the SIMD primitive types used by all CPU kernel implementations including attention, activation, and GEMM routines.
Code Reference
Source Location
- Repository: vllm
- File: csrc/cpu/cpu_types_arm.hpp
- Lines: 1-926
Signature
namespace vec_op {
template <int N, typename T>
struct NxVectorizedTVecReg {
using value_t = T;
using VectorizedT = Vectorized<T>;
VectorizedT val[N];
static constexpr int size() noexcept;
void save(void* ptr) const;
void load(const void* ptr);
void save(void* ptr, const int elem_num) const;
  void load(const void* ptr, const int elem_num);
};
// Type aliases defined via VectorizedRegWrapper
using FP32Vec8 = VectorizedRegWrapper<2, float>;
using FP32Vec16 = VectorizedRegWrapper<4, float>;
using FP16Vec8 = VectorizedRegWrapper<1, c10::Half>;
using FP16Vec16 = VectorizedRegWrapper<2, c10::Half>;
using BF16Vec8 = VectorizedRegWrapper<1, c10::BFloat16>;
using BF16Vec16 = VectorizedRegWrapper<2, c10::BFloat16>;
using INT8Vec16 = VectorizedRegWrapper<1, int8_t>;
using INT32Vec16 = VectorizedRegWrapper<4, int32_t>;
} // namespace vec_op
Import
#include "cpu/cpu_types_arm.hpp"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| ptr | const void* | Yes | Pointer to source data for vector load operations |
| elem_num | int | No | Number of elements for partial load/store operations |
| v | scalar type | No | Scalar value to broadcast into all vector lanes |
Outputs
| Name | Type | Description |
|---|---|---|
| Vector register | NxVectorizedTVecReg<N, T> | SIMD vector containing loaded/computed elements |
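The `elem_num` parameter exists so kernels can handle a ragged tail whose length is not a multiple of the vector width. A minimal sketch of that pattern (`scale16` and `scale_array` are hypothetical helpers, not the header's API; the lane-wise scalar loop stands in for the SIMD load/compute/store):

```cpp
// Why the elem_num overloads exist: a kernel walks an array in
// 16-wide chunks and uses a partial load/store for the ragged tail.
constexpr int kWidth = 16;  // logical lanes of FP32Vec16

// Stand-in for load -> multiply -> save over elem_num lanes.
static void scale16(float* buf, float s, int elem_num) {
  for (int i = 0; i < elem_num; ++i) buf[i] *= s;
}

void scale_array(float* data, int n, float s) {
  int i = 0;
  for (; i + kWidth <= n; i += kWidth)
    scale16(data + i, s, kWidth);   // full-width chunks
  if (i < n)
    scale16(data + i, s, n - i);    // partial chunk: elem_num = n - i
}
```

The partial overloads touch only the first `elem_num` elements of memory, so the tail iteration cannot read or write past the end of the buffer.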
Usage Examples
// Broadcast a scalar scale factor into all 16 lanes
vec_op::FP32Vec16 scale_vec(scale);
// Load 16 contiguous floats from memory into a vector
vec_op::FP32Vec16 vec(input_ptr);
// Element-wise multiply
vec_op::FP32Vec16 result = vec * scale_vec;
// Convert FP32 to BF16 via the converting constructor, then store
vec_op::BF16Vec16 bf16_result(result);
bf16_result.save(output_ptr);
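For reference, the per-lane FP32-to-BF16 conversion behind the `BF16Vec16(FP32Vec16)` constructor can be sketched as bit-level truncation to the top 16 bits of the IEEE-754 float, with round-to-nearest-even on the discarded mantissa bits (assuming `c10::BFloat16`'s default rounding; the helper name is illustrative, and NaN handling is omitted):

```cpp
#include <cstdint>
#include <cstring>

// Sketch: convert one FP32 lane to BF16 by keeping the high 16 bits,
// rounding to nearest even on the 16 discarded mantissa bits.
static uint16_t fp32_to_bf16(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));          // type-pun via memcpy
  uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u);  // nearest-even bias
  return static_cast<uint16_t>((bits + rounding) >> 16);
}
```

Since BF16 keeps the full 8-bit FP32 exponent, this conversion never overflows; it only loses mantissa precision.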