Implementation:Vllm project Vllm CPU Types ARM
| Knowledge Sources | Details |
|---|---|
| Domains | CPU_Inference, SIMD, ARM |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Defines ARM NEON-based vector types and SIMD operations wrapping PyTorch's Vectorized<T> for portable high-performance computation on ARM CPUs.
Description
This header provides a comprehensive set of vectorized data types (FP32Vec8, FP32Vec16, FP16Vec8, FP16Vec16, BF16Vec8, BF16Vec16, INT8Vec16, INT32Vec16) built on top of ARM NEON intrinsics via PyTorch's at::vec::Vectorized abstraction. The NxVectorizedTVecReg template composes multiple NEON register-width vectors into wider logical vectors, supporting load/store, arithmetic, type conversions, and partial element operations. The implementation uses compile-time loop unrolling for optimal codegen.
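The composition pattern described above can be sketched in plain, portable C++. Here a fixed 4-lane array stands in for `at::vec::Vectorized<float>` (one NEON register width); the names `Vec4f` and `NxVec` are illustrative, not the header's actual identifiers:

```cpp
#include <array>
#include <cstring>

// Stand-in for at::vec::Vectorized<float>: one NEON-register-width
// (4-lane) chunk of floats.
struct Vec4f {
  std::array<float, 4> lane{};
  static constexpr int size() { return 4; }
};

// Composes N register-width vectors into one wider logical vector,
// mirroring the NxVectorizedTVecReg pattern: FP32Vec16 would be N = 4.
template <int N>
struct NxVec {
  Vec4f val[N];
  static constexpr int size() { return N * Vec4f::size(); }

  void load(const float* ptr) {
    for (int i = 0; i < N; ++i)  // trip count known at compile time,
      std::memcpy(val[i].lane.data(), ptr + i * Vec4f::size(),
                  Vec4f::size() * sizeof(float));  // so compilers unroll it
  }

  void save(float* ptr) const {
    for (int i = 0; i < N; ++i)
      std::memcpy(ptr + i * Vec4f::size(), val[i].lane.data(),
                  Vec4f::size() * sizeof(float));
  }
};
```

Because `N` is a template parameter, the per-register loop has a compile-time trip count, which is what enables the loop unrolling the description mentions.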
Usage
This header is conditionally included when compiling the vLLM CPU backend on ARM platforms (e.g., Apple Silicon, AWS Graviton). It provides the SIMD primitive types used by all CPU kernel implementations including attention, activation, and GEMM routines.
Code Reference
Source Location
- Repository: vllm
- File: csrc/cpu/cpu_types_arm.hpp
- Lines: 1-926
Signature
namespace vec_op {
template <int N, typename T>
struct NxVectorizedTVecReg {
using value_t = T;
using VectorizedT = Vectorized<T>;
VectorizedT val[N];
static constexpr int size() noexcept;
void save(void* ptr) const;
void load(const void* ptr);
void save(void* ptr, const int elem_num) const;
  void load(const void* ptr, const int elem_num);
};
// Type aliases defined via VectorizedRegWrapper
using FP32Vec8 = VectorizedRegWrapper<2, float>;
using FP32Vec16 = VectorizedRegWrapper<4, float>;
using FP16Vec8 = VectorizedRegWrapper<1, c10::Half>;
using FP16Vec16 = VectorizedRegWrapper<2, c10::Half>;
using BF16Vec8 = VectorizedRegWrapper<1, c10::BFloat16>;
using BF16Vec16 = VectorizedRegWrapper<2, c10::BFloat16>;
using INT8Vec16 = VectorizedRegWrapper<1, int8_t>;
using INT32Vec16 = VectorizedRegWrapper<4, int32_t>;
} // namespace vec_op
Import
#include "cpu/cpu_types_arm.hpp"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| ptr | const void* | Yes | Pointer to source data for vector load operations |
| elem_num | int | No | Number of elements for partial load/store operations |
| v | scalar type | No | Scalar value to broadcast into all vector lanes |
Outputs
| Name | Type | Description |
|---|---|---|
| Vector register | NxVectorizedTVecReg<N, T> | SIMD vector containing loaded/computed elements |
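The `elem_num` parameter exists so kernels can handle a ragged tail whose length is not a multiple of the vector width. A minimal sketch of that pattern (`scale16` and `scale_array` are hypothetical helpers, not the header's API; the lane-wise scalar loop stands in for the SIMD load/compute/store):

```cpp
// Why the elem_num overloads exist: a kernel walks an array in
// 16-wide chunks and uses a partial load/store for the ragged tail.
constexpr int kWidth = 16;  // logical lanes of FP32Vec16

// Stand-in for load -> multiply -> save over elem_num lanes.
static void scale16(float* buf, float s, int elem_num) {
  for (int i = 0; i < elem_num; ++i) buf[i] *= s;
}

void scale_array(float* data, int n, float s) {
  int i = 0;
  for (; i + kWidth <= n; i += kWidth)
    scale16(data + i, s, kWidth);   // full-width chunks
  if (i < n)
    scale16(data + i, s, n - i);    // partial chunk: elem_num = n - i
}
```

The partial overloads touch only the first `elem_num` elements of memory, so the tail iteration cannot read or write past the end of the buffer.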
Usage Examples
// Broadcast a scalar scale factor into all 16 lanes
vec_op::FP32Vec16 scale_vec(scale);
// Load 16 contiguous floats from memory into a vector
vec_op::FP32Vec16 vec(input_ptr);
// Element-wise multiply
vec_op::FP32Vec16 result = vec * scale_vec;
// Convert FP32 to BF16 via the converting constructor, then store
vec_op::BF16Vec16 bf16_result(result);
bf16_result.save(output_ptr);
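For reference, the per-lane FP32-to-BF16 conversion behind the `BF16Vec16(FP32Vec16)` constructor can be sketched as bit-level truncation to the top 16 bits of the IEEE-754 float, with round-to-nearest-even on the discarded mantissa bits (assuming `c10::BFloat16`'s default rounding; the helper name is illustrative, and NaN handling is omitted):

```cpp
#include <cstdint>
#include <cstring>

// Sketch: convert one FP32 lane to BF16 by keeping the high 16 bits,
// rounding to nearest even on the 16 discarded mantissa bits.
static uint16_t fp32_to_bf16(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));          // type-pun via memcpy
  uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u);  // nearest-even bias
  return static_cast<uint16_t>((bits + rounding) >> 16);
}
```

Since BF16 keeps the full 8-bit FP32 exponent, this conversion never overflows; it only loses mantissa precision.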