
Implementation:Vllm project Vllm CPU Types ARM

From Leeroopedia


Knowledge Sources
Domains CPU_Inference, SIMD, ARM
Last Updated 2026-02-08 00:00 GMT

Overview

Defines ARM NEON-based vector types and SIMD operations wrapping PyTorch's Vectorized<T> for portable high-performance computation on ARM CPUs.

Description

This header provides a comprehensive set of vectorized data types (FP32Vec8, FP32Vec16, FP16Vec8, FP16Vec16, BF16Vec8, BF16Vec16, INT8Vec16, INT32Vec16) built on top of ARM NEON intrinsics via PyTorch's at::vec::Vectorized abstraction. The NxVectorizedTVecReg template composes multiple NEON register-width vectors into wider logical vectors, supporting load/store, arithmetic, type conversions, and partial element operations. The implementation uses compile-time loop unrolling for optimal codegen.

Usage

This header is conditionally included when compiling the vLLM CPU backend on ARM platforms (e.g., Apple Silicon, AWS Graviton). It provides the SIMD primitive types used by all CPU kernel implementations including attention, activation, and GEMM routines.

Code Reference

Source Location

Signature

namespace vec_op {

template <int N, typename T>
struct NxVectorizedTVecReg {
    using value_t = T;
    using VectorizedT = Vectorized<T>;
    VectorizedT val[N];

    static constexpr int size() noexcept;
    void save(void* ptr) const;
    void load(const void* ptr);
    void save(void* ptr, const int elem_num) const;
    void load(const void* ptr, const int elem_num);
};

// Type aliases composed from NxVectorizedTVecReg
using FP32Vec8   = NxVectorizedTVecReg<2, float>;
using FP32Vec16  = NxVectorizedTVecReg<4, float>;
using FP16Vec8   = NxVectorizedTVecReg<1, c10::Half>;
using FP16Vec16  = NxVectorizedTVecReg<2, c10::Half>;
using BF16Vec8   = NxVectorizedTVecReg<1, c10::BFloat16>;
using BF16Vec16  = NxVectorizedTVecReg<2, c10::BFloat16>;
using INT8Vec16  = NxVectorizedTVecReg<1, int8_t>;
using INT32Vec16 = NxVectorizedTVecReg<4, int32_t>;

} // namespace vec_op

Import

#include "cpu/cpu_types_arm.hpp"

I/O Contract

Inputs

Name Type Required Description
ptr const void* Yes Pointer to source data for vector load operations
elem_num int No Number of elements for partial load/store operations
v scalar type No Scalar value to broadcast into all vector lanes

Outputs

Name Type Description
Vector register NxVectorizedTVecReg<N, T> SIMD vector containing loaded/computed elements

Usage Examples

// Load 16 floats from memory into a vector
vec_op::FP32Vec16 vec(input_ptr);

// Perform element-wise multiply
vec_op::FP32Vec16 result = vec * scale_vec;

// Convert FP32 to BF16 and save
vec_op::BF16Vec16 bf16_result(result);
bf16_result.save(output_ptr);
