Implementation: vLLM CPU Types x86
| Knowledge Sources | |
|---|---|
| Domains | CPU_Inference, SIMD, x86 |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Defines x86 AVX2/AVX-512 vector types using Intel intrinsics for high-performance SIMD operations on Intel and AMD CPUs.
Description
This header provides the core vectorized data types (FP16Vec8, FP16Vec16, BF16Vec8, BF16Vec16, BF16Vec32, FP32Vec4, FP32Vec8, FP32Vec16, INT8Vec16, INT8Vec64, INT32Vec16) built on __m128i, __m256i, __m256, and __m512 registers using immintrin.h intrinsics. It requires AVX2 as a minimum and optionally leverages AVX-512 features including masked stores (_mm256_mask_storeu_epi16), non-temporal loads (_mm256_stream_load_si256), and FP8 (e5m2) support. The implementation includes a RDTSCP-based benchmarking timestamp utility and dispatch macros for Float, BFloat16, Half, and Float8_e5m2 scalar types.
Usage
This header is conditionally included when compiling vLLM on x86_64 platforms with AVX2 support. It is the primary SIMD backend for Intel/AMD server CPUs and provides the performance-critical vector primitives used throughout the CPU backend for attention, GEMM, activation, and quantization kernels.
Code Reference
Source Location
- Repository: vllm
- File: csrc/cpu/cpu_types_x86.hpp
- Lines: 1-802
Signature
namespace vec_op {

struct FP16Vec8 : public Vec<FP16Vec8> {
  constexpr static int VEC_ELEM_NUM = 8;
  __m128i reg;
  explicit FP16Vec8(const void* ptr);
  explicit FP16Vec8(const FP32Vec8&);
  void save(void* ptr) const;
};

struct FP16Vec16 : public Vec<FP16Vec16> {
  constexpr static int VEC_ELEM_NUM = 16;
  __m256i reg;
  explicit FP16Vec16(const void* ptr);
  explicit FP16Vec16(const FP32Vec16&);
  void save(void* ptr) const;
  void save(void* ptr, const int elem_num) const;
};

struct BF16Vec16 : public Vec<BF16Vec16> {
  constexpr static int VEC_ELEM_NUM = 16;
  __m256i reg;
  explicit BF16Vec16(const void* ptr);
  explicit BF16Vec16(const FP32Vec16&);
  void save(void* ptr) const;
  void save(void* ptr, const int elem_num) const;
};

struct FP32Vec16 : public Vec<FP32Vec16> {
  constexpr static int VEC_ELEM_NUM = 16;
  __m512 reg;
  explicit FP32Vec16(const void* ptr);
  explicit FP32Vec16(float v);
  FP32Vec16 operator*(const FP32Vec16&) const;
  FP32Vec16 operator+(const FP32Vec16&) const;
  FP32Vec16 operator-(const FP32Vec16&) const;
  float reduce_sum() const;
};

FORCE_INLINE uint64_t bench_timestamp();

}  // namespace vec_op
Import
#include "cpu/cpu_types_x86.hpp"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| ptr | const void* | Yes | Pointer to source data for vector loads via _mm256_loadu_si256 / _mm512_loadu_ps-family intrinsics |
| v | float / c10::Half / c10::BFloat16 | No | Scalar value broadcast into all vector lanes via _mm256_set1_ps / _mm512_set1_ps-family intrinsics |
| elem_num | int | No | Number of elements for masked partial store via AVX-512 mask operations |
Outputs
| Name | Type | Description |
|---|---|---|
| Vector struct | FP32Vec16, BF16Vec16, FP16Vec16, etc. | AVX2/AVX-512 register-backed vector containing SIMD-computed elements |
Usage Examples
// Load 16 floats using AVX-512
vec_op::FP32Vec16 vec(input_ptr);
// Perform SIMD multiply
vec_op::FP32Vec16 result = vec * scale_vec;
// Convert FP32 to BF16 and save with element count
vec_op::BF16Vec16 bf16_result(result);
bf16_result.save(output_ptr, 12); // save only first 12 elements
// Horizontal sum reduction
float total = result.reduce_sum();