
Implementation:Vllm project Vllm CPU Types X86

From Leeroopedia


Knowledge Sources
Domains CPU_Inference, SIMD, x86
Last Updated 2026-02-08 00:00 GMT

Overview

Defines x86 AVX2/AVX-512 vector types using Intel intrinsics for high-performance SIMD operations on Intel and AMD CPUs.

Description

This header provides the core vectorized data types (FP16Vec8, FP16Vec16, BF16Vec8, BF16Vec16, BF16Vec32, FP32Vec4, FP32Vec8, FP32Vec16, INT8Vec16, INT8Vec64, INT32Vec16) built on __m128i, __m256i, __m256, and __m512 registers using immintrin.h intrinsics. It requires AVX2 as a minimum and optionally leverages AVX-512 features including masked stores (_mm256_mask_storeu_epi16), non-temporal loads (_mm256_stream_load_si256), and FP8 (e5m2) support. The implementation includes a RDTSCP-based benchmarking timestamp utility and dispatch macros for Float, BFloat16, Half, and Float8_e5m2 scalar types.

Usage

This header is conditionally included when compiling vLLM on x86_64 platforms with AVX2 support. It is the primary SIMD backend for Intel/AMD server CPUs and provides the performance-critical vector primitives used throughout the CPU backend for attention, GEMM, activation, and quantization kernels.

Code Reference

Source Location

Signature

namespace vec_op {

struct FP16Vec8 : public Vec<FP16Vec8> {
    constexpr static int VEC_ELEM_NUM = 8;
    __m128i reg;
    explicit FP16Vec8(const void* ptr);
    explicit FP16Vec8(const FP32Vec8&);
    void save(void* ptr) const;
};

struct FP16Vec16 : public Vec<FP16Vec16> {
    constexpr static int VEC_ELEM_NUM = 16;
    __m256i reg;
    explicit FP16Vec16(const void* ptr);
    explicit FP16Vec16(const FP32Vec16&);
    void save(void* ptr) const;
    void save(void* ptr, const int elem_num) const;
};

struct BF16Vec16 : public Vec<BF16Vec16> {
    constexpr static int VEC_ELEM_NUM = 16;
    __m256i reg;
    explicit BF16Vec16(const void* ptr);
    explicit BF16Vec16(const FP32Vec16&);
    void save(void* ptr) const;
    void save(void* ptr, const int elem_num) const;
};

struct FP32Vec16 : public Vec<FP32Vec16> {
    constexpr static int VEC_ELEM_NUM = 16;
    __m512 reg;
    explicit FP32Vec16(const void* ptr);
    explicit FP32Vec16(float v);
    FP32Vec16 operator*(const FP32Vec16&) const;
    FP32Vec16 operator+(const FP32Vec16&) const;
    FP32Vec16 operator-(const FP32Vec16&) const;
    float reduce_sum() const;
};

FORCE_INLINE uint64_t bench_timestamp();

} // namespace vec_op

Import

#include "cpu/cpu_types_x86.hpp"

I/O Contract

Inputs

Name Type Required Description
ptr const void* Yes Pointer to source data for vector load via _mm256_loadu_si256 / _mm512_loadu_ps intrinsics
v float / c10::Half / c10::BFloat16 No Scalar value to broadcast via _mm256_set1_epi16 / _mm512_set1_ps into all vector lanes
elem_num int No Number of elements for masked partial store via AVX-512 mask operations

Outputs

Name Type Description
Vector struct FP32Vec16, BF16Vec16, FP16Vec16, etc. AVX2/AVX-512 register-backed vector containing SIMD-computed elements

Usage Examples

// Load 16 floats using AVX-512
vec_op::FP32Vec16 vec(input_ptr);

// Perform SIMD multiply
vec_op::FP32Vec16 result = vec * scale_vec;

// Convert FP32 to BF16 and save with element count
vec_op::BF16Vec16 bf16_result(result);
bf16_result.save(output_ptr, 12);  // save only first 12 elements

// Horizontal sum reduction
float total = result.reduce_sum();
