Environment:Vllm project Vllm CPU Runtime
| Knowledge Sources | |
|---|---|
| Domains | CPU_Inference, C++_Runtime |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
C++ CPU runtime environment for vLLM's native CPU inference backend, providing optimized kernel implementations for attention, activation, normalization, quantization, and mixture-of-experts operations across x86_64, AArch64, VSX (POWER), and VXE (s390x) architectures.
Description
This environment encompasses the C++ compilation and runtime infrastructure required to build and execute vLLM's CPU-specific kernels. The CPU backend is an alternative to the CUDA/ROCm GPU backends and targets production deployments where GPU hardware is unavailable or cost-prohibitive. It leverages architecture-specific SIMD intrinsics (AVX2, AVX-512, NEON, SVE, VSX, VXE) to maximize throughput on modern server CPUs. Key components include fused attention dispatchers, MoE (Mixture-of-Experts) kernels, layer normalization, positional encoding, shared-memory IPC primitives, and weight-only quantization (WNA16: N-bit weights with 16-bit activations) kernels. Thread-level parallelism is achieved via OpenMP and pthreads. On x86_64 platforms, the Intel oneDNN (DNNL) library provides optimized GEMM and convolution primitives. The SGL kernel integration (derived from the SGLang project's sgl-kernel library) provides high-performance GEMM kernels for FP8 and INT8 quantized inference.
Usage
To use the CPU backend, build vLLM with VLLM_TARGET_DEVICE=cpu. The build system auto-detects the available ISA extensions via compiler feature tests and selects the appropriate SIMD code paths. At runtime, tune the OpenMP thread count via the OMP_NUM_THREADS environment variable to match the physical core count. Shared-memory operations (CPU_SHM) require POSIX shared memory support (/dev/shm) for inter-process tensor exchange in multi-worker configurations.
Requirements
| Requirement | Value |
|---|---|
| C++ Standard | C++17 or later |
| Compiler | GCC >= 9.0 or Clang >= 10.0 with OpenMP support |
| Threading | OpenMP 4.5+ and pthreads |
| ISA Extensions (x86_64) | AVX2 (minimum), AVX-512 (recommended for best performance) |
| ISA Extensions (AArch64) | NEON (minimum), SVE (recommended) |
| ISA Extensions (POWER) | VSX (Vector Scalar Extension) |
| ISA Extensions (s390x) | VXE (Vector Extension for z/Architecture) |
| oneDNN (DNNL) | Intel oneDNN >= 3.0 (x86_64 only, for optimized GEMM) |
| CMake | >= 3.26.1 |
| Build System | Ninja (recommended) |
| Shared Memory | POSIX shared memory (/dev/shm) for CPU_SHM |
| Operating System | Linux (Ubuntu 20.04+, CentOS 7+) |
Semantic Links
- Implementation:Vllm_project_Vllm_CPU_Activation
- Implementation:Vllm_project_Vllm_CPU_Attn_Dispatcher
- Implementation:Vllm_project_Vllm_CPU_Float_Convert
- Implementation:Vllm_project_Vllm_CPU_Fused_MoE
- Implementation:Vllm_project_Vllm_CPU_Layernorm
- Implementation:Vllm_project_Vllm_CPU_MLA_Decode
- Implementation:Vllm_project_Vllm_CPU_Pos_Encoding
- Implementation:Vllm_project_Vllm_CPU_SHM
- Implementation:Vllm_project_Vllm_CPU_Torch_Bindings
- Implementation:Vllm_project_Vllm_CPU_Types_ARM
- Implementation:Vllm_project_Vllm_CPU_Types_Scalar
- Implementation:Vllm_project_Vllm_CPU_Types_VSX
- Implementation:Vllm_project_Vllm_CPU_Types_VXE
- Implementation:Vllm_project_Vllm_CPU_Types_X86
- Implementation:Vllm_project_Vllm_CPU_Utils
- Implementation:Vllm_project_Vllm_CPU_WNA16
- Implementation:Vllm_project_Vllm_DNNL_Helper
- Implementation:Vllm_project_Vllm_DNNL_Kernels
- Implementation:Vllm_project_Vllm_SGL_GEMM
- Implementation:Vllm_project_Vllm_SGL_GEMM_FP8
- Implementation:Vllm_project_Vllm_SGL_GEMM_INT8
- Implementation:Vllm_project_Vllm_SGL_MoE
- Implementation:Vllm_project_Vllm_SGL_MoE_FP8
- Implementation:Vllm_project_Vllm_SGL_MoE_INT8
- Implementation:Vllm_project_Vllm_SGL_Vec