Implementation:Vllm project Vllm CPU Utils
| Knowledge Sources | |
|---|---|
| Domains | NUMA, CPU_Inference, Thread_Management |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Implements CPU thread affinity binding and NUMA-aware memory management to optimize multi-threaded inference on multi-socket CPU systems.
Description
This file provides the init_cpu_threads_env function that configures thread-to-core affinity and NUMA memory policies for the vLLM CPU backend. It parses a CPU ID string to determine which cores to bind to, migrates existing memory pages to the appropriate NUMA nodes, and sets memory allocation policies (MEMBIND for single-node, INTERLEAVE for multi-node). Additionally, it includes the ScratchPadManager class for managing aligned temporary memory allocations used by CPU kernels.
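The policy choice described above (MEMBIND for a single node, INTERLEAVE for several) can be sketched as follows. This is only an illustration under stated assumptions: `MemPolicy`, `NumaPlan`, and `plan_memory_policy` are hypothetical names, and the CPU-to-node lookup is passed in as a callback so the sketch runs without libnuma. On a real system that lookup would presumably be libnuma's `numa_node_of_cpu`.

```cpp
#include <functional>
#include <set>
#include <vector>

// Hypothetical names for illustration; not taken from csrc/cpu/utils.cpp.
enum class MemPolicy { Membind, Interleave };

struct NumaPlan {
  std::set<int> nodes;  // distinct NUMA nodes that own the requested cores
  MemPolicy policy;     // policy to apply to later allocations
};

// Collects the NUMA nodes behind the requested cores and picks a policy:
// bind to the node when there is only one, interleave when there are several.
// `node_of_cpu` is an assumed stand-in for a topology query.
NumaPlan plan_memory_policy(const std::vector<int>& cores,
                            const std::function<int(int)>& node_of_cpu) {
  NumaPlan plan{};
  for (int cpu : cores) {
    plan.nodes.insert(node_of_cpu(cpu));
  }
  plan.policy = plan.nodes.size() <= 1 ? MemPolicy::Membind
                                       : MemPolicy::Interleave;
  return plan;
}
```

Passing the topology lookup as a parameter keeps the decision logic testable on machines that have no NUMA hardware at all.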
Usage
The init_cpu_threads_env function is called during CPU backend initialization from Python to configure optimal thread placement. When NUMA is disabled at build time (VLLM_NUMA_DISABLED), the function returns a warning string without performing any binding. The ScratchPadManager provides a singleton scratch buffer for temporary kernel computations.
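To make the singleton scratch buffer idea concrete, here is a minimal sketch. It is not the ScratchPadManager implementation: the class name `ScratchBuffer`, the 64-byte alignment, the grow-only behavior, and the use of `posix_memalign` are all assumptions made for the example.

```cpp
#include <stdlib.h>  // posix_memalign, free (POSIX)

#include <cstddef>
#include <cstdint>
#include <new>

// Hypothetical class for illustration; not the vLLM ScratchPadManager.
class ScratchBuffer {
 public:
  static constexpr std::size_t kAlignment = 64;  // assumed cache-line alignment

  // One shared instance per process, created on first use.
  static ScratchBuffer& instance() {
    static ScratchBuffer buffer;
    return buffer;
  }

  // Grows the buffer when it is too small. Old contents are discarded,
  // which is acceptable for scratch space that holds only temporaries.
  void reserve(std::size_t new_size) {
    if (new_size <= size_) return;
    void* fresh = nullptr;
    if (posix_memalign(&fresh, kAlignment, new_size) != 0) {
      throw std::bad_alloc();
    }
    free(ptr_);
    ptr_ = fresh;
    size_ = new_size;
  }

  void* data() const { return ptr_; }
  std::size_t size() const { return size_; }

  ScratchBuffer(const ScratchBuffer&) = delete;
  ScratchBuffer& operator=(const ScratchBuffer&) = delete;
  ~ScratchBuffer() { free(ptr_); }

 private:
  ScratchBuffer() : size_(0), ptr_(nullptr) {}

  std::size_t size_;
  void* ptr_;
};
```

A kernel would call `ScratchBuffer::instance().reserve(n)` before using `data()`; repeated calls with smaller sizes reuse the existing allocation.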
Code Reference
Source Location
- Repository: vllm
- File: csrc/cpu/utils.cpp
- Lines: 1-188
Signature
```cpp
std::string init_cpu_threads_env(const std::string& cpu_ids);

namespace cpu_utils {
class ScratchPadManager {
 public:
  ScratchPadManager();
  void realloc(size_t new_size);
  static ScratchPadManager* get_scratchpad_manager();

 private:
  size_t size_;
  void* ptr_;
};
}  // namespace cpu_utils
```
Import
```cpp
#include "cpu/utils.hpp"
#include <numa.h>   // when NUMA is enabled
#include <sched.h>  // for sched_setaffinity
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| cpu_ids | std::string | Yes | Comma-separated CPU core IDs or ranges (e.g., "0-3,8-11") parsed by numa_parse_cpustring_all |
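
To illustrate the accepted input shape, the sketch below expands a string such as "0-3,8-11" into individual core IDs. It is a simplified stand-in, not the libnuma parser: `parse_cpu_list` is a hypothetical helper that handles only plain IDs and inclusive ranges, whereas numa_parse_cpustring_all returns a bitmask and may accept further syntax.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper, not part of vLLM or libnuma.
// Expands comma-separated IDs and inclusive ranges into a list of core IDs.
std::vector<int> parse_cpu_list(const std::string& cpu_ids) {
  std::vector<int> cores;
  std::stringstream stream(cpu_ids);
  std::string token;
  while (std::getline(stream, token, ',')) {
    if (token.empty()) {
      throw std::invalid_argument("empty entry in CPU list");
    }
    const std::size_t dash = token.find('-');
    if (dash == std::string::npos) {
      cores.push_back(std::stoi(token));  // single core ID
      continue;
    }
    // Inclusive range "first-last"; std::stoi throws on malformed numbers.
    const int first = std::stoi(token.substr(0, dash));
    const int last = std::stoi(token.substr(dash + 1));
    if (first > last) {
      throw std::invalid_argument("descending range: " + token);
    }
    for (int cpu = first; cpu <= last; ++cpu) {
      cores.push_back(cpu);
    }
  }
  return cores;
}
```

For "0-3,8-11" the sketch yields eight IDs: 0 through 3 followed by 8 through 11.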
Outputs
| Name | Type | Description |
|---|---|---|
| return | std::string | Diagnostic string listing the process ID and, for each OMP thread, its thread ID and the core it is bound to; a warning string when built with VLLM_NUMA_DISABLED |
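
The sketch below shows, under assumptions, how a thread can be pinned with sched_setaffinity and how a report of the shape described above might be assembled. The helper names (`first_allowed_cpu`, `bind_current_thread_to`, `format_binding_report`) are hypothetical, the report layout is inferred from the example string on this page, and the code is Linux-specific. It does not reproduce the OpenMP loop in the actual source.

```cpp
#include <sched.h>  // sched_getaffinity, sched_setaffinity, CPU_* macros

#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Returns the lowest CPU ID the calling thread may run on, or -1 on failure.
int first_allowed_cpu() {
  cpu_set_t allowed;
  CPU_ZERO(&allowed);
  if (sched_getaffinity(0, sizeof(cpu_set_t), &allowed) != 0) return -1;
  for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
    if (CPU_ISSET(cpu, &allowed)) return cpu;
  }
  return -1;
}

// Pins the calling thread (pid 0 means "this thread") to a single core.
bool bind_current_thread_to(int core) {
  if (core < 0 || core >= CPU_SETSIZE) return false;
  cpu_set_t mask;
  CPU_ZERO(&mask);
  CPU_SET(core, &mask);
  return sched_setaffinity(0, sizeof(cpu_set_t), &mask) == 0;
}

// Builds a report from (thread ID, core) pairs. The layout is an assumption
// based on the example output shown in this page.
std::string format_binding_report(
    long pid, const std::vector<std::pair<long, int> >& bindings) {
  std::ostringstream out;
  out << "OMP threads binding of Process " << pid << ":\n";
  for (const auto& binding : bindings) {
    out << "\tOMP tid: " << binding.first << ", core " << binding.second
        << "\n";
  }
  return out.str();
}
```

Binding to a core outside the process's allowed set fails, so the sketch queries the current affinity mask first rather than assuming core 0 is available.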
Usage Examples
```cpp
// Bind OMP threads to specific CPU cores on NUMA node 0
std::string single_node_report = init_cpu_threads_env("0-7");
// single_node_report: "OMP threads binding of Process 12345:\n\tOMP tid: 12346, core 0\n..."

// Multi-NUMA node binding with interleave policy
std::string multi_node_report = init_cpu_threads_env("0-3,16-19");
// Memory allocation will be interleaved across NUMA nodes containing those cores

// ScratchPad usage in CPU kernels
auto* manager = cpu_utils::ScratchPadManager::get_scratchpad_manager();
manager->realloc(1024 * 1024);  // Ensure at least 1 MiB of scratch space
```