Implementation: vLLM CPU Torch Bindings
| Knowledge Sources | Details |
|---|---|
| Domains | PyTorch Integration, CPU Backend |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Exports all CPU-optimized operators to PyTorch via TORCH_LIBRARY, binding C++ implementations of attention, cache, GEMM, MoE, quantization, and normalization operations to Python-callable ops.
Description
This file serves as the main integration point between the C++ CPU backend and the Python frontend in vLLM. It uses the TORCH_LIBRARY_EXPAND macro to register operator definitions and CPU-specific implementations for the full suite of vLLM custom operations. This includes activation functions (SiLU, GELU variants), normalization (RMS norm), rotary embeddings, quantization (oneDNN scaled matmul, INT8, FP8), attention with KV cache, MoE dispatch, and shared memory communication primitives. Conditional compilation gates are used for platform-specific operators (AVX512, AArch64, PowerPC).
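Most of these ops are thin schema wrappers over C++ kernels. As an illustration of the kind of semantics being bound, the silu_and_mul activation can be sketched in plain Python (a reference sketch of the standard vLLM semantics, where the input's last dimension is split in half, SiLU is applied to the first half, and the result is multiplied elementwise by the second half; this is not the C++ implementation itself):

```python
import math

def silu_and_mul_ref(row):
    """Reference semantics of silu_and_mul on one row:
    split the row in half, apply SiLU (x * sigmoid(x)) to the
    first half, and multiply elementwise by the second half."""
    d = len(row) // 2
    gate, up = row[:d], row[d:]
    silu = lambda x: x / (1.0 + math.exp(-x))
    return [silu(g) * u for g, u in zip(gate, up)]

print(silu_and_mul_ref([1.0, -1.0, 2.0, 3.0]))
```

The real op writes into a preallocated `out` tensor (the `Tensor!` annotation) rather than returning a new one.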
Usage
This file is compiled as part of the vLLM CPU extension module. When the extension is loaded, PyTorch discovers and registers the operators, making them available from Python as torch.ops.{extension_name}.{op_name}, where {extension_name} is the value of TORCH_EXTENSION_NAME at build time.
Code Reference
Source Location
- Repository: vllm
- File: csrc/cpu/torch_bindings.cpp
- Lines: 1-341
Signature
// Key function declarations bound via TORCH_LIBRARY_EXPAND
TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
  // Activation ops
  ops.def("silu_and_mul(Tensor! out, Tensor input) -> ()");
  ops.def("gelu_and_mul(Tensor! out, Tensor input) -> ()");
  ops.def("gelu_tanh_and_mul(Tensor! out, Tensor input) -> ()");

  // Normalization
  ops.def("rms_norm(Tensor! out, Tensor input, Tensor weight, float epsilon) -> ()");
  ops.def("fused_add_rms_norm(Tensor! input, Tensor! residual, Tensor weight, float epsilon) -> ()");

  // Rotary embedding
  ops.def("rotary_embedding(Tensor positions, Tensor! query, Tensor!? key, ...) -> ()");

  // Quantization (conditional on AVX512/AArch64/PowerPC)
  ops.def("onednn_scaled_mm(Tensor! c, Tensor a, Tensor a_scales, ...) -> ()");
  ops.def("static_scaled_int8_quant(Tensor! out, Tensor input, Tensor scale, Tensor? azp) -> ()");

  // MoE
  ops.def("fused_experts_cpu(...) -> Tensor");

  // Attention and KV cache
  ops.def("cpu_attention_with_kv_cache(...) -> ()");
  ops.def("cpu_attn_reshape_and_cache(...) -> ()");

  // Shared memory communication
  ops.def("init_shm_manager(...) -> int");
  ops.def("shm_allreduce(int handle, Tensor! data) -> ()");
}
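The quantization schemas above hide simple per-element semantics. As a hedged Python sketch of what static_scaled_int8_quant computes (assuming round-to-nearest with saturation to the int8 range, and treating azp as the optional asymmetric zero point named in the schema; the actual kernel's rounding mode may differ in half-way cases):

```python
def static_scaled_int8_quant_ref(values, scale, azp=None):
    """Reference semantics of static_scaled_int8_quant: divide by a
    precomputed scale, optionally add an asymmetric zero point (azp),
    round to nearest, and saturate to the int8 range [-128, 127].
    Note: Python's round() is round-half-to-even; the C++ kernel's
    tie-breaking behavior is not reproduced exactly here."""
    out = []
    for v in values:
        q = round(v / scale) + (azp or 0)
        out.append(max(-128, min(127, q)))
    return out

print(static_scaled_int8_quant_ref([0.5, -3.2, 300.0], scale=0.5))
```

Values that exceed the representable range saturate rather than wrap, which is why the third input above clamps to 127.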
Import
#include "cache.h"
#include "ops.h"
#include "core/registration.h"
#include <torch/library.h>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| (various) | torch::Tensor | Yes | Input tensors for each registered operation |
| (various) | float/int64_t | Yes | Scalar parameters (epsilon, scale, dimensions, etc.) |
| (various) | std::string | No | ISA hints and configuration strings |
Outputs
| Name | Type | Description |
|---|---|---|
| (various) | torch::Tensor | Output tensors, often modified in-place (marked with Tensor!) |
| (various) | int64_t | Handles for stateful objects (oneDNN handlers, SHM managers) |
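The in-place `Tensor!` convention matters most for the fused normalization op, which mutates both of its tensor arguments. A minimal Python sketch of fused_add_rms_norm's mutation contract (reference semantics for one row only; the real op operates on torch tensors):

```python
import math

def fused_add_rms_norm_ref(input_row, residual_row, weight, epsilon):
    """Reference semantics of fused_add_rms_norm, which mutates both
    arguments in place (the Tensor! annotations in the schema):
    residual receives input + residual, and input receives the
    RMS-normalized sum scaled by weight."""
    n = len(input_row)
    for i in range(n):
        residual_row[i] += input_row[i]
    rms = math.sqrt(sum(x * x for x in residual_row) / n + epsilon)
    for i in range(n):
        input_row[i] = residual_row[i] / rms * weight[i]

x = [1.0, 2.0]
r = [1.0, 0.0]
fused_add_rms_norm_ref(x, r, [1.0, 1.0], 1e-6)
print(r)  # residual now holds the pre-norm sum
print(x)  # input now holds the normalized values
```

Storing the pre-norm sum back into residual is what lets the next layer reuse it without an extra addition kernel.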
Usage Examples
# From Python, after loading the CPU extension:
torch.ops.vllm.silu_and_mul(out, input)
torch.ops.vllm.rms_norm(out, input, weight, epsilon)
torch.ops.vllm.rotary_embedding(positions, query, key, head_size, cos_sin_cache, is_neox)

// Registration pattern in C++:
ops.def("silu_and_mul(Tensor! out, Tensor input) -> ()");
ops.impl("silu_and_mul", torch::kCPU, &silu_and_mul);