Implementation: vLLM CPU Torch Bindings
| Knowledge Sources | Details |
|---|---|
| Domains | PyTorch Integration, CPU Backend |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Exports all CPU-optimized operators to PyTorch via TORCH_LIBRARY, binding C++ implementations of attention, cache, GEMM, MoE, quantization, and normalization operations to Python-callable ops.
Description
This file serves as the main integration point between the C++ CPU backend and the Python frontend in vLLM. It uses the TORCH_LIBRARY_EXPAND macro to register operator definitions and CPU-specific implementations for the full suite of vLLM custom operations. This includes activation functions (SiLU, GELU variants), normalization (RMS norm), rotary embeddings, quantization (oneDNN scaled matmul, INT8, FP8), attention with KV cache, MoE dispatch, and shared memory communication primitives. Conditional compilation gates are used for platform-specific operators (AVX512, AArch64, PowerPC).
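Most of these ops are thin schema wrappers over C++ kernels. As an illustration of the kind of semantics being bound, the silu_and_mul activation can be sketched in plain Python (a reference sketch of the standard vLLM semantics, where the input's last dimension is split in half, SiLU is applied to the first half, and the result is multiplied elementwise by the second half; this is not the C++ implementation itself):

```python
import math

def silu_and_mul_ref(row):
    """Reference semantics of silu_and_mul on one row:
    split the row in half, apply SiLU (x * sigmoid(x)) to the
    first half, and multiply elementwise by the second half."""
    d = len(row) // 2
    gate, up = row[:d], row[d:]
    silu = lambda x: x / (1.0 + math.exp(-x))
    return [silu(g) * u for g, u in zip(gate, up)]

print(silu_and_mul_ref([1.0, -1.0, 2.0, 3.0]))
```

The real op writes into a preallocated `out` tensor (the `Tensor!` annotation) rather than returning a new one.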
Usage
This file is compiled as part of the vLLM CPU extension module. When the extension is loaded, PyTorch discovers and registers the operators, making them available from Python as torch.ops.{extension_name}.{op_name}, where {extension_name} is the value of TORCH_EXTENSION_NAME at build time.
Code Reference
Source Location
- Repository: vllm
- File: csrc/cpu/torch_bindings.cpp
- Lines: 1-341
Signature
// Key function declarations bound via TORCH_LIBRARY_EXPAND
TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
  // Activation ops
  ops.def("silu_and_mul(Tensor! out, Tensor input) -> ()");
  ops.def("gelu_and_mul(Tensor! out, Tensor input) -> ()");
  ops.def("gelu_tanh_and_mul(Tensor! out, Tensor input) -> ()");

  // Normalization
  ops.def("rms_norm(Tensor! out, Tensor input, Tensor weight, float epsilon) -> ()");
  ops.def("fused_add_rms_norm(Tensor! input, Tensor! residual, Tensor weight, float epsilon) -> ()");

  // Rotary embedding
  ops.def("rotary_embedding(Tensor positions, Tensor! query, Tensor!? key, ...) -> ()");

  // Quantization (conditional on AVX512/AArch64/PowerPC)
  ops.def("onednn_scaled_mm(Tensor! c, Tensor a, Tensor a_scales, ...) -> ()");
  ops.def("static_scaled_int8_quant(Tensor! out, Tensor input, Tensor scale, Tensor? azp) -> ()");

  // MoE
  ops.def("fused_experts_cpu(...) -> Tensor");

  // Attention and KV cache
  ops.def("cpu_attention_with_kv_cache(...) -> ()");
  ops.def("cpu_attn_reshape_and_cache(...) -> ()");

  // Shared memory communication
  ops.def("init_shm_manager(...) -> int");
  ops.def("shm_allreduce(int handle, Tensor! data) -> ()");
}
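The quantization schemas above hide simple per-element semantics. As a hedged Python sketch of what static_scaled_int8_quant computes (assuming round-to-nearest with saturation to the int8 range, and treating azp as the optional asymmetric zero point named in the schema; the actual kernel's rounding mode may differ in half-way cases):

```python
def static_scaled_int8_quant_ref(values, scale, azp=None):
    """Reference semantics of static_scaled_int8_quant: divide by a
    precomputed scale, optionally add an asymmetric zero point (azp),
    round to nearest, and saturate to the int8 range [-128, 127].
    Note: Python's round() is round-half-to-even; the C++ kernel's
    tie-breaking behavior is not reproduced exactly here."""
    out = []
    for v in values:
        q = round(v / scale) + (azp or 0)
        out.append(max(-128, min(127, q)))
    return out

print(static_scaled_int8_quant_ref([0.5, -3.2, 300.0], scale=0.5))
```

Values that exceed the representable range saturate rather than wrap, which is why the third input above clamps to 127.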
Import
#include "cache.h"
#include "ops.h"
#include "core/registration.h"
#include <torch/library.h>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| (various) | torch::Tensor | Yes | Input tensors for each registered operation |
| (various) | float/int64_t | Yes | Scalar parameters (epsilon, scale, dimensions, etc.) |
| (various) | std::string | No | ISA hints and configuration strings |
Outputs
| Name | Type | Description |
|---|---|---|
| (various) | torch::Tensor | Output tensors, often modified in-place (marked with Tensor!) |
| (various) | int64_t | Handles for stateful objects (oneDNN handlers, SHM managers) |
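The in-place `Tensor!` convention matters most for the fused normalization op, which mutates both of its tensor arguments. A minimal Python sketch of fused_add_rms_norm's mutation contract (reference semantics for one row only; the real op operates on torch tensors):

```python
import math

def fused_add_rms_norm_ref(input_row, residual_row, weight, epsilon):
    """Reference semantics of fused_add_rms_norm, which mutates both
    arguments in place (the Tensor! annotations in the schema):
    residual receives input + residual, and input receives the
    RMS-normalized sum scaled by weight."""
    n = len(input_row)
    for i in range(n):
        residual_row[i] += input_row[i]
    rms = math.sqrt(sum(x * x for x in residual_row) / n + epsilon)
    for i in range(n):
        input_row[i] = residual_row[i] / rms * weight[i]

x = [1.0, 2.0]
r = [1.0, 0.0]
fused_add_rms_norm_ref(x, r, [1.0, 1.0], 1e-6)
print(r)  # residual now holds the pre-norm sum
print(x)  # input now holds the normalized values
```

Storing the pre-norm sum back into residual is what lets the next layer reuse it without an extra addition kernel.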
Usage Examples
# From Python, after loading the CPU extension:
torch.ops.vllm.silu_and_mul(out, input)
torch.ops.vllm.rms_norm(out, input, weight, epsilon)
torch.ops.vllm.rotary_embedding(positions, query, key, head_size, cos_sin_cache, is_neox)

// Registration pattern in C++:
ops.def("silu_and_mul(Tensor! out, Tensor input) -> ()");
ops.impl("silu_and_mul", torch::kCPU, &silu_and_mul);