
Implementation:Vllm project Vllm CPU Torch Bindings

From Leeroopedia


Knowledge Sources
Domains PyTorch Integration, CPU Backend
Last Updated 2026-02-08 00:00 GMT

Overview

Exports all CPU-optimized operators to PyTorch via TORCH_LIBRARY, binding C++ implementations of attention, cache, GEMM, MoE, quantization, and normalization operations to Python-callable ops.

Description

This file serves as the main integration point between the C++ CPU backend and the Python frontend in vLLM. It uses the TORCH_LIBRARY_EXPAND macro to register operator definitions and CPU-specific implementations for the full suite of vLLM custom operations. This includes activation functions (SiLU, GELU variants), normalization (RMS norm), rotary embeddings, quantization (oneDNN scaled matmul, INT8, FP8), attention with KV cache, MoE dispatch, and shared memory communication primitives. Conditional compilation gates are used for platform-specific operators (AVX512, AArch64, PowerPC).
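To make the registered semantics concrete, here is a minimal pure-Python sketch of what silu_and_mul computes on a single row. This is an illustration of the kernel's math only, not the C++ implementation: the real op works in-place on torch tensors whose last dimension is split in half, with SiLU applied to the first half and multiplied elementwise by the second.

```python
import math

def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def silu_and_mul(row):
    # Reference semantics: split the last dimension in half,
    # apply SiLU to the first half, multiply by the second half.
    d = len(row) // 2
    return [silu(row[i]) * row[d + i] for i in range(d)]
```

The gelu_and_mul and gelu_tanh_and_mul ops follow the same split-and-multiply pattern with GELU in place of SiLU.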

Usage

This file is compiled as part of the vLLM CPU extension module. When the extension's shared library is loaded, the TORCH_LIBRARY registration blocks run and register each operator with PyTorch, making the ops available from Python via torch.ops.{extension_name}.{op_name}.

Code Reference

Source Location

Signature

// Key operator schemas registered via TORCH_LIBRARY_EXPAND
TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
  // Activation ops
  ops.def("silu_and_mul(Tensor! out, Tensor input) -> ()");
  ops.def("gelu_and_mul(Tensor! out, Tensor input) -> ()");
  ops.def("gelu_tanh_and_mul(Tensor! out, Tensor input) -> ()");

  // Normalization
  ops.def("rms_norm(Tensor! out, Tensor input, Tensor weight, float epsilon) -> ()");
  ops.def("fused_add_rms_norm(Tensor! input, Tensor! residual, Tensor weight, float epsilon) -> ()");

  // Rotary embedding
  ops.def("rotary_embedding(Tensor positions, Tensor! query, Tensor!? key, ...) -> ()");

  // Quantization (conditional on AVX512/AArch64/PowerPC)
  ops.def("onednn_scaled_mm(Tensor! c, Tensor a, Tensor a_scales, ...) -> ()");
  ops.def("static_scaled_int8_quant(Tensor! out, Tensor input, Tensor scale, Tensor? azp) -> ()");

  // MoE
  ops.def("fused_experts_cpu(...) -> Tensor");

  // Attention and KV cache
  ops.def("cpu_attention_with_kv_cache(...) -> ()");
  ops.def("cpu_attn_reshape_and_cache(...) -> ()");

  // Shared memory communication
  ops.def("init_shm_manager(...) -> int");
  ops.def("shm_allreduce(int handle, Tensor! data) -> ()");
};
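The rms_norm schema above can be read as the following per-row computation. This is a hedged pure-Python sketch of the standard RMSNorm formula (normalize by the root-mean-square of the row, then scale elementwise by the learned weight), not the vectorized C++ kernel:

```python
import math

def rms_norm(row, weight, epsilon):
    # Reference semantics: divide each element by the row's
    # root-mean-square (with epsilon for numerical stability),
    # then scale elementwise by the learned weight.
    rms = math.sqrt(sum(v * v for v in row) / len(row) + epsilon)
    return [v / rms * w for v, w in zip(row, weight)]
```

fused_add_rms_norm applies the same normalization after first adding the residual into the input in-place.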

Import

#include "cache.h"
#include "ops.h"
#include "core/registration.h"
#include <torch/library.h>

I/O Contract

Inputs

Name Type Required Description
(various) torch::Tensor Yes Input tensors for each registered operation
(various) float/int64_t Yes Scalar parameters (epsilon, scale, dimensions, etc.)
(various) std::string No ISA hints and configuration strings

Outputs

Name Type Description
(various) torch::Tensor Output tensors, often modified in-place (marked with Tensor!)
(various) int64_t Handles for stateful objects (oneDNN handlers, SHM managers)
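As an illustration of the scale and zero-point scalars listed above, here is a pure-Python sketch of the per-element semantics commonly associated with static_scaled_int8_quant. This is an assumed reading (divide by the precomputed scale, add the optional asymmetric zero point azp, round, clamp to the int8 range), not the actual vectorized C++ kernel:

```python
def static_scaled_int8_quant(row, scale, azp=0):
    # Reference semantics: quantize each element with a precomputed
    # static scale and optional asymmetric zero point, clamping to
    # the signed int8 range [-128, 127].
    out = []
    for v in row:
        q = round(v / scale) + azp
        out.append(max(-128, min(127, q)))
    return out
```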

Usage Examples

// From Python, after loading the CPU extension:
// torch.ops.vllm.silu_and_mul(out, input)
// torch.ops.vllm.rms_norm(out, input, weight, epsilon)
// torch.ops.vllm.rotary_embedding(positions, query, key, head_size, cos_sin_cache, is_neox)

// Registration pattern in C++:
ops.def("silu_and_mul(Tensor! out, Tensor input) -> ()");
ops.impl("silu_and_mul", torch::kCPU, &silu_and_mul);
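For the rotary_embedding call shown above with is_neox set, the following is a pure-Python sketch of GPT-NeoX-style rotation on one head vector. It assumes the standard convention: the vector is split in half, each (x1, x2) pair is rotated by a position-dependent angle, and base=10000.0 is an assumed default; the real op applies this to query and key tensors in-place using a precomputed cos/sin cache.

```python
import math

def rotary_embedding_neox(vec, position, base=10000.0):
    # Reference semantics for NeoX-style rotary embedding on one head
    # vector: pair element i with element i + d/2 and rotate the pair
    # by an angle that depends on the position and the pair index.
    d = len(vec)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        theta = position * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = vec[i], vec[half + i]
        out[i] = x1 * c - x2 * s
        out[half + i] = x1 * s + x2 * c
    return out
```

At position 0 the rotation is the identity, and because each pair undergoes a pure rotation, the vector's norm is preserved at every position.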
