Implementation:Vllm project Vllm Torch Bindings
| Knowledge Sources | |
|---|---|
| Domains | PyTorch_Bindings, Quantization, Attention, Activation |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Registers vLLM's CUDA/ROCm custom operations with PyTorch's operator registry via TORCH_LIBRARY_EXPAND, exposing over 100 native kernels to Python.
Description
This file is the primary extension registration point for vLLM's C++ backend. It uses PyTorch's TORCH_LIBRARY_EXPAND macro to define operation schemas (with typed tensor arguments and return types) and bind their CUDA implementations. The registered operations span attention (paged_attention_v1/v2, merge_attn_states), activations (silu_and_mul, gelu variants, fatrelu), normalization (rms_norm, fused_add_rms_norm), rotary embedding, quantization (awq_gemm, awq_dequantize, marlin_gemm, machete_gemm, gptq_gemm, fp8/nvfp4 quantization), cache management (reshape_and_cache, swap_blocks), mixture-of-experts (fused_moe), sampling, and embedding operations. Conditional compilation guards (USE_ROCM) selectively exclude CUDA-only or ROCm-only operations.
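To make the schema strings easier to read, here is a minimal Python sketch that splits one into its name, arguments, and return type. It is not PyTorch's parser: `parse_schema` and `Arg` are hypothetical names introduced for illustration, and the sketch only handles the simple `Type name` argument form used in the schemas quoted on this page (no defaults, list types, or explicit alias sets). It relies on the PyTorch schema convention that a trailing `!` on a type marks an argument the kernel writes to, and that `-> ()` means the op returns nothing.

```python
import re
from typing import List, NamedTuple, Tuple


class Arg(NamedTuple):
    type_name: str  # e.g. "Tensor" or "float", with any trailing "!" removed
    name: str       # e.g. "result"
    mutated: bool   # True when the schema type ended in "!"


def parse_schema(schema: str) -> Tuple[str, List[Arg], str]:
    """Split 'name(args) -> ret' into (name, [Arg, ...], ret).

    Simplified illustration only: assumes comma-separated 'Type name' args.
    """
    m = re.fullmatch(r"\s*(\w+)\((.*)\)\s*->\s*(.+?)\s*", schema)
    if m is None:
        raise ValueError(f"not a schema string: {schema!r}")
    name, arg_text, ret = m.groups()
    args = []
    for piece in (p.strip() for p in arg_text.split(",")):
        if not piece:
            continue
        type_str, arg_name = piece.rsplit(None, 1)
        args.append(Arg(type_str.rstrip("!"), arg_name, type_str.endswith("!")))
    return name, args, ret


name, args, ret = parse_schema(
    "rms_norm(Tensor! result, Tensor input, Tensor weight, float epsilon) -> ()"
)
for arg in args:
    role = "written in place" if arg.mutated else "read only"
    print(f"{name}: {arg.type_name} {arg.name} ({role})")
```

Read this way, the `rms_norm` schema says the caller preallocates `result`, the kernel fills it, and nothing is returned, which is the pattern most of the `-> ()` operations on this page follow.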
Usage
This file is compiled as part of the vLLM PyTorch C++ extension (typically named _C or _vllm_C). The registered operations become accessible from Python via torch.ops.{TORCH_EXTENSION_NAME}.{op_name}() after the extension is loaded.
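Because conditional compilation can leave some operations out of a given build, callers may want to check that an op exists before using it. The following is a small sketch of such a check; `find_op` is a hypothetical helper, and a `SimpleNamespace` stands in for `torch.ops` so the example runs without vLLM or a GPU. It assumes that attribute lookup on the real `torch.ops.<namespace>` raises `AttributeError` for an unregistered op, which `getattr` with a default turns into `None`.

```python
from types import SimpleNamespace


def find_op(ops_root, namespace: str, op_name: str):
    """Return ops_root.<namespace>.<op_name>, or None if it is not present."""
    ns = getattr(ops_root, namespace, None)
    if ns is None:
        return None
    return getattr(ns, op_name, None)


# Stand-in for torch.ops with a single fake "_C" extension namespace.
fake_ops = SimpleNamespace(_C=SimpleNamespace(silu_and_mul=lambda out, x: None))

print(find_op(fake_ops, "_C", "silu_and_mul") is not None)    # True
print(find_op(fake_ops, "_C", "not_registered") is not None)  # False
```

With the extension loaded, the same helper could be called as `find_op(torch.ops, "_C", "silu_and_mul")`, where `_C` is whatever TORCH_EXTENSION_NAME was set to at build time.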
Code Reference
Source Location
- Repository: vllm
- File: csrc/torch_bindings.cpp
- Lines: 1-839
Signature
#include "cache.h"
#include "cuda_utils.h"
#include "ops.h"
#include "core/registration.h"
#include <torch/library.h>
TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
  // Attention ops
  ops.def("paged_attention_v1(...) -> ()");
  ops.impl("paged_attention_v1", torch::kCUDA, &paged_attention_v1);
  ops.def("paged_attention_v2(...) -> ()");
  ops.impl("paged_attention_v2", torch::kCUDA, &paged_attention_v2);
  ops.def("merge_attn_states(...) -> ()");
  ops.impl("merge_attn_states", torch::kCUDA, &merge_attn_states);

  // Activation ops
  ops.def("silu_and_mul(Tensor! result, Tensor input) -> ()");
  ops.impl("silu_and_mul", torch::kCUDA, &silu_and_mul);
  ops.def("gelu_and_mul(Tensor! out, Tensor input) -> ()");
  ops.impl("gelu_and_mul", torch::kCUDA, &gelu_and_mul);

  // Normalization ops
  ops.def("rms_norm(Tensor! result, Tensor input, Tensor weight, "
          "float epsilon) -> ()");
  ops.impl("rms_norm", torch::kCUDA, &rms_norm);

  // Quantization ops
  ops.def("awq_gemm(...) -> Tensor");
  ops.impl("awq_gemm", torch::kCUDA, &awq_gemm);
  ops.def("awq_dequantize(...) -> Tensor");
  ops.impl("awq_dequantize", torch::kCUDA, &awq_dequantize);

  // Rotary embedding
  ops.def("rotary_embedding(...) -> ()");
  ops.impl("rotary_embedding", torch::kCUDA, &rotary_embedding);

  // ... 100+ additional operations
}
Import
# This file is not imported directly; it is compiled into the
# vLLM PyTorch extension. From Python, use the wrapper module:
import torch
import vllm._custom_ops as ops
# or call the registered op directly once the extension is loaded:
torch.ops._C.silu_and_mul(result, input)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| TORCH_EXTENSION_NAME | macro | Yes | Name of the PyTorch extension library (defined at compile time) |
| ops | torch::Library& | Yes | PyTorch library object used to register operation schemas and implementations |
Outputs
| Name | Type | Description |
|---|---|---|
| Registered ops | PyTorch custom operators | All vLLM operations registered and accessible via torch.ops namespace |
Registered Operation Categories
| Category | Key Operations | Description |
|---|---|---|
| Attention | paged_attention_v1, paged_attention_v2, merge_attn_states | PagedAttention with block-based KV cache |
| Activation | silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, gelu_quick, fatrelu_and_mul | Fused activation functions for GLU variants |
| Normalization | rms_norm, fused_add_rms_norm, rms_norm_static_fp8_quant | RMS normalization with optional quantization fusion |
| Rotary Embedding | rotary_embedding | GPT-NeoX / GPT-J style positional encoding |
| Quantization | awq_gemm, awq_dequantize, marlin_gemm, machete_gemm, gptq_gemm | Quantized matrix multiplication kernels |
| Cache | reshape_and_cache, swap_blocks, copy_blocks | KV cache management operations |
| Sampling | top_k_per_row_prefill, top_k_per_row_decode | Token sampling utilities |
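As a reading aid for the Activation and Normalization rows, the sketch below spells out the standard mathematical definitions of SiLU-gated multiplication and RMS normalization in plain Python. It illustrates the semantics only and is not the CUDA kernel code: `silu_and_mul_ref` and `rms_norm_ref` are hypothetical names, they operate on a single row of floats, and the real kernels may differ in dtype handling and accumulation precision.

```python
import math
from typing import List, Sequence


def silu(x: float) -> float:
    """SiLU (swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def silu_and_mul_ref(row: Sequence[float]) -> List[float]:
    """Input row has length 2*d; output has length d.

    The first half is the gate (passed through SiLU), the second half is
    multiplied elementwise with the activated gate.
    """
    if len(row) % 2 != 0:
        raise ValueError("last dimension must be even")
    d = len(row) // 2
    return [silu(g) * u for g, u in zip(row[:d], row[d:])]


def rms_norm_ref(row: Sequence[float], weight: Sequence[float],
                 epsilon: float) -> List[float]:
    """Scale row by 1 / sqrt(mean(row^2) + epsilon), then by weight."""
    mean_sq = sum(v * v for v in row) / len(row)
    inv_rms = 1.0 / math.sqrt(mean_sq + epsilon)
    return [v * inv_rms * w for v, w in zip(row, weight)]


print(silu_and_mul_ref([0.0, 1.0, 3.0, 5.0]))
print(rms_norm_ref([3.0, 4.0], [1.0, 1.0], 1e-6))
```

Note that the fused activation halves the last dimension, so an output buffer passed to the real op must be sized for `d` elements per row, not `2*d`.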
Usage Examples
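# From Python, after the vLLM extension has been built and loaded
import torch
from vllm import _custom_ops as ops

# Call a registered operation.
# silu_and_mul reads a [..., 2 * d] input and writes a [..., d] output,
# so the output buffer has half the input's last dimension.
x = torch.randn(4, 2 * 128, device="cuda", dtype=torch.float16)
output = torch.empty(4, 128, device="cuda", dtype=torch.float16)
ops.silu_and_mul(output, x)

# PagedAttention v1 (all tensors and scalars are prepared by the caller)
ops.paged_attention_v1(
    out, query, key_cache, value_cache,
    num_kv_heads, scale, block_tables,
    seq_lens, block_size, max_seq_len,
    alibi_slopes, kv_cache_dtype,
    k_scale, v_scale, tp_rank,
    blocksparse_local_blocks,
    blocksparse_vert_stride,
    blocksparse_block_size,
    blocksparse_head_sliding_step,
)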