Implementation:Vllm project Vllm Torch Bindings

Knowledge Sources	vllm
Domains	PyTorch_Bindings, Quantization, Attention, Activation
Last Updated	2026-02-08 00:00 GMT

Overview

Registers all vLLM CUDA/ROCm custom operations to PyTorch's operator registry via TORCH_LIBRARY_EXPAND, exposing over 100 native kernels to Python.

Description

This file is the primary extension registration point for vLLM's C++ backend. It uses the TORCH_LIBRARY_EXPAND macro (a vLLM wrapper around PyTorch's TORCH_LIBRARY, provided by core/registration.h) to define operation schemas with typed arguments and return types, and to bind each schema to its CUDA implementation. The registered operations span attention (paged_attention_v1/v2, merge_attn_states), activations (silu_and_mul, gelu variants, fatrelu), normalization (rms_norm, fused_add_rms_norm), rotary embedding, quantization (awq_gemm, awq_dequantize, marlin_gemm, machete_gemm, gptq_gemm, fp8/nvfp4 quantization), cache management (reshape_and_cache, swap_blocks), mixture-of-experts (fused_moe), sampling, and embedding operations. Conditional compilation guards on USE_ROCM exclude operations that are available only on CUDA or only on ROCm from builds for the other platform.

Usage

This file is compiled as part of the vLLM PyTorch C++ extension (typically named _C or _vllm_C). The registered operations become accessible from Python via torch.ops.{TORCH_EXTENSION_NAME}.{op_name}() after the extension is loaded.

Code Reference

Source Location

Repository: vllm
File: csrc/torch_bindings.cpp
Lines: 1-839

Signature

#include "cache.h"
#include "cuda_utils.h"
#include "ops.h"
#include "core/registration.h"
#include <torch/library.h>

TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
  // Attention ops
  ops.def("paged_attention_v1(...) -> ()");
  ops.impl("paged_attention_v1", torch::kCUDA, &paged_attention_v1);

  ops.def("paged_attention_v2(...) -> ()");
  ops.impl("paged_attention_v2", torch::kCUDA, &paged_attention_v2);

  ops.def("merge_attn_states(...) -> ()");
  ops.impl("merge_attn_states", torch::kCUDA, &merge_attn_states);

  // Activation ops
  ops.def("silu_and_mul(Tensor! result, Tensor input) -> ()");
  ops.impl("silu_and_mul", torch::kCUDA, &silu_and_mul);

  ops.def("gelu_and_mul(Tensor! out, Tensor input) -> ()");
  ops.impl("gelu_and_mul", torch::kCUDA, &gelu_and_mul);

  // Normalization ops
  ops.def("rms_norm(Tensor! result, Tensor input, Tensor weight, "
          "float epsilon) -> ()");
  ops.impl("rms_norm", torch::kCUDA, &rms_norm);

  // Quantization ops
  ops.def("awq_gemm(...) -> Tensor");
  ops.impl("awq_gemm", torch::kCUDA, &awq_gemm);

  ops.def("awq_dequantize(...) -> Tensor");
  ops.impl("awq_dequantize", torch::kCUDA, &awq_dequantize);

  // Rotary embedding
  ops.def("rotary_embedding(...) -> ()");
  ops.impl("rotary_embedding", torch::kCUDA, &rotary_embedding);

  // ... 100+ additional operations
}
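The schema strings passed to ops.def follow PyTorch's function schema syntax: a `!` suffix on a tensor type marks an argument the kernel writes to in place, and `-> ()` means the op returns nothing. As a rough illustration only, the sketch below parses simple schema strings of this shape and reports which arguments are mutated. The helper `parse_schema` is hypothetical and not part of vLLM or PyTorch; it handles only plain `type name` argument lists, not defaults, alias sets, or the elided `(...)` forms shown above.

```python
import re


def parse_schema(schema: str):
    """Split a simple schema string into (name, args, returns).

    Each arg is a (type, name, mutated) tuple; `mutated` is True when the
    type carries the `!` suffix. Hypothetical helper for illustration.
    """
    match = re.fullmatch(r"\s*(\w+)\((.*)\)\s*->\s*(.+?)\s*", schema)
    if match is None:
        raise ValueError(f"unrecognized schema: {schema!r}")
    name, arg_text, returns = match.groups()
    args = []
    for piece in filter(None, (p.strip() for p in arg_text.split(","))):
        # The last whitespace-separated token is the argument name.
        type_name, arg_name = piece.rsplit(None, 1)
        mutated = type_name.endswith("!")
        args.append((type_name.rstrip("!"), arg_name, mutated))
    return name, args, returns


name, args, returns = parse_schema(
    "silu_and_mul(Tensor! result, Tensor input) -> ()"
)
written = [arg_name for _, arg_name, mutated in args if mutated]
print(name, "writes to", written, "and returns", returns)
```

Read this way, silu_and_mul takes a pre-allocated result tensor and fills it, which is why the Python call sites allocate the output before invoking the op.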

Import

# This file is not imported directly; it is compiled into the
# vLLM PyTorch extension. From Python:
import torch
import vllm._custom_ops as ops
# or call the registered op directly (assuming the extension is named _C):
torch.ops._C.silu_and_mul(result, input)
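Because an op only appears under torch.ops once the compiled extension has been loaded, it can be useful to probe for it before calling. The following is a minimal sketch under that assumption; `op_available` is a hypothetical helper, not a vLLM or PyTorch API, and the namespace `_C` is taken from the example above.

```python
import importlib.util


def op_available(namespace: str, op_name: str) -> bool:
    """Return True if torch.ops.<namespace>.<op_name> resolves.

    Hypothetical helper: returns False when PyTorch is not installed or
    when no operator with that name has been registered.
    """
    if importlib.util.find_spec("torch") is None:
        return False  # PyTorch itself is not installed
    import torch

    try:
        # Looking up an unregistered op raises instead of returning None.
        getattr(getattr(torch.ops, namespace), op_name)
    except (AttributeError, RuntimeError):
        return False
    return True


if op_available("_C", "silu_and_mul"):
    print("silu_and_mul is registered")
else:
    print("silu_and_mul is not registered; is the extension loaded?")
```

The check is only a convenience for diagnostics; it says nothing about whether the op has an implementation for the device of the tensors later passed to it.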

I/O Contract

Inputs

Name	Type	Required	Description
TORCH_EXTENSION_NAME	macro	Yes	Name of the PyTorch extension library (defined at compile time)
ops	torch::Library&	Yes	PyTorch library object used to register operation schemas and implementations

Outputs

Name	Type	Description
Registered ops	PyTorch custom operators	All vLLM operations registered and accessible via torch.ops namespace

Registered Operation Categories

Category	Key Operations	Description
Attention	paged_attention_v1, paged_attention_v2, merge_attn_states	PagedAttention with block-based KV cache
Activation	silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, gelu_quick, fatrelu_and_mul	Fused activation functions for GLU variants
Normalization	rms_norm, fused_add_rms_norm, rms_norm_static_fp8_quant	RMS normalization with optional quantization fusion
Rotary Embedding	rotary_embedding	GPT-NeoX / GPT-J style positional encoding
Quantization	awq_gemm, awq_dequantize, marlin_gemm, machete_gemm, gptq_gemm	Quantized matrix multiplication kernels
Cache	reshape_and_cache, swap_blocks, copy_blocks	KV cache management operations
Sampling	top_k_per_row_prefill, top_k_per_row_decode	Token sampling utilities
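To make the "fused activation for GLU variants" entry concrete, here is a pure-Python reference for what silu_and_mul computes on a single row, assuming the usual convention that the input's last dimension holds the gate half followed by the up half. It is a readability sketch, not the CUDA kernel; `silu_and_mul_ref` is a hypothetical name.

```python
import math


def silu(x: float) -> float:
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def silu_and_mul_ref(row):
    """Reference for one row: out[i] = silu(row[i]) * row[d + i].

    The input row has length 2 * d; the output has length d.
    """
    if len(row) % 2 != 0:
        raise ValueError("row length must be even (2 * d)")
    d = len(row) // 2
    return [silu(row[i]) * row[d + i] for i in range(d)]


out = silu_and_mul_ref([1.0, 2.0, 3.0, 4.0])
print(len(out), out)
```

The fused kernel performs the same arithmetic across all rows in one pass, which avoids materializing the intermediate activation tensor.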

Usage Examples

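# From Python after loading the vLLM extension
import torch
from vllm import _custom_ops as ops

# Call a registered operation. `input` is an existing CUDA tensor of shape
# [..., 2 * d]; silu_and_mul writes a result of shape [..., d], so the
# output's last dimension is half the input's.
d = input.shape[-1] // 2
output = torch.empty(input.shape[:-1] + (d,), dtype=input.dtype, device=input.device)
ops.silu_and_mul(output, input)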

# PagedAttention v1
ops.paged_attention_v1(
    out, query, key_cache, value_cache,
    num_kv_heads, scale, block_tables,
    seq_lens, block_size, max_seq_len,
    alibi_slopes, kv_cache_dtype,
    k_scale, v_scale, tp_rank,
    blocksparse_local_blocks,
    blocksparse_vert_stride,
    blocksparse_block_size,
    blocksparse_head_sliding_step
)
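The block_tables and block_size arguments reflect the block-based KV cache: each sequence's cache is split into fixed-size blocks that may live anywhere in physical memory, and the block table maps logical block indices to physical ones. The sketch below models only that address translation in plain Python, under the simplifying assumption of one flat block table per sequence; `locate_token` is a hypothetical helper and does not describe the kernel's actual memory layout.

```python
def locate_token(block_table, block_size, token_pos):
    """Map a token position to (physical_block, offset_in_block).

    block_table[i] holds the physical block number backing logical block i
    of one sequence. Simplified model for illustration only.
    """
    logical_block, offset = divmod(token_pos, block_size)
    return block_table[logical_block], offset


# One sequence whose three logical blocks are scattered in physical memory.
block_table = [7, 2, 9]
for pos in (0, 17, 47):
    print(pos, "->", locate_token(block_table, 16, pos))
```

With a block size of 16, token 17 falls in logical block 1 at offset 1, so it is read from physical block 2.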

Related Pages

Environment:Vllm_project_Vllm_CUDA_GPU_Runtime
