Implementation:Vllm project Vllm Torch Bindings

From Leeroopedia


Knowledge Sources
Domains: PyTorch_Bindings, Quantization, Attention, Activation
Last Updated: 2026-02-08 00:00 GMT

Overview

Registers all vLLM CUDA/ROCm custom operations with PyTorch's operator registry via the TORCH_LIBRARY_EXPAND macro, exposing over 100 native kernels to Python.

Description

This file is the primary extension registration point for vLLM's C++ backend. It uses PyTorch's TORCH_LIBRARY_EXPAND macro to define operation schemas (with typed tensor arguments and return types) and bind their CUDA implementations. The registered operations span attention (paged_attention_v1/v2, merge_attn_states), activations (silu_and_mul, gelu variants, fatrelu), normalization (rms_norm, fused_add_rms_norm), rotary embedding, quantization (awq_gemm, awq_dequantize, marlin_gemm, machete_gemm, gptq_gemm, fp8/nvfp4 quantization), cache management (reshape_and_cache, swap_blocks), mixture-of-experts (fused_moe), sampling, and embedding operations. Conditional compilation guards (USE_ROCM) selectively exclude CUDA-only or ROCm-only operations.
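The two-step pattern described above (declare a schema, then bind a backend-specific implementation to it) can be illustrated with a toy registry. This is a minimal sketch in plain Python, not PyTorch's dispatcher: ToyLibrary, its methods, and the scale_inplace op are hypothetical names invented for illustration.

```python
from typing import Callable, Dict, Tuple


class ToyLibrary:
    """Hypothetical stand-in for a torch::Library object; not a PyTorch API."""

    def __init__(self, namespace: str) -> None:
        self.namespace = namespace
        self.schemas: Dict[str, str] = {}                   # op name -> schema string
        self.kernels: Dict[Tuple[str, str], Callable] = {}  # (op name, backend) -> kernel

    def define(self, schema: str) -> None:
        # The op name is everything before the opening parenthesis.
        name = schema.split("(", 1)[0].strip()
        if name in self.schemas:
            raise ValueError(f"{name} is already defined")
        self.schemas[name] = schema

    def impl(self, name: str, backend: str, fn: Callable) -> None:
        # A kernel can only be bound to an op whose schema was declared first.
        if name not in self.schemas:
            raise KeyError(f"no schema for {name}")
        self.kernels[(name, backend)] = fn

    def call(self, name: str, backend: str, *args):
        # Dispatch selects the kernel registered for the requested backend.
        return self.kernels[(name, backend)](*args)


def scale_inplace(result: list, values: list, factor: float) -> None:
    # Out-parameter style, like the "Tensor! result" ops: writes into result.
    result[:] = [v * factor for v in values]


lib = ToyLibrary("_C")
lib.define("scale_inplace(Tensor! result, Tensor input, float factor) -> ()")
lib.impl("scale_inplace", "CUDA", scale_inplace)

out = [0.0, 0.0]
lib.call("scale_inplace", "CUDA", out, [1.0, 2.0], 3.0)
print(out)
```

Keeping the schema separate from the kernel is what lets one operator name carry different implementations per backend, which is the role the torch::kCUDA argument plays in the real registration calls.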

Usage

This file is compiled as part of the vLLM PyTorch C++ extension (typically named _C or _vllm_C). The registered operations become accessible from Python via torch.ops.{TORCH_EXTENSION_NAME}.{op_name}() after the extension is loaded.

Code Reference

Source Location

Signature

#include "cache.h"
#include "cuda_utils.h"
#include "ops.h"
#include "core/registration.h"
#include <torch/library.h>

TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
  // Attention ops
  ops.def("paged_attention_v1(...) -> ()");
  ops.impl("paged_attention_v1", torch::kCUDA, &paged_attention_v1);

  ops.def("paged_attention_v2(...) -> ()");
  ops.impl("paged_attention_v2", torch::kCUDA, &paged_attention_v2);

  ops.def("merge_attn_states(...) -> ()");
  ops.impl("merge_attn_states", torch::kCUDA, &merge_attn_states);

  // Activation ops
  ops.def("silu_and_mul(Tensor! result, Tensor input) -> ()");
  ops.impl("silu_and_mul", torch::kCUDA, &silu_and_mul);

  ops.def("gelu_and_mul(Tensor! out, Tensor input) -> ()");
  ops.impl("gelu_and_mul", torch::kCUDA, &gelu_and_mul);

  // Normalization ops
  ops.def("rms_norm(Tensor! result, Tensor input, Tensor weight, "
          "float epsilon) -> ()");
  ops.impl("rms_norm", torch::kCUDA, &rms_norm);

  // Quantization ops
  ops.def("awq_gemm(...) -> Tensor");
  ops.impl("awq_gemm", torch::kCUDA, &awq_gemm);

  ops.def("awq_dequantize(...) -> Tensor");
  ops.impl("awq_dequantize", torch::kCUDA, &awq_dequantize);

  // Rotary embedding
  ops.def("rotary_embedding(...) -> ()");
  ops.impl("rotary_embedding", torch::kCUDA, &rotary_embedding);

  // ... 100+ additional operations
}

Import

# This file is not imported directly; it is compiled into the
# vLLM PyTorch extension. From Python:
import torch
import vllm._custom_ops as ops
# or call the registered operator directly (extension named _C):
torch.ops._C.silu_and_mul(result, input)

I/O Contract

Inputs

Name | Type | Required | Description
TORCH_EXTENSION_NAME | macro | Yes | Name of the PyTorch extension library (defined at compile time)
ops | torch::Library& | Yes | PyTorch library object used to register operation schemas and implementations

Outputs

Name | Type | Description
Registered ops | PyTorch custom operators | All vLLM operations registered and accessible via the torch.ops namespace

Registered Operation Categories

Category | Key Operations | Description
Attention | paged_attention_v1, paged_attention_v2, merge_attn_states | PagedAttention with block-based KV cache
Activation | silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, gelu_quick, fatrelu_and_mul | Fused activation functions for GLU variants
Normalization | rms_norm, fused_add_rms_norm, rms_norm_static_fp8_quant | RMS normalization with optional quantization fusion
Rotary Embedding | rotary_embedding | GPT-NeoX / GPT-J style positional encoding
Quantization | awq_gemm, awq_dequantize, marlin_gemm, machete_gemm, gptq_gemm | Quantized matrix multiplication kernels
Cache | reshape_and_cache, swap_blocks, copy_blocks | KV cache management operations
Sampling | top_k_per_row_prefill, top_k_per_row_decode | Token sampling utilities
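To make two of the listed operations concrete, the following is a plain-Python sketch of the math that silu_and_mul and rms_norm compute along the last dimension. The function names silu_and_mul_ref and rms_norm_ref are hypothetical; the real kernels operate on GPU tensors, write into a preallocated result, and may differ in precision handling.

```python
import math
from typing import List


def silu_and_mul_ref(x: List[float]) -> List[float]:
    """Reference for one row: silu(first half) * second half."""
    d = len(x) // 2
    gate, up = x[:d], x[d:]
    # silu(g) = g * sigmoid(g) = g / (1 + exp(-g))
    return [g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)]


def rms_norm_ref(x: List[float], weight: List[float], epsilon: float) -> List[float]:
    """Reference for one row: x / sqrt(mean(x^2) + epsilon) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + epsilon)
    return [v * inv_rms * w for v, w in zip(x, weight)]


activated = silu_and_mul_ref([0.0, 1.0, 2.0, 3.0])  # input width 4, output width 2
normed = rms_norm_ref([3.0, 4.0], [1.0, 1.0], 0.0)
print(activated, normed)
```

The halving of the last dimension in silu_and_mul_ref is why the result tensor passed to the fused kernel is half as wide as its input.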

Usage Examples

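# From Python after loading the vLLM extension (requires a CUDA or ROCm device)
import torch
from vllm import _custom_ops as ops

# Call a registered operation. silu_and_mul reads an input of shape
# [..., 2 * d] and writes a result of shape [..., d].
input = torch.randn(4, 256, device="cuda", dtype=torch.float16)
output = torch.empty((*input.shape[:-1], input.shape[-1] // 2),
                     dtype=input.dtype, device=input.device)
ops.silu_and_mul(output, input)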

# PagedAttention v1
ops.paged_attention_v1(
    out, query, key_cache, value_cache,
    num_kv_heads, scale, block_tables,
    seq_lens, block_size, max_seq_len,
    alibi_slopes, kv_cache_dtype,
    k_scale, v_scale, tp_rank,
    blocksparse_local_blocks,
    blocksparse_vert_stride,
    blocksparse_block_size,
    blocksparse_head_sliding_step
)
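The block_tables and block_size arguments reflect the block-based KV cache: each sequence maps logical token positions to physical cache blocks. The sketch below shows only that address translation in plain Python; locate_token and the sample values are hypothetical, and the real key_cache and value_cache tensor layouts are more involved.

```python
from typing import List, Tuple


def locate_token(block_table: List[int], token_pos: int,
                 block_size: int) -> Tuple[int, int]:
    """Map a logical token position to (physical block id, offset in block)."""
    logical_block = token_pos // block_size  # which entry of the block table
    offset = token_pos % block_size          # slot inside that block
    return block_table[logical_block], offset


# One sequence whose three logical blocks live in physical blocks 7, 2 and 9.
block_table = [7, 2, 9]
block_size = 16

locations = [locate_token(block_table, pos, block_size) for pos in (0, 15, 16, 40)]
print(locations)  # [(7, 0), (7, 15), (2, 0), (9, 8)]
```

Because the table is indexed per sequence, physical blocks do not need to be contiguous, which is what allows cache operations such as swap_blocks and copy_blocks to move blocks independently.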
