Implementation:Vllm project Vllm Ops Header
| Knowledge Sources | |
|---|---|
| Domains | API, Attention, Quantization, MoE, Activation |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Central C++ header declaring all vLLM custom GPU operations, serving as the contract between the Python frontend and the C++/CUDA backend.
Description
This header file declares function signatures for every custom operation in vLLM's CUDA extension library. It covers paged attention (v1 and v2), RMS normalization, rotary embeddings, activation functions (SiLU, GELU, FATReLU), quantization kernels (FP8, INT8, FP4, AWQ, GPTQ), CUTLASS scaled matrix multiplication, MoE operations, cache management, custom all-reduce, selective scan (Mamba), and sparse attention. The file also includes a utility function weak_ref_tensor for creating non-owning CUDA tensor views. Many declarations are conditionally compiled with USE_ROCM guards for AMD GPU compatibility.
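The "non-owning view" idea behind weak_ref_tensor can be illustrated without PyTorch. The following is a minimal plain C++ sketch, not the vLLM implementation: `WeakView`, `make_weak_view`, and `fill` are hypothetical names introduced only for illustration, and a `std::vector<float>` stands in for a CUDA tensor.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a non-owning tensor view: it records where the
// data lives and how many elements it spans, but never allocates or frees it.
struct WeakView {
    float* data;       // borrowed pointer; the storage is owned elsewhere
    std::size_t size;  // number of elements visible through the view
};

// Build a view over an existing buffer without copying it.
inline WeakView make_weak_view(std::vector<float>& owner) {
    return WeakView{owner.data(), owner.size()};
}

// Writes through the view land directly in the owner's storage.
inline void fill(WeakView view, float value) {
    assert(view.data != nullptr || view.size == 0);
    for (std::size_t i = 0; i < view.size; ++i) {
        view.data[i] = value;
    }
}
```

Because the view does not extend the lifetime of the storage, the owner must outlive every view created from it.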
Usage
This header is included by the PyTorch C++ extension registration code (torch_bindings.cpp) and by individual kernel implementation files. The registration code binds each declared function to an operator name, which is how Python calls made through torch.ops reach the corresponding CUDA implementation.
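The general pattern of declaring a function in one place, binding it to a name, and dispatching by that name can be shown with a toy registry. This is only an analogy in plain C++: it does not use PyTorch's registration machinery, and `Buffer`, `Op`, `registry`, `register_op`, `call_op`, and `double_values` are all hypothetical names that do not appear in vLLM.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

using Buffer = std::vector<float>;
using Op = std::function<void(Buffer& out, const Buffer& in)>;

// Toy registry mapping an operator name to its implementation.
inline std::map<std::string, Op>& registry() {
    static std::map<std::string, Op> ops;
    return ops;
}

// Bind an implementation to a name (the role played by the binding file).
inline void register_op(const std::string& name, Op fn) {
    registry()[name] = std::move(fn);
}

// Look an operator up by name and invoke it, failing loudly if it is unknown.
inline void call_op(const std::string& name, Buffer& out, const Buffer& in) {
    auto it = registry().find(name);
    if (it == registry().end()) {
        throw std::runtime_error("unknown op: " + name);
    }
    it->second(out, in);
}

// Example implementation of the kind that would live in a kernel file.
inline void double_values(Buffer& out, const Buffer& in) {
    assert(out.size() == in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = 2.0f * in[i];
    }
}
```

In this sketch the header's job corresponds to the declaration of `double_values`: both the registration site and the implementation agree on one signature.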
Code Reference
Source Location
- Repository: vllm
- File: csrc/ops.h
- Lines: 1-406
Signature
// Utility
torch::Tensor weak_ref_tensor(torch::Tensor& tensor);
// Paged Attention
void paged_attention_v1(torch::Tensor& out, torch::Tensor& query, ...);
void paged_attention_v2(torch::Tensor& out, torch::Tensor& exp_sums, ...);
void merge_attn_states(torch::Tensor& output, ...);
// Normalization
void rms_norm(torch::Tensor& out, torch::Tensor& input, torch::Tensor& weight, double epsilon);
void fused_add_rms_norm(torch::Tensor& input, torch::Tensor& residual, torch::Tensor& weight, double epsilon);
// Activations
void silu_and_mul(torch::Tensor& out, torch::Tensor& input);
void gelu_and_mul(torch::Tensor& out, torch::Tensor& input);
void gelu_tanh_and_mul(torch::Tensor& out, torch::Tensor& input);
// Rotary Embedding
void rotary_embedding(torch::Tensor& positions, torch::Tensor& query, ...);
// Quantization
void static_scaled_fp8_quant(torch::Tensor& out, torch::Tensor const& input, torch::Tensor const& scale, ...);
void dynamic_scaled_fp8_quant(torch::Tensor& out, torch::Tensor const& input, torch::Tensor& scale);
void static_scaled_int8_quant(torch::Tensor& out, torch::Tensor const& input, torch::Tensor const& scale, ...);
// CUTLASS Operations
void cutlass_scaled_mm(torch::Tensor& out, torch::Tensor const& a, torch::Tensor const& b, ...);
void cutlass_scaled_mm_azp(torch::Tensor& out, ...);
void cutlass_moe_mm(torch::Tensor& out_tensors, ...);
// Custom All-Reduce
fptr_t init_custom_ar(const std::vector<int64_t>& fake_ipc_ptrs, ...);
void all_reduce(fptr_t _fa, torch::Tensor& inp, torch::Tensor& out, ...);
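To make the rms_norm declaration more concrete, here is a CPU reference of the standard RMSNorm formula, assuming the input is a row-major buffer whose rows have the same length as the weight vector. It follows the (out, input, weight, epsilon) parameter order listed above, but `rms_norm_ref` is a hypothetical name, `std::vector<float>` replaces `torch::Tensor`, and nothing here reflects how the CUDA kernel is written.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference for standard RMSNorm applied to each row independently:
//   y[i] = x[i] / sqrt(mean(x^2) + epsilon) * weight[i]
// The row length is taken from weight.size() (an assumption of this sketch).
inline void rms_norm_ref(std::vector<float>& out,
                         const std::vector<float>& input,
                         const std::vector<float>& weight,
                         double epsilon) {
    const std::size_t hidden = weight.size();
    assert(hidden > 0 && input.size() % hidden == 0);
    assert(out.size() == input.size());

    const std::size_t rows = input.size() / hidden;
    for (std::size_t row = 0; row < rows; ++row) {
        const float* x = input.data() + row * hidden;
        float* y = out.data() + row * hidden;

        // Accumulate the sum of squares in double for a stable reference.
        double sum_sq = 0.0;
        for (std::size_t i = 0; i < hidden; ++i) {
            sum_sq += static_cast<double>(x[i]) * x[i];
        }
        const double inv_rms =
            1.0 / std::sqrt(sum_sq / static_cast<double>(hidden) + epsilon);

        for (std::size_t i = 0; i < hidden; ++i) {
            y[i] = static_cast<float>(x[i] * inv_rms * weight[i]);
        }
    }
}
```

Like the declared operation, the sketch writes into a caller-allocated output rather than returning a new buffer.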
Import
#include "ops.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| torch::Tensor arguments | torch::Tensor& / torch::Tensor const& | Varies | Input/output tensors passed by reference; specific shapes depend on the operation |
| scalar parameters | int64_t, double, bool, std::string | Varies | Configuration values such as epsilon, block_size, scale factors |
| optional tensors | std::optional<torch::Tensor> | No | Optional inputs like alibi_slopes, bias, azp, scale_ub |
| vllm::ScalarType | ScalarType | Varies | Scalar type descriptors for quantized operations |
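The combination of required inputs, scalar parameters, and an optional input can be sketched with a small CPU reference for a scaled matrix multiply with an optional bias. This is an illustration under simplifying assumptions, not the cutlass_scaled_mm contract: `scaled_mm_ref` is a hypothetical name, both matrices are row-major, the scales are single per-tensor floats, and the explicit `M`, `K`, `N` parameters exist only because plain vectors carry no shape.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical CPU reference:
//   out[m][n] = a_scale * b_scale * sum_k a[m][k] * b[k][n]  (+ bias[n] if given)
// a is M x K and b is K x N, both row-major int8 (assumptions of this sketch).
inline void scaled_mm_ref(std::vector<float>& out,
                          const std::vector<std::int8_t>& a,
                          const std::vector<std::int8_t>& b,
                          std::size_t M, std::size_t K, std::size_t N,
                          float a_scale, float b_scale,
                          const std::optional<std::vector<float>>& bias) {
    assert(a.size() == M * K && b.size() == K * N && out.size() == M * N);
    assert(!bias.has_value() || bias->size() == N);

    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            // Accumulate in 32-bit integers, then apply the scales once.
            std::int32_t acc = 0;
            for (std::size_t k = 0; k < K; ++k) {
                acc += static_cast<std::int32_t>(a[m * K + k]) *
                       static_cast<std::int32_t>(b[k * N + n]);
            }
            float value = a_scale * b_scale * static_cast<float>(acc);
            if (bias.has_value()) {
                value += (*bias)[n];  // the optional input is applied only when present
            }
            out[m * N + n] = value;
        }
    }
}
```

Passing `std::nullopt` for the bias skips the addition entirely, which mirrors how an optional tensor argument lets a caller omit an input without needing a separate overload.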
Outputs
| Name | Type | Description |
|---|---|---|
| out / output tensors | torch::Tensor& (in-place) | Most operations write results into pre-allocated output tensors passed by reference |
| return values | torch::Tensor / bool / int64_t | Some operations return new tensors (e.g., awq_gemm, gptq_gemm) or status values |
Usage Examples
// Paged Attention V1
torch::Tensor out = torch::empty({num_seqs, num_heads, head_size}, opts);
paged_attention_v1(out, query, key_cache, value_cache,
                   num_kv_heads, scale, block_tables, seq_lens,
                   block_size, max_seq_len, alibi_slopes,
                   "auto", k_scale, v_scale, tp_rank, 0, 0, 0, 0);
// RMS Normalization
rms_norm(output, input, weight, 1e-6);
// CUTLASS Scaled Matrix Multiply
cutlass_scaled_mm(out, a, b, a_scales, b_scales, bias);
// FP8 Quantization
dynamic_scaled_fp8_quant(quantized_out, input, scale);
// SiLU and Mul activation
silu_and_mul(activated_out, gate_up_input);
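As a rough guide to what the silu_and_mul call computes, the following CPU sketch assumes the usual gated-activation layout: each input row holds `2 * d` values, the first `d` are passed through SiLU, and the result is multiplied elementwise by the last `d`. `silu` and `silu_and_mul_ref` are hypothetical helper names, and the explicit `d` parameter is needed only because plain vectors carry no shape.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// SiLU (also called swish): x * sigmoid(x).
inline float silu(float x) {
    return x / (1.0f + std::exp(-x));
}

// CPU reference for a gated activation over rows of length 2 * d:
//   out[row][i] = silu(input[row][i]) * input[row][d + i]
inline void silu_and_mul_ref(std::vector<float>& out,
                             const std::vector<float>& input,
                             std::size_t d) {
    assert(d > 0 && input.size() % (2 * d) == 0);
    assert(out.size() == input.size() / 2);

    const std::size_t rows = input.size() / (2 * d);
    for (std::size_t row = 0; row < rows; ++row) {
        const float* gate = input.data() + row * 2 * d;  // first half of the row
        const float* up = gate + d;                      // second half of the row
        float* y = out.data() + row * d;
        for (std::size_t i = 0; i < d; ++i) {
            y[i] = silu(gate[i]) * up[i];
        }
    }
}
```

The output has half as many elements as the input, which is why the caller pre-allocates it with the halved last dimension.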