Implementation:Vllm project Vllm Ops Header

Knowledge Sources	vllm
Domains	API, Attention, Quantization, MoE, Activation
Last Updated	2026-02-08 00:00 GMT

Overview

Central C++ header declaring all vLLM custom GPU operations, serving as the contract between the Python frontend and the C++/CUDA backend.

Description

This header file declares function signatures for every custom operation in vLLM's CUDA extension library. It covers paged attention (v1 and v2), RMS normalization, rotary embeddings, activation functions (SiLU, GELU, FATReLU), quantization kernels (FP8, INT8, FP4, AWQ, GPTQ), CUTLASS scaled matrix multiplication, MoE operations, cache management, custom all-reduce, selective scan (Mamba), and sparse attention. The file also includes a utility function weak_ref_tensor for creating non-owning CUDA tensor views. Many declarations are conditionally compiled with USE_ROCM guards for AMD GPU compatibility.

Usage

This header is included by the PyTorch C++ extension registration code (torch_bindings.cpp) and by the individual kernel implementation files. The registration code binds each declared function to an operator name, so calls made through PyTorch's torch.ops mechanism are dispatched to the matching CUDA implementation.
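The declare, register, dispatch relationship described here can be illustrated with a toy analogue in plain C++. This is only a sketch: it does not use PyTorch, and the names scale_in_place, op_registry, and dispatch are hypothetical stand-ins rather than anything found in vLLM.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a declaration that would live in a shared header.
void scale_in_place(std::vector<float>& data, double factor);

// Hypothetical stand-in for a kernel implementation file.
void scale_in_place(std::vector<float>& data, double factor) {
  for (float& v : data) v = static_cast<float>(v * factor);
}

using OpFn = std::function<void(std::vector<float>&, double)>;

// Hypothetical stand-in for the registration file: binds operator names
// to the functions declared above.
std::unordered_map<std::string, OpFn>& op_registry() {
  static std::unordered_map<std::string, OpFn> registry = {
      {"scale_in_place", &scale_in_place},
  };
  return registry;
}

// Looks an operation up by name and calls it, as a dispatcher would.
void dispatch(const std::string& name, std::vector<float>& data,
              double factor) {
  auto it = op_registry().find(name);
  assert(it != op_registry().end() && "operation was never registered");
  it->second(data, factor);
}
```

The point of the sketch is that the declaration is the only thing the registration code and the implementation need to agree on, which is the role the header plays for the real extension.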

Code Reference

Source Location

Repository: vllm
File: csrc/ops.h
Lines: 1-406

Signature

// Utility
torch::Tensor weak_ref_tensor(torch::Tensor& tensor);

// Paged Attention
void paged_attention_v1(torch::Tensor& out, torch::Tensor& query, ...);
void paged_attention_v2(torch::Tensor& out, torch::Tensor& exp_sums, ...);
void merge_attn_states(torch::Tensor& output, ...);

// Normalization
void rms_norm(torch::Tensor& out, torch::Tensor& input, torch::Tensor& weight, double epsilon);
void fused_add_rms_norm(torch::Tensor& input, torch::Tensor& residual, torch::Tensor& weight, double epsilon);

// Activations
void silu_and_mul(torch::Tensor& out, torch::Tensor& input);
void gelu_and_mul(torch::Tensor& out, torch::Tensor& input);
void gelu_tanh_and_mul(torch::Tensor& out, torch::Tensor& input);

// Rotary Embedding
void rotary_embedding(torch::Tensor& positions, torch::Tensor& query, ...);

// Quantization
void static_scaled_fp8_quant(torch::Tensor& out, torch::Tensor const& input, torch::Tensor const& scale, ...);
void dynamic_scaled_fp8_quant(torch::Tensor& out, torch::Tensor const& input, torch::Tensor& scale);
void static_scaled_int8_quant(torch::Tensor& out, torch::Tensor const& input, torch::Tensor const& scale, ...);

// CUTLASS Operations
void cutlass_scaled_mm(torch::Tensor& out, torch::Tensor const& a, torch::Tensor const& b, ...);
void cutlass_scaled_mm_azp(torch::Tensor& out, ...);
void cutlass_moe_mm(torch::Tensor& out_tensors, ...);

// Custom All-Reduce
fptr_t init_custom_ar(const std::vector<int64_t>& fake_ipc_ptrs, ...);
void all_reduce(fptr_t _fa, torch::Tensor& inp, torch::Tensor& out, ...);
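One detail in the quantization signatures above: static_scaled_fp8_quant takes scale as a const reference, while dynamic_scaled_fp8_quant takes it as a mutable reference, which is consistent with the dynamic variant computing the scale from the input and writing it back. The following CPU sketch shows that idea for a single tensor-wide scale. It is not the vLLM kernel: the function name is hypothetical, the E4M3 maximum of 448 is an assumed target format, the final rounding to 8 bits is omitted, and any minimum-scale handling in the real kernel is not modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest finite value of the FP8 E4M3 (e4m3fn) format; assumed target format.
constexpr float kFp8E4m3Max = 448.0f;

// CPU reference sketch (hypothetical name): derives one scale from the input
// and writes both the scaled values and the scale through output parameters.
// `out` holds floats clamped to the FP8 range; rounding to 8 bits is omitted.
void dynamic_scaled_quant_reference(std::vector<float>& out,
                                    const std::vector<float>& input,
                                    float& scale) {
  assert(out.size() == input.size());
  float abs_max = 0.0f;
  for (float v : input) abs_max = std::max(abs_max, std::fabs(v));
  // Guard against an all-zero input so the division below stays finite.
  scale = abs_max > 0.0f ? abs_max / kFp8E4m3Max : 1.0f;
  for (std::size_t i = 0; i < input.size(); ++i) {
    const float scaled = input[i] / scale;
    out[i] = std::max(-kFp8E4m3Max, std::min(kFp8E4m3Max, scaled));
  }
}
```

A static variant would skip the abs_max pass and read a caller-supplied scale instead, which is why its scale parameter can be const.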

Import

#include "ops.h"

I/O Contract

Inputs

Name	Type	Required	Description
torch::Tensor arguments	torch::Tensor, torch::Tensor&, torch::Tensor const&	Varies	Input and output tensors, mostly passed by mutable or const reference; specific shapes depend on the operation
scalar parameters	int64_t, double, bool, std::string	Varies	Configuration values such as epsilon, block_size, scale factors
optional tensors	std::optional<torch::Tensor>	No	Optional inputs like alibi_slopes, bias, azp, scale_ub
vllm::ScalarType	ScalarType	Varies	Scalar type descriptors for quantized operations

Outputs

Name	Type	Description
out / output tensors	torch::Tensor& (in-place)	Most operations write results into pre-allocated output tensors passed by reference
return values	torch::Tensor / bool / int64_t	Some operations return new tensors (e.g., awq_gemm, gptq_gemm) or status values
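The out-parameter convention in the table can be sketched on the CPU using RMS normalization as the example. This is a reference for the standard RMSNorm formula over a single row, not the vLLM kernel: the function name is hypothetical, and std::vector<float> stands in for torch::Tensor.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference sketch (hypothetical name) of RMS normalization over one row.
// Mirrors the out-parameter convention: the caller allocates `out`, and the
// function writes into it instead of returning a new buffer.
void rms_norm_reference(std::vector<float>& out,
                        const std::vector<float>& input,
                        const std::vector<float>& weight,
                        double epsilon) {
  assert(out.size() == input.size() && weight.size() == input.size());
  assert(!input.empty());
  double sum_sq = 0.0;
  for (float v : input) sum_sq += static_cast<double>(v) * v;
  // 1 / sqrt(mean(x^2) + epsilon)
  const double inv_rms =
      1.0 / std::sqrt(sum_sq / static_cast<double>(input.size()) + epsilon);
  for (std::size_t i = 0; i < input.size(); ++i) {
    out[i] = static_cast<float>(input[i] * inv_rms * weight[i]);
  }
}
```

Writing into a pre-allocated buffer lets the caller reuse memory across calls, which is presumably why most declarations in the header follow this pattern.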

Usage Examples

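// Paged Attention V1
// Assumes query, key_cache, value_cache, block_tables, seq_lens, k_scale,
// v_scale, and opts (tensor options) are already set up by the caller.
torch::Tensor out = torch::empty({num_seqs, num_heads, head_size}, opts);
paged_attention_v1(out, query, key_cache, value_cache,
                   num_kv_heads, scale, block_tables, seq_lens,
                   block_size, max_seq_len, alibi_slopes,
                   "auto",  // kv_cache_dtype
                   k_scale, v_scale, tp_rank, 0, 0, 0, 0);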

// RMS Normalization
rms_norm(output, input, weight, 1e-6);

// CUTLASS Scaled Matrix Multiply
cutlass_scaled_mm(out, a, b, a_scales, b_scales, bias);

// FP8 Quantization
dynamic_scaled_fp8_quant(quantized_out, input, scale);

// SiLU and Mul activation
silu_and_mul(activated_out, gate_up_input);
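To make the silu_and_mul call above more concrete, the sketch below computes the same kind of gated activation on the CPU for a single row: the input holds 2*d values, the first half is passed through SiLU and multiplied elementwise by the second half, and the output holds d values. The function names are hypothetical and std::vector<float> stands in for torch::Tensor; this illustrates the math only, not the CUDA kernel.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// SiLU (also called swish): x * sigmoid(x).
inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// CPU reference sketch (hypothetical name) for one row: `input` holds 2*d
// values, gate in the first half and the multiplier in the second half;
// the caller allocates `out` with d values.
void silu_and_mul_reference(std::vector<float>& out,
                            const std::vector<float>& input) {
  assert(input.size() == 2 * out.size());
  const std::size_t d = out.size();
  for (std::size_t i = 0; i < d; ++i) {
    out[i] = silu(input[i]) * input[d + i];
  }
}
```

Note that the output is half the width of the input, so the pre-allocated out tensor must be sized accordingly.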

Related Pages

Environment:Vllm_project_Vllm_CUDA_GPU_Runtime
