Implementation:Vllm project Vllm Ops Header

From Leeroopedia


Knowledge Sources
Domains API, Attention, Quantization, MoE, Activation
Last Updated 2026-02-08 00:00 GMT

Overview

Central C++ header declaring all vLLM custom GPU operations, serving as the contract between the Python frontend and the C++/CUDA backend.

Description

This header file declares function signatures for every custom operation in vLLM's CUDA extension library. It covers paged attention (v1 and v2), RMS normalization, rotary embeddings, activation functions (SiLU, GELU, FATReLU), quantization kernels (FP8, INT8, FP4, AWQ, GPTQ), CUTLASS scaled matrix multiplication, MoE operations, cache management, custom all-reduce, selective scan (Mamba), and sparse attention. The file also includes a utility function weak_ref_tensor for creating non-owning CUDA tensor views. Many declarations are conditionally compiled with USE_ROCM guards for AMD GPU compatibility.
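As a rough illustration of the paging idea behind the paged attention entry points, the sketch below shows how a token position can be mapped to a physical cache block through a per-sequence block table. The names `PhysicalSlot` and `locate_token` are hypothetical and do not appear in the header; the layout (one physical block number per logical block, fixed `block_size`) is an assumption based on the `block_tables` and `block_size` arguments the attention functions take.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: which physical block holds a token, and where in it.
struct PhysicalSlot {
  std::int64_t block;   // physical block number in the KV cache
  std::int64_t offset;  // slot index inside that block
};

// Assumption: block_table[i] is the physical block number that stores
// logical block i of a single sequence.
PhysicalSlot locate_token(const std::vector<std::int64_t>& block_table,
                          std::int64_t token_pos, std::int64_t block_size) {
  assert(block_size > 0 && token_pos >= 0);
  const std::int64_t logical_block = token_pos / block_size;
  assert(static_cast<std::size_t>(logical_block) < block_table.size());
  return {block_table[static_cast<std::size_t>(logical_block)],
          token_pos % block_size};
}
```

With a block table of `{7, 2, 9}` and a block size of 16, token position 20 falls in logical block 1, so it resolves to physical block 2 at offset 4.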

Usage

This header is included by the PyTorch C++ extension registration code (torch_bindings.cpp) and by the individual kernel implementation files. The registration code binds each declared function to an operator name, which lets PyTorch's torch.ops mechanism dispatch calls from Python to the corresponding CUDA implementation.

Code Reference

Source Location

Signature

// Utility
torch::Tensor weak_ref_tensor(torch::Tensor& tensor);

// Paged Attention
void paged_attention_v1(torch::Tensor& out, torch::Tensor& query, ...);
void paged_attention_v2(torch::Tensor& out, torch::Tensor& exp_sums, ...);
void merge_attn_states(torch::Tensor& output, ...);

// Normalization
void rms_norm(torch::Tensor& out, torch::Tensor& input, torch::Tensor& weight, double epsilon);
void fused_add_rms_norm(torch::Tensor& input, torch::Tensor& residual, torch::Tensor& weight, double epsilon);

// Activations
void silu_and_mul(torch::Tensor& out, torch::Tensor& input);
void gelu_and_mul(torch::Tensor& out, torch::Tensor& input);
void gelu_tanh_and_mul(torch::Tensor& out, torch::Tensor& input);

// Rotary Embedding
void rotary_embedding(torch::Tensor& positions, torch::Tensor& query, ...);

// Quantization
void static_scaled_fp8_quant(torch::Tensor& out, torch::Tensor const& input, torch::Tensor const& scale, ...);
void dynamic_scaled_fp8_quant(torch::Tensor& out, torch::Tensor const& input, torch::Tensor& scale);
void static_scaled_int8_quant(torch::Tensor& out, torch::Tensor const& input, torch::Tensor const& scale, ...);

// CUTLASS Operations
void cutlass_scaled_mm(torch::Tensor& out, torch::Tensor const& a, torch::Tensor const& b, ...);
void cutlass_scaled_mm_azp(torch::Tensor& out, ...);
void cutlass_moe_mm(torch::Tensor& out_tensors, ...);

// Custom All-Reduce
fptr_t init_custom_ar(const std::vector<int64_t>& fake_ipc_ptrs, ...);
void all_reduce(fptr_t _fa, torch::Tensor& inp, torch::Tensor& out, ...);
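
To make the in-place conventions of the normalization signatures concrete, here is a minimal CPU reference sketch for a single row of hidden states. It is not vLLM's kernel: `rms_norm_ref` and `fused_add_rms_norm_ref` are hypothetical names, `std::vector<float>` stands in for `torch::Tensor`, and the fused semantics (residual updated first, then the normalized result written back into `input`) are stated as an assumption about what the CUDA implementation computes.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reference: out[i] = input[i] / sqrt(mean(input^2) + epsilon) * weight[i]
void rms_norm_ref(std::vector<float>& out, const std::vector<float>& input,
                  const std::vector<float>& weight, double epsilon) {
  assert(!input.empty());
  assert(out.size() == input.size() && weight.size() == input.size());
  double sum_sq = 0.0;
  for (float x : input) sum_sq += static_cast<double>(x) * x;
  const double inv_rms = 1.0 / std::sqrt(sum_sq / input.size() + epsilon);
  for (std::size_t i = 0; i < input.size(); ++i) {
    out[i] = static_cast<float>(input[i] * inv_rms * weight[i]);
  }
}

// Hypothetical reference for the fused variant. Both tensors are modified:
//   residual <- input + residual
//   input    <- rms_norm(residual) * weight
void fused_add_rms_norm_ref(std::vector<float>& input,
                            std::vector<float>& residual,
                            const std::vector<float>& weight, double epsilon) {
  assert(input.size() == residual.size());
  for (std::size_t i = 0; i < input.size(); ++i) residual[i] += input[i];
  rms_norm_ref(input, residual, weight, epsilon);
}
```

Note that the fused variant has no separate `out` argument, which is why its signature takes `input` and `residual` as mutable references.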

Import

#include "ops.h"

I/O Contract

Inputs

Name | Type | Required | Description
torch::Tensor arguments | torch::Tensor / torch::Tensor& | Varies | Input and output tensors passed by reference; the expected shapes depend on the operation
scalar parameters | int64_t, double, bool, std::string | Varies | Configuration values such as epsilon, block_size, and scale factors
optional tensors | std::optional<torch::Tensor> | No | Optional inputs such as alibi_slopes, bias, azp, and scale_ub
vllm::ScalarType | ScalarType | Varies | Scalar type descriptors for quantized operations

Outputs

Name | Type | Description
out / output tensors | torch::Tensor& (in-place) | Most operations write results into pre-allocated output tensors passed by reference
return values | torch::Tensor / bool / int64_t | Some operations return new tensors (e.g., awq_gemm, gptq_gemm) or status values
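
The contract above (pre-allocated output by mutable reference, read-only inputs by const reference, optional inputs as `std::optional`) can be sketched with a CPU reference for a scaled matrix multiply. Everything here is illustrative: `Matrix` and `scaled_mm_ref` are hypothetical, and the scale layout (one scale per row of `a`, one per column of `b`) is an assumption chosen for the example rather than a statement of what `cutlass_scaled_mm` requires.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical row-major matrix used only for this sketch.
template <typename T>
struct Matrix {
  std::size_t rows, cols;
  std::vector<T> data;
  Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}
  T& at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
  const T& at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// out[i][j] = a_scales[i] * b_scales[j] * sum_k a[i][k] * b[k][j] (+ bias[j])
// `out` must already have shape (a.rows, b.cols); nothing is allocated here.
void scaled_mm_ref(Matrix<float>& out, const Matrix<std::int8_t>& a,
                   const Matrix<std::int8_t>& b,
                   const std::vector<float>& a_scales,
                   const std::vector<float>& b_scales,
                   const std::optional<std::vector<float>>& bias) {
  assert(a.cols == b.rows && out.rows == a.rows && out.cols == b.cols);
  assert(a_scales.size() == a.rows && b_scales.size() == b.cols);
  assert(!bias || bias->size() == b.cols);
  for (std::size_t i = 0; i < a.rows; ++i) {
    for (std::size_t j = 0; j < b.cols; ++j) {
      std::int32_t acc = 0;  // accumulate in a wider integer type
      for (std::size_t k = 0; k < a.cols; ++k) {
        acc += static_cast<std::int32_t>(a.at(i, k)) *
               static_cast<std::int32_t>(b.at(k, j));
      }
      float value = a_scales[i] * b_scales[j] * static_cast<float>(acc);
      if (bias) value += (*bias)[j];  // optional input: applied only if present
      out.at(i, j) = value;
    }
  }
}
```

Passing `std::nullopt` for `bias` skips the addition, mirroring how the optional tensor arguments in the table may be omitted.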

Usage Examples

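// Paged Attention V1
// `opts` is assumed to be a torch::TensorOptions matching the query's dtype and device.
torch::Tensor out = torch::empty({num_seqs, num_heads, head_size}, opts);
paged_attention_v1(out, query, key_cache, value_cache,
                   num_kv_heads, scale, block_tables, seq_lens,
                   block_size, max_seq_len, alibi_slopes,
                   "auto",  // kv_cache_dtype
                   k_scale, v_scale, tp_rank,
                   0, 0, 0, 0);  // block-sparse parameters, zeroed here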

// RMS Normalization
rms_norm(output, input, weight, 1e-6);

// CUTLASS Scaled Matrix Multiply
cutlass_scaled_mm(out, a, b, a_scales, b_scales, bias);

// FP8 Quantization
dynamic_scaled_fp8_quant(quantized_out, input, scale);

// SiLU and Mul activation
silu_and_mul(activated_out, gate_up_input);
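
For readers who want to see what the last two calls are expected to compute, the following CPU sketch gives reference semantics for one row. It rests on assumptions that are not stated in the header: that `silu_and_mul` treats the input as a gate half followed by an up half, that dynamic quantization derives a single per-tensor scale from the absolute maximum, and that the FP8 format is E4M3 with a maximum magnitude of 448. The function names are hypothetical, and the quantized values are kept as `float` instead of being rounded to 8 bits.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// silu(x) = x * sigmoid(x)
inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Assumed layout: input = [gate (d values) | up (d values)], out has d values.
void silu_and_mul_ref(std::vector<float>& out, const std::vector<float>& input) {
  assert(input.size() == 2 * out.size());
  const std::size_t d = out.size();
  for (std::size_t i = 0; i < d; ++i) {
    out[i] = silu(input[i]) * input[d + i];
  }
}

// Assumed maximum finite magnitude of FP8 E4M3.
constexpr float kFp8E4m3Max = 448.0f;

// Computes a per-tensor scale from the data and writes it to `scale`,
// which is why the scale argument is a mutable reference in the dynamic variant.
// `out` receives the scaled, clamped values (not rounded to FP8 in this sketch).
void dynamic_scaled_quant_ref(std::vector<float>& out,
                              const std::vector<float>& input, float& scale) {
  assert(out.size() == input.size());
  float abs_max = 0.0f;
  for (float x : input) abs_max = std::max(abs_max, std::fabs(x));
  scale = abs_max > 0.0f ? abs_max / kFp8E4m3Max : 1.0f;  // avoid dividing by zero
  for (std::size_t i = 0; i < input.size(); ++i) {
    out[i] = std::clamp(input[i] / scale, -kFp8E4m3Max, kFp8E4m3Max);
  }
}
```

The static variant differs in that the caller supplies the scale, which matches its `torch::Tensor const& scale` parameter in the signature list.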
