Implementation:Vllm project Vllm Prometheus Metrics Endpoint
| Knowledge Sources | |
|---|---|
| Domains | LLM Serving, Observability, Operations |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
A concrete tool, provided by the vllm library, for accessing Prometheus-format metrics from a vLLM inference server.
Description
vLLM exposes a /metrics HTTP endpoint that returns all server metrics in Prometheus text exposition format. These metrics are registered with the prometheus_client Python library and updated in real time by the engine's stat loggers.
For programmatic access within the same process, the get_metrics_snapshot() function in vllm/v1/metrics/reader.py provides a Python API that collects all vLLM-namespaced metrics from the Prometheus registry and returns them as typed dataclass objects (Counter, Gauge, Histogram).
The metrics cover the full lifecycle of request processing:
- Scheduler state: vllm:num_requests_running, vllm:num_requests_waiting
- Latency histograms: vllm:time_to_first_token_seconds, vllm:e2e_request_latency_seconds, vllm:inter_token_latency_seconds
- Resource utilization: vllm:kv_cache_usage_perc
- Request parameters: vllm:request_params_max_tokens
- Queue time: vllm:request_queue_time_seconds
All metrics use the vllm: namespace prefix and support per-engine labels for data-parallel deployments.
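Because every sample carries an engine label, a data-parallel deployment reports one series per engine under the same metric name. The following is a minimal sketch of collapsing those series into a single total. The `Sample` dataclass is a hypothetical stand-in for any object that exposes `name`, `labels`, and `value`, and the values are invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical stand-in for a metric sample (not a vLLM class)."""
    name: str
    labels: dict[str, str]
    value: float


def sum_by_metric(samples: list[Sample]) -> dict[str, float]:
    """Sum values per metric name, collapsing the per-engine label."""
    totals: dict[str, float] = defaultdict(float)
    for s in samples:
        totals[s.name] += s.value
    return dict(totals)


# Invented values for a two-engine deployment.
samples = [
    Sample("vllm:num_requests_waiting", {"engine": "0"}, 12.0),
    Sample("vllm:num_requests_waiting", {"engine": "1"}, 5.0),
    Sample("vllm:num_requests_running", {"engine": "0"}, 3.0),
]
totals = sum_by_metric(samples)
print(totals["vllm:num_requests_waiting"])  # 17.0
```

Summing suits count-like gauges such as queue depth. A ratio such as vllm:kv_cache_usage_perc is usually better inspected per engine, or reduced with a maximum, than summed.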
Usage
Use the /metrics endpoint for external monitoring by configuring Prometheus to scrape the vLLM server. Use get_metrics_snapshot() for in-process metrics access, such as building custom dashboards or autoscaling logic.
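As an illustration of the autoscaling use case, the sketch below turns a total queue depth into a replica count. The function name, the target of 10 waiting requests per replica, and the replica bounds are all hypothetical choices for this example; none of them come from vLLM.

```python
import math


def desired_replicas(total_waiting: float,
                     target_waiting_per_replica: float = 10.0,
                     min_replicas: int = 1,
                     max_replicas: int = 8) -> int:
    """Pick a replica count that keeps queue depth per replica near the target.

    All thresholds are illustrative assumptions, not vLLM defaults.
    """
    if target_waiting_per_replica <= 0:
        raise ValueError("target_waiting_per_replica must be positive")
    needed = math.ceil(total_waiting / target_waiting_per_replica)
    # Clamp to the allowed range so an empty queue never scales to zero.
    return max(min_replicas, min(max_replicas, needed))


# total_waiting would come from vllm:num_requests_waiting, summed over engines.
print(desired_replicas(45))   # 5
print(desired_replicas(0))    # 1
print(desired_replicas(500))  # 8
```

A production controller would typically smooth the input over several samples to avoid reacting to short bursts.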
Code Reference
Source Location
- Repository: vllm
- File: vllm/v1/metrics/reader.py (Lines 70-143)
- Metric definitions: vllm/v1/metrics/loggers.py (Lines 428-860+)
Signature
# HTTP endpoint (no import needed; accessed via HTTP GET)
# GET http://localhost:8000/metrics
# Returns: Prometheus text exposition format
# Python API for in-process access
def get_metrics_snapshot() -> list[Metric]:
    """Collect all vLLM-namespaced metrics from the Prometheus registry."""
    ...
Import
from vllm.v1.metrics.reader import get_metrics_snapshot, Counter, Gauge, Histogram
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| HTTP GET /metrics | HTTP request | N/A | No parameters required. The endpoint returns all registered metrics. |
| (no arguments) | N/A | N/A | get_metrics_snapshot() takes no arguments; it reads from the global Prometheus registry. |
Outputs
| Name | Type | Description |
|---|---|---|
| /metrics response | text/plain | Prometheus text exposition format with all vLLM metrics. |
| list[Metric] | list of Counter, Gauge, or Histogram | Typed Python objects from get_metrics_snapshot(). |
Key Metrics:
| Metric Name | Type | Description |
|---|---|---|
| vllm:num_requests_running | Gauge | Number of requests currently in model execution batches. |
| vllm:num_requests_waiting | Gauge | Number of requests waiting in the queue to be processed. |
| vllm:time_to_first_token_seconds | Histogram | Distribution of time from request arrival to first generated token. Buckets range from 0.001s to 2560s. |
| vllm:e2e_request_latency_seconds | Histogram | Distribution of total end-to-end request latency in seconds. |
| vllm:inter_token_latency_seconds | Histogram | Distribution of time between successive generated tokens. |
| vllm:kv_cache_usage_perc | Gauge | Fraction of KV-cache memory currently in use (0.0-1.0). |
| vllm:request_queue_time_seconds | Histogram | Distribution of time requests spend in the WAITING phase. |
| vllm:request_inference_time_seconds | Histogram | Distribution of time requests spend in the RUNNING phase. |
| vllm:request_params_max_tokens | Histogram | Distribution of the max_tokens request parameter values. |
Usage Examples
Scraping Metrics with curl
# Fetch all metrics from a running vLLM server
curl http://localhost:8000/metrics
# Example output (truncated):
# # HELP vllm:num_requests_running Number of requests in model execution batches.
# # TYPE vllm:num_requests_running gauge
# vllm:num_requests_running{engine="0",model_name="meta-llama/Llama-2-7b-chat-hf"} 3.0
# # HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# # TYPE vllm:num_requests_waiting gauge
# vllm:num_requests_waiting{engine="0",model_name="meta-llama/Llama-2-7b-chat-hf"} 12.0
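The text returned by the endpoint can also be consumed from a script. The sketch below is a deliberately small, standard-library-only parser that handles simple sample lines like the ones in the example output; `parse_vllm_samples` is a hypothetical helper name, and the parser does not cover the full exposition format. When the prometheus_client package is available, its `prometheus_client.parser.text_string_to_metric_families` function is the more robust choice.

```python
import re

# Matches lines such as: vllm:num_requests_running{engine="0"} 3.0
_SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
# Matches key="value" pairs inside the braces.
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_vllm_samples(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Return (name, labels, value) for every vllm:-prefixed sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = _SAMPLE_RE.match(line)
        if m is None or not m.group("name").startswith("vllm:"):
            continue
        labels = dict(_LABEL_RE.findall(m.group("labels") or ""))
        samples.append((m.group("name"), labels, float(m.group("value"))))
    return samples


# Sample text copied from the curl output above.
sample_text = """\
# HELP vllm:num_requests_running Number of requests in model execution batches.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{engine="0",model_name="meta-llama/Llama-2-7b-chat-hf"} 3.0
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{engine="0",model_name="meta-llama/Llama-2-7b-chat-hf"} 12.0
"""
parsed = parse_vllm_samples(sample_text)
for name, labels, value in parsed:
    print(name, labels["engine"], value)
```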
Prometheus Scrape Configuration
# prometheus.yml
scrape_configs:
  - job_name: "vllm"
    scrape_interval: 5s
    static_configs:
      - targets: ["localhost:8000"]
Python In-Process Metrics Access
from vllm.v1.metrics.reader import get_metrics_snapshot, Counter, Gauge, Histogram
# Collect a snapshot of all vLLM metrics
metrics = get_metrics_snapshot()
for metric in metrics:
    if isinstance(metric, (Gauge, Counter)):
        # Gauges and counters each carry a single value
        print(f"{metric.name} [{metric.labels}] = {metric.value}")
    elif isinstance(metric, Histogram):
        print(f"{metric.name} [{metric.labels}]")
        print(f"  count = {metric.count}")
        print(f"  sum = {metric.sum}")
        for bucket_le, count in metric.buckets.items():
            print(f"    le={bucket_le}: {count}")
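Histogram buckets can be reduced to an approximate percentile, such as a p95 time to first token. The sketch below uses linear interpolation within a bucket, similar in spirit to Prometheus's histogram_quantile. It assumes the bucket counts are cumulative and keyed by their upper bound as a string, as in the Prometheus exposition format; `estimate_quantile` is a hypothetical helper, and the bucket values are invented.

```python
import math


def estimate_quantile(q: float, buckets: dict[str, int]) -> float:
    """Estimate the q-quantile from cumulative bucket counts keyed by upper bound."""
    # Sort by numeric upper bound; float("+Inf") parses to infinity.
    bounds = sorted((float(le), count) for le, count in buckets.items())
    if not bounds or bounds[-1][1] == 0:
        return math.nan  # no observations recorded
    rank = q * bounds[-1][1]
    prev_bound, prev_count = 0.0, 0
    for upper, count in bounds:
        if count >= rank:
            if math.isinf(upper):
                return prev_bound  # cannot interpolate inside the +Inf bucket
            if count == prev_count:
                return upper
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Invented cumulative counts: 100 observations in total.
example_buckets = {"0.1": 10, "0.5": 60, "1.0": 90, "+Inf": 100}
p50 = estimate_quantile(0.50, example_buckets)  # about 0.42
p95 = estimate_quantile(0.95, example_buckets)  # 1.0 (falls in the +Inf bucket)
print(p50, p95)
```

The estimate is only as fine-grained as the bucket boundaries, so it should be read as an approximation rather than an exact percentile.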
Monitoring KV-Cache and Queue Depth
from vllm.v1.metrics.reader import get_metrics_snapshot, Gauge
def check_server_health():
    """Check if the server is under pressure."""
    metrics = get_metrics_snapshot()
    for metric in metrics:
        if isinstance(metric, Gauge):
            if metric.name == "vllm:kv_cache_usage_perc":
                if metric.value > 0.95:
                    print(f"WARNING: KV-cache usage at {metric.value:.1%}")
            elif metric.name == "vllm:num_requests_waiting":
                if metric.value > 50:
                    print(f"WARNING: {int(metric.value)} requests queued")