Implementation:Vllm project Vllm Prometheus Metrics Endpoint

Knowledge Sources	vLLM vLLM Docs Prometheus
Domains	LLM Serving, Observability, Operations
Last Updated	2026-02-08 13:00 GMT

Overview

A concrete tool, provided by the vllm library, for reading Prometheus-format metrics from a running vLLM inference server.

Description

vLLM exposes a /metrics HTTP endpoint that returns all server metrics in Prometheus text exposition format. These metrics are registered with the prometheus_client Python library and updated in real time by the engine's stat loggers.

For programmatic access within the same process, the get_metrics_snapshot() function in vllm/v1/metrics/reader.py provides a Python API that collects all vLLM-namespaced metrics from the Prometheus registry and returns them as typed dataclass objects (Counter, Gauge, Histogram).

The metrics cover the full lifecycle of request processing:

Scheduler state: vllm:num_requests_running, vllm:num_requests_waiting
Latency histograms: vllm:time_to_first_token_seconds, vllm:e2e_request_latency_seconds, vllm:inter_token_latency_seconds
Resource utilization: vllm:kv_cache_usage_perc
Request parameters: vllm:request_params_max_tokens
Queue time: vllm:request_queue_time_seconds

All metrics use the vllm: namespace prefix and support per-engine labels for data-parallel deployments.
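As a rough illustration of the per-engine labels, the sketch below totals one metric across engines and keeps the per-engine breakdown. The `(name, labels, value)` tuple layout, the helper name `total_across_engines`, and the sample values are assumptions made for this example; they are not part of the vLLM API.

```python
from collections import defaultdict


def total_across_engines(samples, metric_name):
    """Sum one metric over all engines and return (total, per_engine).

    `samples` is an iterable of (name, labels, value) tuples. This layout is
    an assumption for illustration, not a vLLM data structure.
    """
    per_engine = defaultdict(float)
    for name, labels, value in samples:
        if name == metric_name:
            # Samples without an "engine" label are grouped under "unknown".
            per_engine[labels.get("engine", "unknown")] += value
    return sum(per_engine.values()), dict(per_engine)


# Hypothetical samples from a two-engine data-parallel deployment.
example_samples = [
    ("vllm:num_requests_waiting", {"engine": "0"}, 12.0),
    ("vllm:num_requests_waiting", {"engine": "1"}, 3.0),
    ("vllm:num_requests_running", {"engine": "0"}, 4.0),
]
total_waiting, waiting_by_engine = total_across_engines(
    example_samples, "vllm:num_requests_waiting"
)
```

Summing suits count-like gauges such as vllm:num_requests_waiting. For a ratio such as vllm:kv_cache_usage_perc, a maximum or an average across engines is usually more meaningful than a sum.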

Usage

Use the /metrics endpoint for external monitoring by configuring Prometheus to scrape the vLLM server. Use get_metrics_snapshot() for in-process metrics access, such as building custom dashboards or autoscaling logic.
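To make the autoscaling use case more concrete, here is a minimal sketch of a scaling decision driven by queue depth and KV-cache usage. The function name `desired_replicas`, every threshold, and the replica limits are hypothetical values chosen for illustration. A real policy would need tuning against the actual workload, and the inputs would come from whichever metrics source is in use.

```python
def desired_replicas(
    current,
    waiting,
    kv_usage,
    *,
    max_waiting_per_replica=20,  # hypothetical threshold
    kv_high=0.9,                 # hypothetical threshold
    kv_low=0.3,                  # hypothetical threshold
    min_replicas=1,
    max_replicas=8,
):
    """Return a target replica count, moving at most one step per call."""
    if current <= 0:
        raise ValueError("current must be a positive replica count")
    if kv_usage >= kv_high or waiting > max_waiting_per_replica * current:
        target = current + 1  # under pressure: scale out
    elif kv_usage <= kv_low and waiting == 0:
        target = current - 1  # idle: scale in
    else:
        target = current      # within the comfortable band
    # Clamp the result to the configured replica limits.
    return max(min_replicas, min(max_replicas, target))
```

Moving one step at a time keeps the policy from oscillating when a metric briefly spikes between scrapes.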

Code Reference

Source Location

Repository: vllm
File: vllm/v1/metrics/reader.py (Lines 70-143)
Metric definitions: vllm/v1/metrics/loggers.py (Lines 428-860+)

Signature

# HTTP endpoint (no import needed; accessed via HTTP GET)
# GET http://localhost:8000/metrics
# Returns: Prometheus text exposition format

# Python API for in-process access
def get_metrics_snapshot() -> list[Metric]:
    """Collect all vLLM-namespaced metrics from the Prometheus registry."""
    ...

Import

from vllm.v1.metrics.reader import get_metrics_snapshot, Counter, Gauge, Histogram

I/O Contract

Inputs

Name	Type	Required	Description
HTTP GET /metrics	`HTTP request`	N/A	No parameters required. The endpoint returns all registered metrics.
(no arguments)	N/A	N/A	`get_metrics_snapshot()` takes no arguments; it reads from the global Prometheus registry.

Outputs

Name	Type	Description
/metrics response	`text/plain`	Prometheus text exposition format with all vLLM metrics.
list[Metric]	`list[Counter | Gauge | Histogram]`	Typed Python objects returned by `get_metrics_snapshot()`.

Key Metrics:

Metric Name	Type	Description
`vllm:num_requests_running`	Gauge	Number of requests currently in model execution batches.
`vllm:num_requests_waiting`	Gauge	Number of requests waiting in the queue to be processed.
`vllm:time_to_first_token_seconds`	Histogram	Distribution of time from request arrival to first generated token. Buckets range from 0.001s to 2560s.
`vllm:e2e_request_latency_seconds`	Histogram	Distribution of total end-to-end request latency in seconds.
`vllm:inter_token_latency_seconds`	Histogram	Distribution of time between successive generated tokens.
`vllm:kv_cache_usage_perc`	Gauge	Fraction of KV-cache memory currently in use (0.0-1.0).
`vllm:request_queue_time_seconds`	Histogram	Distribution of time requests spend in the WAITING phase.
`vllm:request_inference_time_seconds`	Histogram	Distribution of time requests spend in the RUNNING phase.
`vllm:request_params_max_tokens`	Histogram	Distribution of the max_tokens request parameter values.
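The histogram metrics above expose bucket counts rather than raw observations, so percentiles have to be estimated. The sketch below interpolates linearly within a bucket, in the spirit of PromQL's `histogram_quantile`. It assumes the buckets arrive as a mapping from the upper bound (a string such as "0.5" or "+Inf") to a cumulative count, which is how Prometheus exposes them. The helper name `estimate_quantile` and the sample counts are made up for this example.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative bucket counts."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    # Sort buckets by numeric upper bound; float("+Inf") parses to infinity.
    bounds = sorted((float(le), count) for le, count in buckets.items())
    if not bounds or bounds[-1][1] == 0:
        return math.nan  # no observations recorded yet
    rank = q * bounds[-1][1]
    prev_bound, prev_count = 0.0, 0
    for bound, count in bounds:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the overflow bucket, so the best
                # available answer is the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for a latency histogram.
example_buckets = {"0.1": 10, "0.5": 60, "1.0": 90, "+Inf": 100}
median_estimate = estimate_quantile(0.5, example_buckets)
```

The estimate is only as precise as the bucket layout: within a bucket the observations are assumed to be evenly spread, so wide buckets give coarse percentiles.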

Usage Examples

Scraping Metrics with curl

# Fetch all metrics from a running vLLM server
curl http://localhost:8000/metrics

# Example output (truncated):
# # HELP vllm:num_requests_running Number of requests in model execution batches.
# # TYPE vllm:num_requests_running gauge
# vllm:num_requests_running{engine="0",model_name="meta-llama/Llama-2-7b-chat-hf"} 3.0
# # HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# # TYPE vllm:num_requests_waiting gauge
# vllm:num_requests_waiting{engine="0",model_name="meta-llama/Llama-2-7b-chat-hf"} 12.0
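
The same output can be read from a script. The sketch below uses only the Python standard library to fetch the endpoint and split each sample line into a name, a label dictionary, and a value. The helper names `parse_samples` and `fetch_samples` are hypothetical, and the parser is deliberately simplified: it skips lines that carry a timestamp and leaves escape sequences inside label values untouched. Where the prometheus_client package is available, its own text parser is likely the more robust choice.

```python
import re
from urllib.request import urlopen

# One sample line: metric name, optional {labels}, then the value.
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
# One label pair inside the braces: key="value".
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples for each sample line in `text`."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # unsupported form, e.g. a line with a timestamp
        name, raw_labels, raw_value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


def fetch_samples(url="http://localhost:8000/metrics", timeout=5.0):
    """Fetch the endpoint over HTTP and parse the response body."""
    with urlopen(url, timeout=timeout) as response:
        return parse_samples(response.read().decode("utf-8"))
```

Calling `fetch_samples()` against a running server returns every sample, including the `_bucket`, `_sum`, and `_count` series that make up each histogram.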

Prometheus Scrape Configuration

# prometheus.yml
scrape_configs:
  - job_name: "vllm"
    scrape_interval: 5s
    static_configs:
      - targets: ["localhost:8000"]

Python In-Process Metrics Access

from vllm.v1.metrics.reader import get_metrics_snapshot, Counter, Gauge, Histogram

# Collect a snapshot of all vLLM metrics
metrics = get_metrics_snapshot()

for metric in metrics:
    # Gauges and counters both carry a single scalar value.
    if isinstance(metric, (Gauge, Counter)):
        print(f"{metric.name} [{metric.labels}] = {metric.value}")
    elif isinstance(metric, Histogram):
        print(f"{metric.name} [{metric.labels}]")
        print(f"    count = {metric.count}")
        print(f"    sum = {metric.sum}")
        for bucket_le, count in metric.buckets.items():
            print(f"    le={bucket_le}: {count}")
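
A histogram's count and sum accumulate from process start, so the mean over a recent window comes from the difference between two snapshots rather than from a single reading. The sketch below shows that arithmetic on plain `(sum, count)` pairs. The helper name `windowed_mean` and the sample numbers are assumptions for illustration.

```python
def windowed_mean(previous, current):
    """Mean observation between two snapshots of one histogram.

    Each argument is a (sum, count) pair taken from the same histogram at
    two points in time. Returns None when no new observations arrived or
    when the counters went backwards, for example after a restart.
    """
    delta_sum = current[0] - previous[0]
    delta_count = current[1] - previous[1]
    if delta_count <= 0:
        return None
    return delta_sum / delta_count


# Hypothetical readings taken a few seconds apart.
earlier = (10.0, 100)  # 100 observations totalling 10.0 seconds
later = (16.0, 120)    # 20 more observations adding 6.0 seconds
recent_mean = windowed_mean(earlier, later)
```

With the readings above, the 20 new observations add 6.0 seconds in total, giving a recent mean of about 0.3 seconds, while the lifetime mean from the later snapshot alone would be lower.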

Monitoring KV-Cache and Queue Depth

from vllm.v1.metrics.reader import get_metrics_snapshot, Gauge

def check_server_health():
    """Check if the server is under pressure."""
    metrics = get_metrics_snapshot()

    for metric in metrics:
        if isinstance(metric, Gauge):
            if metric.name == "vllm:kv_cache_usage_perc":
                if metric.value > 0.95:
                    print(f"WARNING: KV-cache usage at {metric.value:.1%}")
            elif metric.name == "vllm:num_requests_waiting":
                if metric.value > 50:
                    print(f"WARNING: {int(metric.value)} requests queued")
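
A variant of the check above that returns its findings instead of printing them is easier to unit-test and to wire into alerting. The sketch below takes the metrics list as a parameter and matches on the `name` and `value` attributes only, so it can be exercised with stand-in objects when no engine is running. The function name `find_pressure`, its default limits, and the `SimpleNamespace` stand-ins are assumptions for this example.

```python
from types import SimpleNamespace


def find_pressure(metrics, kv_limit=0.95, queue_limit=50):
    """Return a list of warning strings for metrics that exceed their limits."""
    warnings = []
    for metric in metrics:
        name = getattr(metric, "name", None)
        value = getattr(metric, "value", None)
        if value is None:
            continue  # histograms and other types without a scalar value
        if name == "vllm:kv_cache_usage_perc" and value > kv_limit:
            warnings.append(f"KV-cache usage at {value:.1%}")
        elif name == "vllm:num_requests_waiting" and value > queue_limit:
            warnings.append(f"{int(value)} requests queued")
    return warnings


# Stand-in objects that mimic the attributes read above.
fake_metrics = [
    SimpleNamespace(name="vllm:kv_cache_usage_perc", value=0.97),
    SimpleNamespace(name="vllm:num_requests_waiting", value=12.0),
]
pressure_warnings = find_pressure(fake_metrics)
```

In a live process, the list returned by get_metrics_snapshot() can be passed in place of the stand-ins.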

Related Pages

Implements Principle

Principle:Vllm_project_Vllm_Server_Metrics_Monitoring
