Implementation:Vllm project Vllm Prometheus Metrics Endpoint

From Leeroopedia


Knowledge Sources
Domains: LLM Serving, Observability, Operations
Last Updated: 2026-02-08 13:00 GMT

Overview

A concrete tool, provided by the vllm library, for accessing Prometheus-format metrics from a vLLM inference server.

Description

vLLM exposes a /metrics HTTP endpoint that returns all server metrics in Prometheus text exposition format. These metrics are registered with the prometheus_client Python library and updated in real time by the engine's stat loggers.
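
To make the response format concrete, the following is a minimal sketch that parses a few lines of Prometheus text exposition output using only the standard library. The helper name parse_samples and the sample text are illustrative assumptions, not part of vLLM. The sketch handles only simple label values (no escaped quotes), so a dedicated parser such as the one shipped with prometheus_client is likely the more robust choice in practice.

```python
import re

# Matches sample lines such as: name{label="v",other="w"} 3.0
_SAMPLE_RE = re.compile(r'^([^\s{]+)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a (name, labels, value) tuple for each sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


# Illustrative text in the shape returned by GET /metrics.
SAMPLE_TEXT = """\
# HELP vllm:num_requests_running Number of requests in model execution batches.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{engine="0",model_name="meta-llama/Llama-2-7b-chat-hf"} 3.0
vllm:num_requests_waiting{engine="0",model_name="meta-llama/Llama-2-7b-chat-hf"} 12.0
"""

for name, labels, value in parse_samples(SAMPLE_TEXT):
    print(name, labels["engine"], value)
```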

For programmatic access within the same process, the get_metrics_snapshot() function in vllm/v1/metrics/reader.py provides a Python API that collects all vLLM-namespaced metrics from the Prometheus registry and returns them as typed dataclass objects (Counter, Gauge, Histogram).

The metrics cover the full lifecycle of request processing:

  • Scheduler state: vllm:num_requests_running, vllm:num_requests_waiting
  • Latency histograms: vllm:time_to_first_token_seconds, vllm:e2e_request_latency_seconds, vllm:inter_token_latency_seconds
  • Resource utilization: vllm:kv_cache_usage_perc
  • Request parameters: vllm:request_params_max_tokens
  • Queue time: vllm:request_queue_time_seconds

All metrics use the vllm: namespace prefix and support per-engine labels for data-parallel deployments.
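
Because each engine reports its own labelled series, a consumer of these metrics usually has to combine them. The sketch below sums one metric across engines. The (name, labels, value) tuples and the label key engine mirror the sample output on this page; the helper name sum_across_engines and the numbers are made up for illustration.

```python
from collections import defaultdict


def sum_across_engines(samples, metric_name):
    """Sum one metric over all engines, grouped by the remaining labels."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        if name != metric_name:
            continue
        # Drop the per-engine label so series from different engines merge.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != "engine"))
        totals[key] += value
    return dict(totals)


# Hypothetical samples from a two-engine deployment.
samples = [
    ("vllm:num_requests_waiting", {"engine": "0", "model_name": "m"}, 12.0),
    ("vllm:num_requests_waiting", {"engine": "1", "model_name": "m"}, 5.0),
    ("vllm:num_requests_running", {"engine": "0", "model_name": "m"}, 3.0),
]

print(sum_across_engines(samples, "vllm:num_requests_waiting"))
```

Summing suits count-like gauges such as queue depth. For a fraction such as vllm:kv_cache_usage_perc, a maximum or mean across engines is probably more meaningful than a sum.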

Usage

Use the /metrics endpoint for external monitoring by configuring Prometheus to scrape the vLLM server. Use get_metrics_snapshot() for in-process metrics access, such as building custom dashboards or autoscaling logic.
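
As one possible shape for the autoscaling logic mentioned above, the sketch below turns two load signals into a replica count. The function name desired_replicas and every threshold are hypothetical defaults chosen for illustration, not vLLM recommendations.

```python
def desired_replicas(current, kv_cache_usage, num_waiting,
                     max_replicas=8, kv_high=0.9, queue_high=20):
    """Return a replica count given one snapshot of load signals.

    All thresholds are hypothetical defaults, not vLLM recommendations.
    """
    overloaded = kv_cache_usage > kv_high or num_waiting > queue_high
    idle = kv_cache_usage < 0.3 and num_waiting == 0
    if overloaded:
        # Scale up by one step, capped at the configured maximum.
        return min(current + 1, max_replicas)
    if idle:
        # Scale down by one step, but always keep one replica.
        return max(current - 1, 1)
    return current


print(desired_replicas(current=2, kv_cache_usage=0.95, num_waiting=4))
print(desired_replicas(current=2, kv_cache_usage=0.10, num_waiting=0))
print(desired_replicas(current=2, kv_cache_usage=0.50, num_waiting=3))
```

In a real deployment the two inputs would come from the vllm:kv_cache_usage_perc and vllm:num_requests_waiting gauges, and a production policy would likely smooth them over several snapshots before acting.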

Code Reference

Source Location

  • Repository: vllm
  • File: vllm/v1/metrics/reader.py (Lines 70-143)
  • Metric definitions: vllm/v1/metrics/loggers.py (Lines 428-860+)

Signature

# HTTP endpoint (no import needed; accessed via HTTP GET)
# GET http://localhost:8000/metrics
# Returns: Prometheus text exposition format

# Python API for in-process access
def get_metrics_snapshot() -> list[Metric]:
    """Collect all vLLM-namespaced metrics from the Prometheus registry."""
    ...

Import

from vllm.v1.metrics.reader import get_metrics_snapshot, Counter, Gauge, Histogram

I/O Contract

Inputs

Name | Type | Required | Description
HTTP GET /metrics | HTTP request | N/A | No parameters required. The endpoint returns all registered metrics.
(no arguments) | N/A | N/A | get_metrics_snapshot() takes no arguments; it reads from the global Prometheus registry.

Outputs

Name | Type | Description
/metrics response | text/plain | Prometheus text exposition format with all vLLM metrics.
list[Metric] | list[Counter|Gauge|Histogram] | Typed Python objects returned by get_metrics_snapshot().

Key Metrics:

Metric Name | Type | Description
vllm:num_requests_running | Gauge | Number of requests currently in model execution batches.
vllm:num_requests_waiting | Gauge | Number of requests waiting in the queue to be processed.
vllm:time_to_first_token_seconds | Histogram | Distribution of time from request arrival to first generated token. Buckets range from 0.001s to 2560s.
vllm:e2e_request_latency_seconds | Histogram | Distribution of total end-to-end request latency in seconds.
vllm:inter_token_latency_seconds | Histogram | Distribution of time between successive generated tokens.
vllm:kv_cache_usage_perc | Gauge | Fraction of KV-cache memory currently in use (0.0-1.0).
vllm:request_queue_time_seconds | Histogram | Distribution of time requests spend in the WAITING phase.
vllm:request_inference_time_seconds | Histogram | Distribution of time requests spend in the RUNNING phase.
vllm:request_params_max_tokens | Histogram | Distribution of the max_tokens request parameter values.
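
Histogram metrics such as the latency distributions above are typically summarized as a mean or a percentile. The sketch below estimates a quantile by linear interpolation within buckets, similar in spirit to how Prometheus estimates quantiles from histograms. It assumes the buckets mapping holds cumulative counts keyed by upper bound (le), as in the Prometheus exposition format; the helper name estimate_quantile and the bucket values are hypothetical.

```python
def estimate_quantile(buckets, q):
    """Estimate quantile q (0..1) from cumulative bucket counts.

    `buckets` maps each upper bound ("le", as a string) to the cumulative
    number of observations at or below it, including "+Inf".
    """
    ordered = sorted(buckets.items(), key=lambda kv: float(kv[0]))
    total = ordered[-1][1]
    if total == 0:
        return None
    target = q * total
    prev_bound, prev_count = 0.0, 0
    for le, count in ordered:
        bound = float(le)
        if count >= target:
            if bound == float("inf"):
                # Cannot interpolate into the open-ended bucket.
                return prev_bound
            # Linear interpolation inside the matching bucket.
            span = count - prev_count
            frac = (target - prev_count) / span if span else 0.0
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical histogram: cumulative counts, sum, and total count.
buckets = {"0.1": 10, "0.5": 60, "1.0": 90, "+Inf": 100}
hist_sum, hist_count = 45.0, 100

print("mean:", hist_sum / hist_count)
print("p50 estimate:", round(estimate_quantile(buckets, 0.5), 3))
print("p95 estimate:", estimate_quantile(buckets, 0.95))
```

The estimate is only as fine as the bucket boundaries, and a quantile that falls in the +Inf bucket is reported as the highest finite bound.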

Usage Examples

Scraping Metrics with curl

# Fetch all metrics from a running vLLM server
curl http://localhost:8000/metrics

# Example output (truncated):
# # HELP vllm:num_requests_running Number of requests in model execution batches.
# # TYPE vllm:num_requests_running gauge
# vllm:num_requests_running{engine="0",model_name="meta-llama/Llama-2-7b-chat-hf"} 3.0
# # HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# # TYPE vllm:num_requests_waiting gauge
# vllm:num_requests_waiting{engine="0",model_name="meta-llama/Llama-2-7b-chat-hf"} 12.0

Prometheus Scrape Configuration

# prometheus.yml
scrape_configs:
  - job_name: "vllm"
    scrape_interval: 5s
    static_configs:
      - targets: ["localhost:8000"]

Python In-Process Metrics Access

from vllm.v1.metrics.reader import get_metrics_snapshot, Counter, Gauge, Histogram

# Collect a snapshot of all vLLM metrics
metrics = get_metrics_snapshot()

for metric in metrics:
    if isinstance(metric, Gauge):
        print(f"{metric.name} [{metric.labels}] = {metric.value}")
    elif isinstance(metric, Counter):
        print(f"{metric.name} [{metric.labels}] = {metric.value}")
    elif isinstance(metric, Histogram):
        print(f"{metric.name} [{metric.labels}]")
        print(f"    count = {metric.count}")
        print(f"    sum = {metric.sum}")
        for bucket_le, count in metric.buckets.items():
            print(f"    le={bucket_le}: {count}")

Monitoring KV-Cache and Queue Depth

from vllm.v1.metrics.reader import get_metrics_snapshot, Gauge

def check_server_health():
    """Check if the server is under pressure."""
    metrics = get_metrics_snapshot()

    for metric in metrics:
        if isinstance(metric, Gauge):
            if metric.name == "vllm:kv_cache_usage_perc":
                if metric.value > 0.95:
                    print(f"WARNING: KV-cache usage at {metric.value:.1%}")
            elif metric.name == "vllm:num_requests_waiting":
                if metric.value > 50:
                    print(f"WARNING: {int(metric.value)} requests queued")

Related Pages

Implements Principle
