Principle:Vllm project Vllm Server Metrics Monitoring

Knowledge Sources	vLLM vLLM Docs Prometheus
Domains	LLM Serving, Observability, Operations
Last Updated	2026-02-08 13:00 GMT

Overview

Server metrics monitoring is the practice of collecting, exposing, and analyzing quantitative measurements of an inference server's health, performance, and resource utilization in real time.

Description

Operating an LLM inference server in production requires continuous visibility into how the system is performing. Without metrics, operators cannot distinguish between a healthy server handling its expected load and one that is silently degrading due to memory pressure, queue buildup, or hardware issues.

A comprehensive metrics system for LLM serving captures several categories of information:

Request-level metrics: How many requests are currently running, waiting in the queue, or have completed. These reveal whether the server is keeping up with demand.
Latency metrics: Time-to-first-token (TTFT), inter-token latency, end-to-end request latency, and queue wait time. These directly impact user experience.
Throughput metrics: Prompt tokens processed per second and generation tokens produced per second. These indicate the overall efficiency of the deployment.
Resource utilization: KV-cache usage percentage, GPU memory utilization, and batch sizes. These help operators tune configuration and plan capacity.
Error metrics: Counts of failed or corrupted requests, preemptions, and other anomalies.
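As a rough illustration of how the latency metrics above relate to one another, the sketch below derives them from a single request's timestamps. The RequestTimeline class and its field names are hypothetical and are not part of vLLM; they only assume that an arrival time, a scheduling time, and per-token emission times are available on one clock.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestTimeline:
    """Hypothetical per-request timestamps, in seconds on a single clock."""
    arrival: float            # request received by the server
    scheduled: float          # request left the waiting queue
    token_times: List[float]  # emission time of each generated token


def latency_breakdown(t: RequestTimeline) -> Dict[str, float]:
    """Derive queue wait, TTFT, mean inter-token latency, and end-to-end latency."""
    if not t.token_times:
        raise ValueError("request produced no tokens")
    # Gaps between successive tokens; empty when only one token was generated.
    gaps = [later - earlier
            for earlier, later in zip(t.token_times, t.token_times[1:])]
    return {
        "queue_wait": t.scheduled - t.arrival,
        "ttft": t.token_times[0] - t.arrival,
        "mean_inter_token": sum(gaps) / len(gaps) if gaps else 0.0,
        "e2e": t.token_times[-1] - t.arrival,
    }


timeline = RequestTimeline(arrival=0.0, scheduled=0.25,
                           token_times=[0.75, 1.0, 1.25, 1.5])
for name, seconds in latency_breakdown(timeline).items():
    print(f"{name}: {seconds:.3f}s")
```

Note that queue wait is contained in both time-to-first-token and end-to-end latency here, because all three are measured from request arrival.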

Metrics are typically exposed in Prometheus format via a /metrics HTTP endpoint, enabling integration with standard monitoring stacks (Prometheus + Grafana). The three Prometheus metric types used are:

Gauges: Current values that can go up or down (e.g., number of running requests).
Counters: Monotonically increasing totals (e.g., total requests completed).
Histograms: Distributions of observed values, counted into configurable buckets (e.g., request latency, from which percentiles are estimated).
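To make the three metric types concrete, the following sketch parses a small hand-written sample of Prometheus text exposition output using only the Python standard library. The metric names and values in SAMPLE are illustrative, modeled on a vllm: prefix, and may not match what a given server version actually exports; check the server's own /metrics output. The parser is deliberately simplified and assumes no timestamps and no escaped quotes in label values.

```python
import re
from typing import Dict, List, Tuple

# Illustrative sample only; real metric names and labels may differ.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 3.0
# TYPE vllm:request_success_total counter
vllm:request_success_total{model_name="demo"} 128.0
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_bucket{model_name="demo",le="0.1"} 40.0
vllm:time_to_first_token_seconds_bucket{model_name="demo",le="0.5"} 110.0
vllm:time_to_first_token_seconds_bucket{model_name="demo",le="+Inf"} 128.0
vllm:time_to_first_token_seconds_sum{model_name="demo"} 31.5
vllm:time_to_first_token_seconds_count{model_name="demo"} 128.0
"""

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

Sample = Tuple[str, Dict[str, str], float]


def parse_exposition(text: str) -> Tuple[Dict[str, str], List[Sample]]:
    """Return ({metric: type}, [(sample_name, labels, value), ...])."""
    types: Dict[str, str] = {}
    samples: List[Sample] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# TYPE"):
            # Format: "# TYPE <metric_name> <gauge|counter|histogram|...>"
            _, _, name, kind = line.split(None, 3)
            types[name] = kind
            continue
        if line.startswith("#"):
            continue  # HELP lines and other comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified parser cannot read
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return types, samples


types, samples = parse_exposition(SAMPLE)
for name, kind in types.items():
    print(f"{kind:<10}{name}")
```

A histogram appears as several series: one cumulative _bucket series per upper bound (the le label), plus _sum and _count.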

Usage

Use server metrics monitoring when:

Deploying vLLM in production and needing SLA compliance visibility.
Diagnosing performance issues such as high latency, low throughput, or memory pressure.
Making scaling decisions (adding more GPUs, adjusting batch sizes, or tuning KV-cache allocation).
Setting up alerting rules (e.g., alert when queue depth exceeds a threshold or TTFT exceeds a target).
Comparing the impact of configuration changes (quantization, tensor parallelism, sequence length limits).
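The alerting use case can be sketched in a few lines. In practice this logic would normally live in the monitoring stack's own alerting rules rather than in application code; the function below is a hypothetical helper that only illustrates the idea of firing when a threshold is exceeded for several consecutive scrapes, so that a single spike does not page anyone.

```python
from typing import Sequence


def sustained_breach(values: Sequence[float], threshold: float,
                     min_consecutive: int) -> bool:
    """Return True if the latest `min_consecutive` samples all exceed `threshold`."""
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    recent = values[-min_consecutive:]
    # Not enough history yet means the alert cannot fire.
    return len(recent) == min_consecutive and all(v > threshold for v in recent)


# Queue depth sampled at successive scrapes (made-up numbers).
queue_depth = [2, 12, 15, 11]
print(sustained_breach(queue_depth, threshold=10, min_consecutive=3))
```

The threshold and the number of consecutive samples are assumptions to tune per deployment; they are not values recommended by the source.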

Theoretical Basis

Metrics monitoring for inference servers draws on established observability principles:

RED method: Rate (requests per second), Errors (failed requests), and Duration (latency distribution). These three signals give a compact first-pass view of the health of a request-driven service.
USE method: Utilization (KV-cache usage, GPU memory), Saturation (queue depth), and Errors. This complements RED by focusing on resource consumption.
Histogram quantiles: Rather than tracking only average latency (which hides outliers), histograms record the full distribution. Prometheus histograms use pre-configured bucket boundaries to compute approximate percentiles (p50, p95, p99) at query time.
Time-to-first-token (TTFT): In autoregressive LLM serving, TTFT measures the time from request arrival to the first generated token. When the queue is short, this is dominated by the prefill phase (processing the input prompt through all transformer layers); under load, it also includes the time the request spends waiting to be scheduled. TTFT is distinct from inter-token latency (time between successive tokens during decoding) and end-to-end latency (total time from request to final token).
Continuous batching effects: Because vLLM dynamically adds and removes requests from the running batch, metrics like "number of running requests" fluctuate at each scheduling step. Monitoring these at sufficient resolution reveals batching efficiency.
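The histogram quantile idea above can be sketched as follows. This is a minimal approximation in the spirit of Prometheus's histogram_quantile function: it finds the bucket containing the target rank and interpolates linearly inside it, assuming observations are spread evenly within each bucket and that the lowest bucket starts at zero. The bucket boundaries and counts are made-up example data.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The bucket list must include a final +Inf bucket holding the total count.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    ordered = sorted(buckets)
    if not ordered or not math.isinf(ordered[-1][0]):
        raise ValueError("buckets must end with a +Inf upper bound")
    total = ordered[-1][1]
    if total == 0:
        return math.nan  # no observations recorded
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls past the last finite bucket: report its upper bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound  # unreachable when the +Inf bucket holds the total


# Made-up TTFT buckets in seconds: (upper bound, cumulative count).
ttft_buckets = [(0.1, 40), (0.5, 110), (1.0, 126), (math.inf, 128)]
for q in (0.5, 0.95, 0.99):
    print(f"p{int(q * 100)} ~ {estimate_quantile(q, ttft_buckets):.3f}s")
```

Because the estimate can never be finer than the bucket boundaries, buckets should be chosen so that latency targets fall near a boundary. Any quantile that lands in the +Inf bucket is reported as the highest finite bound, which understates the true tail.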

Related Pages

Implemented By

Implementation:Vllm_project_Vllm_Prometheus_Metrics_Endpoint
