Implementation:Vllm project Vllm LLM Get Metrics
| Knowledge Sources | |
|---|---|
| Domains | LLM Inference, Speculative Decoding, Observability |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
A concrete tool, provided by vLLM, for retrieving speculative decoding performance metrics from the vLLM engine's Prometheus registry.
Description
The LLM.get_metrics() method returns a snapshot of all vLLM-namespaced Prometheus metrics as a list of typed dataclass instances. For speculative decoding, the relevant metrics are Counter objects (for total draft counts and accepted token counts) and Vector objects (for per-position acceptance counts). The method delegates to get_metrics_snapshot() in vllm/v1/metrics/reader.py, which iterates over the Prometheus REGISTRY, filters for metrics prefixed with vllm:, and converts raw Prometheus samples into typed Python dataclasses.
The metric types are:
- Counter: A monotonically increasing integer counter with a value field
- Vector: An ordered array of integer counters with a values field (used specifically for per-position acceptance counts)
- Gauge: A floating-point value that can increase or decrease
- Histogram: Bucketed observations with count, sum, and buckets fields
This method is only available with the V1 LLM engine.
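The four metric types above can be told apart with plain isinstance checks. The following is a minimal sketch, not vLLM's own code: the dataclasses are local stand-ins that mirror the definitions in vllm/v1/metrics/reader.py so the example runs without vLLM installed, and the describe helper, the vllm:example_gauge name, and the sample values are hypothetical.

```python
from dataclasses import dataclass


# Local stand-ins mirroring the dataclasses in vllm/v1/metrics/reader.py.
@dataclass
class Metric:
    name: str
    labels: dict[str, str]


@dataclass
class Counter(Metric):
    value: int


@dataclass
class Vector(Metric):
    values: list[int]


@dataclass
class Gauge(Metric):
    value: float


@dataclass
class Histogram(Metric):
    count: int
    sum: float
    buckets: dict[str, int]


def describe(metric: Metric) -> str:
    """Render one metric as a single line, dispatching on its type.

    `describe` is a hypothetical helper, not part of vLLM.
    """
    if isinstance(metric, Counter):
        return f"{metric.name} (counter) = {metric.value}"
    if isinstance(metric, Vector):
        return f"{metric.name} (vector) = {metric.values}"
    if isinstance(metric, Gauge):
        return f"{metric.name} (gauge) = {metric.value}"
    if isinstance(metric, Histogram):
        return f"{metric.name} (histogram) count={metric.count} sum={metric.sum}"
    return f"{metric.name} (unknown type)"


# Made-up sample values, for illustration only.
snapshot = [
    Counter("vllm:spec_decode_num_drafts", {}, 10),
    Vector("vllm:spec_decode_num_accepted_tokens_per_pos", {}, [9, 7, 4]),
    Gauge("vllm:example_gauge", {}, 0.5),  # hypothetical metric name
]
for metric in snapshot:
    print(describe(metric))
```

Counter and Gauge both expose a value field, so the type check, rather than attribute presence, is what distinguishes them.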
Usage
Use this method after running speculative inference to evaluate the performance of the speculation strategy. The metrics enable computing acceptance rates, mean acceptance length, and per-position acceptance profiles that guide tuning of num_speculative_tokens and method selection.
Code Reference
Source Location
- Repository: vllm
- File: vllm/entrypoints/llm.py:L1651-1656 (LLM.get_metrics), vllm/v1/metrics/reader.py:L11-143 (metric types and get_metrics_snapshot)
Signature
# LLM.get_metrics() in vllm/entrypoints/llm.py
def get_metrics(self) -> list[Metric]:
    """Return a snapshot of aggregated metrics from Prometheus.

    Returns:
        A list of Metric instances capturing the current state
        of all aggregated metrics from Prometheus.

    Note:
        This method is only available with the V1 LLM engine.
    """
    return self.llm_engine.get_metrics()

# Metric base class and subclasses in vllm/v1/metrics/reader.py
@dataclass
class Metric:
    name: str
    labels: dict[str, str]

@dataclass
class Counter(Metric):
    value: int

@dataclass
class Vector(Metric):
    values: list[int]

@dataclass
class Gauge(Metric):
    value: float

@dataclass
class Histogram(Metric):
    count: int
    sum: float
    buckets: dict[str, int]
Import
from vllm import LLM
from vllm.v1.metrics.reader import (
    get_metrics_snapshot,
    Counter,
    Vector,
    Gauge,
    Histogram,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| (none) | | | This method takes no arguments. It reads from the in-process Prometheus registry. |
Outputs
| Name | Type | Description |
|---|---|---|
| metrics | list[Metric] | A list of metric objects. Each is one of Counter, Vector, Gauge, or Histogram. All metrics have name and labels fields. Speculative decoding metrics are prefixed with vllm:spec_decode_. |
The speculative decoding-specific metrics in the output are:
| Metric Name | Type | Fields | Description |
|---|---|---|---|
| vllm:spec_decode_num_drafts | Counter | value: int | Total number of draft rounds executed. |
| vllm:spec_decode_num_draft_tokens | Counter | value: int | Total number of draft tokens proposed. |
| vllm:spec_decode_num_accepted_tokens | Counter | value: int | Total number of draft tokens accepted by the target model. |
| vllm:spec_decode_num_accepted_tokens_per_pos | Vector | values: list[int] | Accepted token count at each speculation position. Index 0 = first position, index K-1 = last position, where K is num_speculative_tokens. |
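Because Counter values only ever increase, a snapshot reflects everything recorded since the process started. To attribute counts to a single workload, one option is to take a snapshot before and after it and subtract. The sketch below illustrates that idea under stated assumptions: the Counter dataclass is a local stand-in mirroring vllm/v1/metrics/reader.py, counter_totals and counter_delta are hypothetical helpers, and the sample values are made up.

```python
from dataclasses import dataclass


# Local stand-ins mirroring the dataclasses in vllm/v1/metrics/reader.py.
@dataclass
class Metric:
    name: str
    labels: dict[str, str]


@dataclass
class Counter(Metric):
    value: int


def counter_totals(metrics: list[Metric], prefix: str = "vllm:spec_decode_") -> dict[str, int]:
    """Sum Counter values per metric name across all label sets."""
    totals: dict[str, int] = {}
    for m in metrics:
        if isinstance(m, Counter) and m.name.startswith(prefix):
            totals[m.name] = totals.get(m.name, 0) + m.value
    return totals


def counter_delta(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Return the per-counter increase between two snapshots."""
    return {name: after[name] - before.get(name, 0) for name in after}


# In real use these lists would come from two llm.get_metrics() calls,
# one before and one after the workload of interest.
before = counter_totals([
    Counter("vllm:spec_decode_num_drafts", {}, 40),
    Counter("vllm:spec_decode_num_accepted_tokens", {}, 90),
])
after = counter_totals([
    Counter("vllm:spec_decode_num_drafts", {}, 100),
    Counter("vllm:spec_decode_num_accepted_tokens", {}, 230),
])
delta = counter_delta(before, after)
print(delta)
```

Summing across label sets matches the accumulation pattern used in the usage examples on this page, which add every matching metric into a single total.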
Usage Examples
Computing Overall Acceptance Rate
from vllm import LLM, SamplingParams
from vllm.v1.metrics.reader import Counter

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
        "num_speculative_tokens": 3,
    },
    disable_log_stats=False,  # keep stats enabled so metrics are collected
)

# Run inference
outputs = llm.generate(
    ["Explain quantum computing."],
    SamplingParams(temperature=0, max_tokens=256),
)

# Collect metrics
metrics = llm.get_metrics()

# Sum each counter across all label sets
num_drafts = 0
num_draft_tokens = 0
num_accepted_tokens = 0
for metric in metrics:
    if metric.name == "vllm:spec_decode_num_drafts":
        assert isinstance(metric, Counter)
        num_drafts += metric.value
    elif metric.name == "vllm:spec_decode_num_draft_tokens":
        assert isinstance(metric, Counter)
        num_draft_tokens += metric.value
    elif metric.name == "vllm:spec_decode_num_accepted_tokens":
        assert isinstance(metric, Counter)
        num_accepted_tokens += metric.value

# Compute derived metrics; guard both denominators against zero
if num_drafts > 0 and num_draft_tokens > 0:
    acceptance_rate = num_accepted_tokens / num_draft_tokens
    mean_acceptance_length = 1 + (num_accepted_tokens / num_drafts)
    print(f"Acceptance rate: {acceptance_rate:.2%}")
    print(f"Mean acceptance length: {mean_acceptance_length:.2f}")
Per-Position Acceptance Analysis
from vllm import LLM, SamplingParams
from vllm.v1.metrics.reader import Counter, Vector

num_speculative_tokens = 5

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle3",
        "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
        "num_speculative_tokens": num_speculative_tokens,
    },
    disable_log_stats=False,  # keep stats enabled so metrics are collected
)

# Any list of prompt strings works here
prompts = ["Explain quantum computing."]
outputs = llm.generate(prompts, SamplingParams(temperature=0, max_tokens=256))

metrics = llm.get_metrics()

num_drafts = 0
# One accumulator slot per speculation position
acceptance_counts = [0] * num_speculative_tokens
for metric in metrics:
    if metric.name == "vllm:spec_decode_num_drafts":
        assert isinstance(metric, Counter)
        num_drafts += metric.value
    elif metric.name == "vllm:spec_decode_num_accepted_tokens_per_pos":
        assert isinstance(metric, Vector)
        for pos, count in enumerate(metric.values):
            acceptance_counts[pos] += count

# Print per-position acceptance rates
if num_drafts > 0:
    for i, count in enumerate(acceptance_counts):
        rate = count / num_drafts
        print(f"Position {i}: acceptance rate = {rate:.2%}")

# Example output:
# Position 0: acceptance rate = 88.50%
# Position 1: acceptance rate = 82.30%
# Position 2: acceptance rate = 74.10%
# Position 3: acceptance rate = 63.20%
# Position 4: acceptance rate = 51.80%