Implementation:Vllm project Vllm LLM Get Metrics

Knowledge Sources	vLLM vLLM Docs
Domains	LLM Inference, Speculative Decoding, Observability
Last Updated	2026-02-08 13:00 GMT

Overview

A concrete tool, provided by vLLM, for retrieving speculative decoding performance metrics from the engine's in-process Prometheus registry.

Description

The LLM.get_metrics() method returns a snapshot of all vLLM-namespaced Prometheus metrics as a list of typed dataclass instances. For speculative decoding, the relevant metrics are Counter objects (for total draft counts and accepted token counts) and Vector objects (for per-position acceptance counts). The method delegates to get_metrics_snapshot() in vllm/v1/metrics/reader.py, which iterates over the Prometheus REGISTRY, filters for metrics prefixed with vllm:, and converts raw Prometheus samples into typed Python dataclasses.

The metric types are:

Counter: A monotonically increasing integer counter with a value field
Vector: An ordered array of integer counters with a values field (used specifically for per-position acceptance counts)
Gauge: A floating-point value that can increase or decrease
Histogram: Bucketed observations with count, sum, and buckets fields
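
The four types can be told apart with isinstance checks. The sketch below uses local stand-in dataclasses that mirror the field names listed above, so it runs without vLLM installed; with vLLM, the real classes would be imported from vllm.v1.metrics.reader instead. The describe helper, the vllm:example_* names, and all sample values are hypothetical.

```python
from dataclasses import dataclass


# Stand-ins that mirror the field names described above; not the vLLM classes.
@dataclass
class Metric:
    name: str
    labels: dict[str, str]


@dataclass
class Counter(Metric):
    value: int


@dataclass
class Vector(Metric):
    values: list[int]


@dataclass
class Gauge(Metric):
    value: float


@dataclass
class Histogram(Metric):
    count: int
    sum: float
    buckets: dict[str, int]


def describe(metric: Metric) -> str:
    """Return a one-line summary based on the metric's concrete type."""
    if isinstance(metric, Counter):
        return f"{metric.name}: counter={metric.value}"
    if isinstance(metric, Vector):
        return f"{metric.name}: vector={metric.values}"
    if isinstance(metric, Gauge):
        return f"{metric.name}: gauge={metric.value}"
    if isinstance(metric, Histogram):
        # Mean observation; guard against an empty histogram.
        mean = metric.sum / metric.count if metric.count else 0.0
        return f"{metric.name}: histogram count={metric.count} mean={mean:.3f}"
    return f"{metric.name}: unknown type"


# Made-up sample values for illustration only.
samples = [
    Counter("vllm:spec_decode_num_drafts", {}, 100),
    Vector("vllm:spec_decode_num_accepted_tokens_per_pos", {}, [90, 70, 50]),
    Gauge("vllm:example_gauge", {}, 0.25),
    Histogram("vllm:example_histogram", {}, 4, 2.0, {"0.5": 2, "+Inf": 4}),
]
summaries = [describe(m) for m in samples]
```

Checking the concrete type before reading value or values avoids attribute errors when a snapshot mixes all four kinds of metric.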

This method is only available with the V1 LLM engine.

Usage

Use this method after running speculative inference to evaluate the performance of the speculation strategy. The metrics enable computing acceptance rates, mean acceptance length, and per-position acceptance profiles, which guide the tuning of num_speculative_tokens and the choice of speculation method.
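
Because counters only increase, a snapshot reflects cumulative totals rather than only the most recent generate() call. One way to isolate a single workload is to take a snapshot before and after it and subtract. The sketch below shows that subtraction on plain dictionaries; counter_delta is a hypothetical helper and the numbers are made up. With vLLM, each dictionary would be built from the Counter entries returned by get_metrics().

```python
def counter_delta(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Subtract an earlier counter snapshot from a later one.

    Counters missing from `before` are treated as starting at zero.
    """
    return {name: value - before.get(name, 0) for name, value in after.items()}


# Made-up snapshots mapping counter name -> cumulative value.
before = {
    "vllm:spec_decode_num_drafts": 40,
    "vllm:spec_decode_num_draft_tokens": 120,
    "vllm:spec_decode_num_accepted_tokens": 80,
}
after = {
    "vllm:spec_decode_num_drafts": 100,
    "vllm:spec_decode_num_draft_tokens": 300,
    "vllm:spec_decode_num_accepted_tokens": 215,
}

delta = counter_delta(before, after)
drafts = delta["vllm:spec_decode_num_drafts"]
draft_tokens = delta["vllm:spec_decode_num_draft_tokens"]
accepted = delta["vllm:spec_decode_num_accepted_tokens"]

# Derived metrics for the workload that ran between the two snapshots.
acceptance_rate = accepted / draft_tokens
mean_acceptance_length = 1 + accepted / drafts
```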

Code Reference

Source Location

Repository: vllm
File: vllm/entrypoints/llm.py:L1651-1656 (LLM.get_metrics), vllm/v1/metrics/reader.py:L11-143 (metric types and get_metrics_snapshot)

Signature

# LLM.get_metrics() in vllm/entrypoints/llm.py
def get_metrics(self) -> list[Metric]:
    """Return a snapshot of aggregated metrics from Prometheus.

    Returns:
        A list of Metric instances capturing the current state
        of all aggregated metrics from Prometheus.

    Note:
        This method is only available with the V1 LLM engine.
    """
    return self.llm_engine.get_metrics()

# Metric base class and subclasses in vllm/v1/metrics/reader.py
@dataclass
class Metric:
    name: str
    labels: dict[str, str]

@dataclass
class Counter(Metric):
    value: int

@dataclass
class Vector(Metric):
    values: list[int]

@dataclass
class Gauge(Metric):
    value: float

@dataclass
class Histogram(Metric):
    count: int
    sum: float
    buckets: dict[str, int]

Import

from vllm import LLM
from vllm.v1.metrics.reader import (
    get_metrics_snapshot,
    Counter,
    Vector,
    Gauge,
    Histogram,
)

I/O Contract

Inputs

Name	Type	Required	Description
(none)			This method takes no arguments. It reads from the in-process Prometheus registry.

Outputs

Name	Type	Description
metrics	`list[Metric]`	A list of metric objects. Each is one of `Counter`, `Vector`, `Gauge`, or `Histogram`. All metrics have `name` and `labels` fields. Speculative decoding metrics are prefixed with `vllm:spec_decode_`.

The speculative decoding-specific metrics in the output are:

Metric Name	Type	Fields	Description
`vllm:spec_decode_num_drafts`	`Counter`	`value: int`	Total number of draft rounds executed.
`vllm:spec_decode_num_draft_tokens`	`Counter`	`value: int`	Total number of draft tokens proposed.
`vllm:spec_decode_num_accepted_tokens`	`Counter`	`value: int`	Total number of draft tokens accepted by the target model.
`vllm:spec_decode_num_accepted_tokens_per_pos`	`Vector`	`values: list[int]`	Accepted token count at each speculation position. Index 0 is the first speculated position and index K-1 is the last, where K is `num_speculative_tokens`.
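
A snapshot may contain several entries with the same name but different labels, so it can help to aggregate by name before computing ratios. The following is a minimal sketch under that assumption. It uses stand-in Counter and Vector dataclasses so that it runs without vLLM; collect_spec_decode is a hypothetical helper, and the "engine" label and all values are made up.

```python
from dataclasses import dataclass

PREFIX = "vllm:spec_decode_"


# Stand-ins with the same field names as the types in the tables above.
@dataclass
class Counter:
    name: str
    labels: dict[str, str]
    value: int


@dataclass
class Vector:
    name: str
    labels: dict[str, str]
    values: list[int]


def collect_spec_decode(metrics: list) -> tuple[dict[str, int], list[int]]:
    """Sum speculative decoding counters and per-position vectors by name."""
    totals: dict[str, int] = {}
    per_pos: list[int] = []
    for m in metrics:
        if not m.name.startswith(PREFIX):
            continue  # skip metrics unrelated to speculative decoding
        key = m.name[len(PREFIX):]
        if isinstance(m, Counter):
            totals[key] = totals.get(key, 0) + m.value
        elif isinstance(m, Vector):
            # Grow the accumulator if this vector is longer, then add elementwise.
            if len(m.values) > len(per_pos):
                per_pos.extend([0] * (len(m.values) - len(per_pos)))
            for i, v in enumerate(m.values):
                per_pos[i] += v
    return totals, per_pos


# Made-up snapshot with two label sets and one unrelated counter.
snapshot = [
    Counter("vllm:spec_decode_num_drafts", {"engine": "0"}, 10),
    Counter("vllm:spec_decode_num_drafts", {"engine": "1"}, 5),
    Vector("vllm:spec_decode_num_accepted_tokens_per_pos", {"engine": "0"}, [8, 5]),
    Vector("vllm:spec_decode_num_accepted_tokens_per_pos", {"engine": "1"}, [4, 2]),
    Counter("vllm:example_unrelated_total", {}, 99),
]
totals, per_pos = collect_spec_decode(snapshot)
```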

Usage Examples

Computing Overall Acceptance Rate

from vllm import LLM, SamplingParams
from vllm.v1.metrics.reader import Counter

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
        "num_speculative_tokens": 3,
    },
    disable_log_stats=False,
)

# Run inference
outputs = llm.generate(
    ["Explain quantum computing."],
    SamplingParams(temperature=0, max_tokens=256),
)

# Collect metrics
metrics = llm.get_metrics()

num_drafts = 0
num_draft_tokens = 0
num_accepted_tokens = 0

for metric in metrics:
    if metric.name == "vllm:spec_decode_num_drafts":
        assert isinstance(metric, Counter)
        num_drafts += metric.value
    elif metric.name == "vllm:spec_decode_num_draft_tokens":
        assert isinstance(metric, Counter)
        num_draft_tokens += metric.value
    elif metric.name == "vllm:spec_decode_num_accepted_tokens":
        assert isinstance(metric, Counter)
        num_accepted_tokens += metric.value

# Compute derived metrics (guard both denominators against division by zero)
if num_drafts > 0 and num_draft_tokens > 0:
    acceptance_rate = num_accepted_tokens / num_draft_tokens
    mean_acceptance_length = 1 + (num_accepted_tokens / num_drafts)
    print(f"Acceptance rate: {acceptance_rate:.2%}")
    print(f"Mean acceptance length: {mean_acceptance_length:.2f}")

Per-Position Acceptance Analysis

from vllm import LLM, SamplingParams
from vllm.v1.metrics.reader import Counter, Vector

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle3",
        "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
        "num_speculative_tokens": 5,
    },
    disable_log_stats=False,
)

# Any list of prompt strings works here
prompts = ["Explain quantum computing."]
outputs = llm.generate(prompts, SamplingParams(temperature=0, max_tokens=256))
metrics = llm.get_metrics()

num_drafts = 0
acceptance_counts = [0] * 5  # num_speculative_tokens = 5

for metric in metrics:
    if metric.name == "vllm:spec_decode_num_drafts":
        assert isinstance(metric, Counter)
        num_drafts += metric.value
    elif metric.name == "vllm:spec_decode_num_accepted_tokens_per_pos":
        assert isinstance(metric, Vector)
        for pos in range(len(metric.values)):
            acceptance_counts[pos] += metric.values[pos]

# Print per-position acceptance rates
if num_drafts > 0:
    for i, count in enumerate(acceptance_counts):
        rate = count / num_drafts
        print(f"Position {i}: acceptance rate = {rate:.2%}")
    # Example output:
    # Position 0: acceptance rate = 88.50%
    # Position 1: acceptance rate = 82.30%
    # Position 2: acceptance rate = 74.10%
    # Position 3: acceptance rate = 63.20%
    # Position 4: acceptance rate = 51.80%
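
The per-position counts support two further readings. The sketch below assumes that a token at position i is only accepted when every earlier position in the same draft was accepted, so the counts never increase with position and their sum equals the total number of accepted tokens. Under that assumption it derives conditional acceptance rates (position i given that position i-1 was accepted) and the mean acceptance length, using the same "1 + accepted / drafts" formula as the first example. per_position_report is a hypothetical helper and the input numbers are made up.

```python
def per_position_report(
    num_drafts: int, accepted_per_pos: list[int]
) -> tuple[list[float], list[float], float]:
    """Return (marginal_rates, conditional_rates, mean_acceptance_length)."""
    if num_drafts <= 0:
        raise ValueError("num_drafts must be positive")

    # Marginal rate: fraction of all drafts whose token at this position was accepted.
    marginal = [count / num_drafts for count in accepted_per_pos]

    # Conditional rate: accepted at this position, relative to the number of
    # drafts that were still alive after the previous position.
    conditional = []
    previous = num_drafts
    for count in accepted_per_pos:
        conditional.append(count / previous if previous else 0.0)
        previous = count

    # One token from the target model plus the accepted draft tokens per draft.
    mean_length = 1 + sum(accepted_per_pos) / num_drafts
    return marginal, conditional, mean_length


# Made-up counts for three speculation positions over 200 drafts.
marginal, conditional, mean_len = per_position_report(200, [180, 144, 72])
```

A sharp drop in the conditional rate at some position suggests that speculating beyond it adds drafting cost for little gain, which is one signal to weigh when choosing num_speculative_tokens.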

Related Pages

Implements Principle

Principle:Vllm_project_Vllm_Speculation_Metrics_Evaluation
