Implementation:Vllm project Vllm LLM Get Metrics

From Leeroopedia


Knowledge Sources
Domains LLM Inference, Speculative Decoding, Observability
Last Updated 2026-02-08 13:00 GMT

Overview

A concrete tool, provided by vLLM, for retrieving speculative decoding performance metrics from the engine's in-process Prometheus registry.

Description

The LLM.get_metrics() method returns a snapshot of all vLLM-namespaced Prometheus metrics as a list of typed dataclass instances. For speculative decoding, the relevant metrics are Counter objects (for total draft counts and accepted token counts) and Vector objects (for per-position acceptance counts). The method delegates to get_metrics_snapshot() in vllm/v1/metrics/reader.py, which iterates over the Prometheus REGISTRY, filters for metrics prefixed with vllm:, and converts raw Prometheus samples into typed Python dataclasses.
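
The filter-and-convert step described above can be illustrated without vLLM installed. The following is a minimal sketch, not the code in vllm/v1/metrics/reader.py: RawSample is a hypothetical stand-in for a raw Prometheus sample, the "engine" label is an assumed example, and the handling of the "_total" and "_created" suffixes follows the Prometheus Python client's naming convention for counters. Only counters are handled here; the other metric types are omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class RawSample:
    """Hypothetical stand-in for one raw Prometheus sample."""
    name: str
    labels: dict[str, str]
    value: float


@dataclass
class Counter:
    """Local mirror of the Counter dataclass described on this page."""
    name: str
    labels: dict[str, str]
    value: int


def snapshot_counters(samples: list[RawSample], prefix: str = "vllm:") -> list[Counter]:
    """Keep prefixed '_total' samples and convert each to a typed Counter."""
    suffix = "_total"
    counters = []
    for sample in samples:
        if not sample.name.startswith(prefix):
            continue  # metric from another library or the Python runtime
        if not sample.name.endswith(suffix):
            continue  # e.g. the '_created' timestamp sample of a counter
        counters.append(
            Counter(
                name=sample.name[: -len(suffix)],
                labels=dict(sample.labels),
                value=int(sample.value),
            )
        )
    return counters


# Illustrative input only; the values and labels are made up.
raw = [
    RawSample("vllm:spec_decode_num_drafts_total", {"engine": "0"}, 12.0),
    RawSample("vllm:spec_decode_num_drafts_created", {"engine": "0"}, 1.7e9),
    RawSample("python_gc_objects_collected_total", {"generation": "0"}, 5.0),
]
for counter in snapshot_counters(raw):
    print(counter)
```

Only the first sample survives both filters, so the result is a single Counter whose name has the "_total" suffix stripped.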

The metric types are:

  • Counter: A monotonically increasing integer counter with a value field
  • Vector: An ordered array of integer counters with a values field (used specifically for per-position acceptance counts)
  • Gauge: A floating-point value that can increase or decrease
  • Histogram: Bucketed observations with count, sum, and buckets fields

This method is only available with the V1 LLM engine.
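
Because get_metrics() returns a mixed list, callers typically branch on the metric type before reading fields. The sketch below redefines the four dataclasses locally (mirroring the signatures given on this page) so that it runs without vLLM; with vLLM installed, the classes would be imported from vllm.v1.metrics.reader instead. The describe helper and the "vllm:example_*" metric names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    name: str
    labels: dict[str, str]


@dataclass
class Counter(Metric):
    value: int


@dataclass
class Vector(Metric):
    values: list[int]


@dataclass
class Gauge(Metric):
    value: float


@dataclass
class Histogram(Metric):
    count: int
    sum: float
    buckets: dict[str, int]


def describe(metric: Metric) -> str:
    """Render one metric as a line of text, reading only the fields its type has."""
    if isinstance(metric, Counter):
        return f"{metric.name} counter: {metric.value}"
    if isinstance(metric, Vector):
        return f"{metric.name} vector: {metric.values}"
    if isinstance(metric, Gauge):
        return f"{metric.name} gauge: {metric.value:.2f}"
    if isinstance(metric, Histogram):
        # Guard against an empty histogram before computing the mean.
        mean = metric.sum / metric.count if metric.count else 0.0
        return f"{metric.name} histogram: count={metric.count} mean={mean:.3f}"
    raise TypeError(f"unsupported metric type: {type(metric).__name__}")


# Illustrative values only; the example_* names are not real vLLM metrics.
sample_metrics = [
    Counter("vllm:spec_decode_num_drafts", {}, 3),
    Vector("vllm:spec_decode_num_accepted_tokens_per_pos", {}, [3, 2, 1]),
    Gauge("vllm:example_gauge", {}, 0.25),
    Histogram("vllm:example_histogram", {}, 4, 2.0, {"0.1": 1, "1.0": 3, "+Inf": 4}),
]
for m in sample_metrics:
    print(describe(m))
```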

Usage

Use this method after running speculative inference to evaluate the performance of the speculation strategy. The metrics enable computing acceptance rates, mean acceptance length, and per-position acceptance profiles that guide tuning of num_speculative_tokens and method selection.
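
The two headline quantities can be checked with plain arithmetic before running a model. This is a minimal sketch using the same formulas as the usage examples on this page; the function names are hypothetical, and the interpretation of the constant 1 as the one token the target model contributes per draft round is an assumption.

```python
from typing import Optional


def acceptance_rate(num_accepted_tokens: int, num_draft_tokens: int) -> Optional[float]:
    """Fraction of proposed draft tokens that were accepted, or None if nothing was proposed."""
    if num_draft_tokens <= 0:
        return None
    return num_accepted_tokens / num_draft_tokens


def mean_acceptance_length(num_accepted_tokens: int, num_drafts: int) -> Optional[float]:
    """Average tokens produced per draft round, or None if no rounds ran."""
    if num_drafts <= 0:
        return None
    # Assumption: the +1 is the token the target model emits each round.
    return 1 + num_accepted_tokens / num_drafts


# Made-up counts: 100 draft rounds of 3 tokens each, 210 tokens accepted.
rate = acceptance_rate(num_accepted_tokens=210, num_draft_tokens=300)
length = mean_acceptance_length(num_accepted_tokens=210, num_drafts=100)
print(f"Acceptance rate: {rate:.2%}")
print(f"Mean acceptance length: {length:.2f}")
```

With these made-up counts, 210 of 300 proposed tokens are accepted (70%), and each round yields 1 + 210/100 = 3.1 tokens on average.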

Code Reference

Source Location

  • Repository: vllm
  • File: vllm/entrypoints/llm.py:L1651-1656 (LLM.get_metrics), vllm/v1/metrics/reader.py:L11-143 (metric types and get_metrics_snapshot)

Signature

# LLM.get_metrics() in vllm/entrypoints/llm.py
def get_metrics(self) -> list[Metric]:
    """Return a snapshot of aggregated metrics from Prometheus.

    Returns:
        A list of Metric instances capturing the current state
        of all aggregated metrics from Prometheus.

    Note:
        This method is only available with the V1 LLM engine.
    """
    return self.llm_engine.get_metrics()

# Metric base class and subclasses in vllm/v1/metrics/reader.py
@dataclass
class Metric:
    name: str
    labels: dict[str, str]

@dataclass
class Counter(Metric):
    value: int

@dataclass
class Vector(Metric):
    values: list[int]

@dataclass
class Gauge(Metric):
    value: float

@dataclass
class Histogram(Metric):
    count: int
    sum: float
    buckets: dict[str, int]

Import

from vllm import LLM
from vllm.v1.metrics.reader import (
    get_metrics_snapshot,
    Counter,
    Vector,
    Gauge,
    Histogram,
)

I/O Contract

Inputs

Name Type Required Description
(none) This method takes no arguments. It reads from the in-process Prometheus registry.

Outputs

Name Type Description
metrics list[Metric] A list of metric objects. Each is one of Counter, Vector, Gauge, or Histogram. All metrics have name and labels fields. Speculative decoding metrics are prefixed with vllm:spec_decode_.

The speculative decoding-specific metrics in the output are:

Metric Name | Type | Fields | Description
vllm:spec_decode_num_drafts | Counter | value: int | Total number of draft rounds executed.
vllm:spec_decode_num_draft_tokens | Counter | value: int | Total number of draft tokens proposed.
vllm:spec_decode_num_accepted_tokens | Counter | value: int | Total number of draft tokens accepted by the target model.
vllm:spec_decode_num_accepted_tokens_per_pos | Vector | values: list[int] | Accepted token count at each speculation position. Index 0 = first position, index K-1 = last position.
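
The same metric name may appear more than once in the returned list when it is reported under several label sets, which is why the usage examples accumulate with +=. The sketch below gathers the four metrics above into one totals object. Counter and Vector are local mirrors of the dataclasses on this page, while SpecDecodeTotals, collect_spec_decode_totals, the "engine" label, and the "vllm:some_other_counter" name are hypothetical and used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Counter:
    name: str
    labels: dict[str, str]
    value: int


@dataclass
class Vector:
    name: str
    labels: dict[str, str]
    values: list[int]


PREFIX = "vllm:spec_decode_"


@dataclass
class SpecDecodeTotals:
    """Hypothetical container for totals summed over all label sets."""
    counters: dict[str, int] = field(default_factory=dict)
    accepted_per_pos: list[int] = field(default_factory=list)


def collect_spec_decode_totals(metrics: list) -> SpecDecodeTotals:
    totals = SpecDecodeTotals()
    for metric in metrics:
        if not metric.name.startswith(PREFIX):
            continue  # not a speculative decoding metric
        short = metric.name[len(PREFIX):]
        if isinstance(metric, Counter):
            totals.counters[short] = totals.counters.get(short, 0) + metric.value
        elif isinstance(metric, Vector) and short == "num_accepted_tokens_per_pos":
            # Grow the running vector if this one has more positions.
            missing = len(metric.values) - len(totals.accepted_per_pos)
            if missing > 0:
                totals.accepted_per_pos.extend([0] * missing)
            for pos, count in enumerate(metric.values):
                totals.accepted_per_pos[pos] += count
    return totals


# Made-up snapshot with two label sets per metric.
metrics = [
    Counter("vllm:spec_decode_num_drafts", {"engine": "0"}, 40),
    Counter("vllm:spec_decode_num_drafts", {"engine": "1"}, 60),
    Counter("vllm:spec_decode_num_draft_tokens", {"engine": "0"}, 120),
    Counter("vllm:spec_decode_num_draft_tokens", {"engine": "1"}, 180),
    Counter("vllm:spec_decode_num_accepted_tokens", {"engine": "0"}, 90),
    Counter("vllm:spec_decode_num_accepted_tokens", {"engine": "1"}, 120),
    Vector("vllm:spec_decode_num_accepted_tokens_per_pos", {"engine": "0"}, [36, 30, 24]),
    Vector("vllm:spec_decode_num_accepted_tokens_per_pos", {"engine": "1"}, [50, 40, 30]),
    Counter("vllm:some_other_counter", {}, 7),
]
totals = collect_spec_decode_totals(metrics)
print(totals.counters)
print(totals.accepted_per_pos)
```

Sizing the per-position list from the data, rather than hard-coding num_speculative_tokens, avoids an index error if the configured value and the reported vector length ever differ.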

Usage Examples

Computing Overall Acceptance Rate

from vllm import LLM, SamplingParams
from vllm.v1.metrics.reader import Counter

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
        "num_speculative_tokens": 3,
    },
    disable_log_stats=False,
)

# Run inference
outputs = llm.generate(
    ["Explain quantum computing."],
    SamplingParams(temperature=0, max_tokens=256),
)

# Collect metrics
metrics = llm.get_metrics()

num_drafts = 0
num_draft_tokens = 0
num_accepted_tokens = 0

for metric in metrics:
    if metric.name == "vllm:spec_decode_num_drafts":
        assert isinstance(metric, Counter)
        num_drafts += metric.value
    elif metric.name == "vllm:spec_decode_num_draft_tokens":
        assert isinstance(metric, Counter)
        num_draft_tokens += metric.value
    elif metric.name == "vllm:spec_decode_num_accepted_tokens":
        assert isinstance(metric, Counter)
        num_accepted_tokens += metric.value

# Compute derived metrics, guarding each denominator separately
if num_draft_tokens > 0:
    acceptance_rate = num_accepted_tokens / num_draft_tokens
    print(f"Acceptance rate: {acceptance_rate:.2%}")
if num_drafts > 0:
    mean_acceptance_length = 1 + (num_accepted_tokens / num_drafts)
    print(f"Mean acceptance length: {mean_acceptance_length:.2f}")

Per-Position Acceptance Analysis

from vllm import LLM, SamplingParams
from vllm.v1.metrics.reader import Counter, Vector

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle3",
        "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
        "num_speculative_tokens": 5,
    },
    disable_log_stats=False,
)

# Example prompts; replace with your own workload
prompts = ["Explain quantum computing."]
outputs = llm.generate(prompts, SamplingParams(temperature=0, max_tokens=256))
metrics = llm.get_metrics()

num_drafts = 0
acceptance_counts = [0] * 5  # num_speculative_tokens = 5

for metric in metrics:
    if metric.name == "vllm:spec_decode_num_drafts":
        assert isinstance(metric, Counter)
        num_drafts += metric.value
    elif metric.name == "vllm:spec_decode_num_accepted_tokens_per_pos":
        assert isinstance(metric, Vector)
        for pos, count in enumerate(metric.values):
            acceptance_counts[pos] += count

# Print per-position acceptance rates
if num_drafts > 0:
    for i, count in enumerate(acceptance_counts):
        rate = count / num_drafts
        print(f"Position {i}: acceptance rate = {rate:.2%}")
    # Example output:
    # Position 0: acceptance rate = 88.50%
    # Position 1: acceptance rate = 82.30%
    # Position 2: acceptance rate = 74.10%
    # Position 3: acceptance rate = 63.20%
    # Position 4: acceptance rate = 51.80%
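
The per-position rates also give a second route to the mean acceptance length. Assuming the per-position counts sum to the total accepted token count (an assumption about how the two metrics relate, not something this page states), the mean acceptance length equals 1 plus the sum of the per-position rates. The sketch below applies that to counts chosen to reproduce the example rates above for a hypothetical 1000 draft rounds; the function names are illustrative.

```python
def per_position_rates(accepted_per_pos: list[int], num_drafts: int) -> list[float]:
    """Acceptance rate at each speculation position, relative to draft rounds."""
    if num_drafts <= 0:
        raise ValueError("num_drafts must be positive")
    return [count / num_drafts for count in accepted_per_pos]


def mean_acceptance_length_from_positions(accepted_per_pos: list[int], num_drafts: int) -> float:
    """1 + sum of per-position rates (assumes the counts sum to total accepted tokens)."""
    return 1 + sum(per_position_rates(accepted_per_pos, num_drafts))


# Made-up counts matching the example rates above for 1000 draft rounds.
counts = [885, 823, 741, 632, 518]
rates = per_position_rates(counts, num_drafts=1000)
length = mean_acceptance_length_from_positions(counts, num_drafts=1000)
for i, rate in enumerate(rates):
    print(f"Position {i}: acceptance rate = {rate:.2%}")
print(f"Mean acceptance length: {length:.3f}")
```

Under that assumption the example profile corresponds to roughly 4.6 tokens per round. A later position that adds little to this sum is one signal that a smaller num_speculative_tokens may be worth testing.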

