
Implementation:Vllm project Vllm RequestOutput Access

From Leeroopedia


Knowledge Sources
Domains Machine Learning, Natural Language Processing, Software Engineering
Last Updated 2026-02-08 13:00 GMT

Overview

A concrete guide to accessing and interpreting the generation results that vLLM returns through its output data classes.

Description

vLLM represents generation results using two primary dataclasses:

  • RequestOutput: The top-level container for a single request's results. It holds the original prompt, prompt token IDs, prompt log probabilities, a list of CompletionOutput objects, and metadata such as the request ID, finish status, and performance metrics.
  • CompletionOutput: Represents a single generated completion within a request. When n=1 (the default), there is one CompletionOutput per request. When n > 1, each request contains multiple CompletionOutput objects indexed by their position.

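The nesting described above can be pictured with plain Python. The stand-in dataclasses below are hypothetical, defined only to illustrate the shape; the real classes live in vllm.outputs and carry more fields:

```python
from dataclasses import dataclass

# Stand-in shapes for illustration only (not vLLM's real classes).
@dataclass
class FakeCompletionOutput:
    index: int
    text: str

@dataclass
class FakeRequestOutput:
    request_id: str
    outputs: list  # list[FakeCompletionOutput]

# A request generated with n=2 holds two completions, indexed 0 and 1.
req = FakeRequestOutput(
    request_id="req-0",
    outputs=[
        FakeCompletionOutput(index=0, text="first sample"),
        FakeCompletionOutput(index=1, text="second sample"),
    ],
)

# Standard access pattern: request -> list of completions -> text.
texts = [c.text for c in req.outputs]
```

With the real classes, the same two-level access pattern applies: `outputs[i].outputs[j].text` reads completion j of request i.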
CompletionOutput is a Python @dataclass, while RequestOutput is a plain class; both expose their fields as ordinary attributes, making access straightforward.

Usage

Access these objects after every call to LLM.generate() or LLM.chat(). The return value is always list[RequestOutput], with one entry per input prompt, in the same order as the inputs.

Code Reference

Source Location

  • Repository: vllm
  • File: vllm/outputs.py
  • Lines: 22-193

Signature

@dataclass
class CompletionOutput:
    index: int
    text: str
    token_ids: Sequence[int]
    cumulative_logprob: float | None
    logprobs: SampleLogprobs | None
    routed_experts: np.ndarray | None = None
    finish_reason: str | None = None
    stop_reason: int | str | None = None
    lora_request: LoRARequest | None = None

class RequestOutput:
    def __init__(
        self,
        request_id: str,
        prompt: str | None,
        prompt_token_ids: list[int] | None,
        prompt_logprobs: PromptLogprobs | None,
        outputs: list[CompletionOutput],
        finished: bool,
        metrics: RequestStateStats | None = None,
        lora_request: LoRARequest | None = None,
        encoder_prompt: str | None = None,
        encoder_prompt_token_ids: list[int] | None = None,
        num_cached_tokens: int | None = None,
        *,
        multi_modal_placeholders: MultiModalPlaceholderDict | None = None,
        kv_transfer_params: dict[str, Any] | None = None,
        **kwargs: Any,
    ) -> None

Import

from vllm.outputs import RequestOutput, CompletionOutput

I/O Contract

Inputs

The output objects are returned by LLM.generate() and LLM.chat(); they are not directly constructed by the user. The relevant attributes for reading are:

RequestOutput Attributes

Name | Type | Description
---- | ---- | -----------
request_id | str | Unique identifier for this request
prompt | str or None | The original prompt string
prompt_token_ids | list[int] or None | Token IDs of the prompt
prompt_logprobs | PromptLogprobs or None | Log probabilities for prompt tokens (if requested)
outputs | list[CompletionOutput] | List of generated completions (length equals the n parameter)
finished | bool | Whether the request has fully completed
metrics | RequestStateStats or None | Performance metrics for the request
num_cached_tokens | int or None | Number of tokens served from the prefix cache
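num_cached_tokens pairs naturally with the length of prompt_token_ids to gauge prefix-cache effectiveness. A hedged helper (hypothetical, not part of vLLM) that handles the None case:

```python
def cache_hit_fraction(num_cached_tokens, num_prompt_tokens):
    """Fraction of prompt tokens served from the prefix cache.

    Illustrative helper, not a vLLM API. num_cached_tokens may be
    None when caching statistics are unavailable, so guard for it.
    """
    if not num_cached_tokens or not num_prompt_tokens:
        return 0.0
    return num_cached_tokens / num_prompt_tokens
```

Usage would look like `cache_hit_fraction(out.num_cached_tokens, len(out.prompt_token_ids))` for a given RequestOutput `out`.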

CompletionOutput Attributes

Name | Type | Description
---- | ---- | -----------
index | int | Index of this completion among the n outputs for its request
text | str | The generated text
token_ids | Sequence[int] | Token IDs of the generated text
cumulative_logprob | float or None | Sum of log probabilities of all generated tokens
logprobs | SampleLogprobs or None | Per-token log probabilities (if requested via the logprobs parameter)
finish_reason | str or None | Why generation stopped: "stop" (natural end or stop sequence) or "length" (hit max_tokens)
stop_reason | int, str, or None | The specific stop string or token ID that caused stopping, or None
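finish_reason says why generation ended, while stop_reason identifies which stop string or stop token triggered it (None when generation ended at EOS or hit max_tokens). A small helper, hypothetical and written only to make the distinction concrete:

```python
def describe_stop(finish_reason, stop_reason):
    """Turn a (finish_reason, stop_reason) pair into a readable note.

    Illustrative helper, not part of vLLM.
    """
    if finish_reason == "length":
        return "hit max_tokens"
    if finish_reason == "stop":
        if stop_reason is None:
            return "reached EOS naturally"
        if isinstance(stop_reason, str):
            return f"matched stop string {stop_reason!r}"
        return f"matched stop token id {stop_reason}"
    return f"other: {finish_reason}"
```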

Outputs

Name | Type | Description
---- | ---- | -----------
(attribute access) | various | The individual fields described above, accessed via standard Python attribute notation

Usage Examples

Basic Text Extraction

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["What is Python?"], params)

# Access the first (and only) request's first completion
request_output = outputs[0]
completion = request_output.outputs[0]
print(completion.text)
print(f"Finish reason: {completion.finish_reason}")

Processing Batch Results

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Define machine learning:",
    "Explain neural networks:",
    "What is deep learning?",
]

outputs = llm.generate(prompts, params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    finish_reason = output.outputs[0].finish_reason
    num_tokens = len(output.outputs[0].token_ids)
    print(f"Prompt: {prompt}")
    print(f"Response ({num_tokens} tokens, {finish_reason}): {generated_text}\n")

Multiple Completions per Prompt

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(n=3, temperature=0.9, max_tokens=64)

outputs = llm.generate(["Write a tagline for a coffee shop:"], params)

request_output = outputs[0]
for completion in request_output.outputs:
    print(f"  Completion {completion.index}: {completion.text}")
    print(f"    Cumulative logprob: {completion.cumulative_logprob}")
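A common follow-up is ranking the n samples by model likelihood, which cumulative_logprob makes a one-liner. Sketched with SimpleNamespace stand-ins (and invented scores) rather than real outputs:

```python
from types import SimpleNamespace

# Stand-ins for the n=3 CompletionOutput objects of one request;
# the logprob values here are made up for illustration.
completions = [
    SimpleNamespace(index=0, text="Brewed to perfection.", cumulative_logprob=-12.3),
    SimpleNamespace(index=1, text="Wake up and live.", cumulative_logprob=-8.7),
    SimpleNamespace(index=2, text="Coffee, but better.", cumulative_logprob=-10.1),
]

# Higher (less negative) cumulative logprob = more likely under the model.
# Note cumulative_logprob can be None when logprob tracking is disabled.
best = max(completions, key=lambda c: c.cumulative_logprob)
```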

Extracting Log Probabilities

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(
    temperature=0,
    max_tokens=32,
    logprobs=5,  # Return top-5 log probs per token
)

outputs = llm.generate(["The capital of France is"], params)

completion = outputs[0].outputs[0]
print(f"Generated: {completion.text}")
print(f"Cumulative logprob: {completion.cumulative_logprob}")

# Inspect per-token log probabilities
if completion.logprobs:
    for step, token_logprobs in enumerate(completion.logprobs):
        print(f"  Step {step}: {token_logprobs}")
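Each step entry in completion.logprobs maps candidate token IDs to Logprob objects, which in current vLLM versions expose at least logprob, rank, and decoded_token; the exact structure may vary by version, so the sketch below uses stand-in objects to show how to pull out the chosen token's log probability per step:

```python
from types import SimpleNamespace

# Stand-ins for one generation step of completion.logprobs:
# a dict of {token_id: Logprob-like object} for the top candidates.
step_logprobs = [
    {
        42: SimpleNamespace(logprob=-0.05, rank=1, decoded_token="Paris"),
        7: SimpleNamespace(logprob=-3.2, rank=2, decoded_token="Lyon"),
    },
]
chosen_ids = [42]  # plays the role of completion.token_ids

# Pair each sampled token id with its candidate dict to recover the
# log probability of the token that was actually emitted.
per_token = []
for tok_id, candidates in zip(chosen_ids, step_logprobs):
    chosen = candidates[tok_id]
    per_token.append((chosen.decoded_token, chosen.logprob))
```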

Checking Finish Reasons for Quality Control

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=100, stop=["\n\n"])

outputs = llm.generate(["Summarize the benefits of exercise:"], params)

completion = outputs[0].outputs[0]
if completion.finish_reason == "stop":
    print("Complete response:", completion.text)
elif completion.finish_reason == "length":
    print("Truncated response (consider increasing max_tokens):", completion.text)

Related Pages

Implements Principle
