Implementation: vLLM Project - vLLM RequestOutput Access
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Natural Language Processing, Software Engineering |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
Concrete tool for accessing and interpreting generation results provided by vLLM's output data classes.
Description
vLLM represents generation results using two primary dataclasses:
- RequestOutput: The top-level container for a single request's results. It holds the original prompt, prompt token IDs, prompt log probabilities, a list of CompletionOutput objects, and metadata such as the request ID, finish status, and performance metrics.
- CompletionOutput: Represents a single generated completion within a request. When n=1 (the default), there is one CompletionOutput per request. When n > 1, each request contains multiple CompletionOutput objects indexed by their position.
CompletionOutput is a Python @dataclass, while RequestOutput is a plain class with an explicit __init__; both expose their fields as ordinary attributes, so access is straightforward.
Usage
Access these objects after every call to LLM.generate() or LLM.chat(). The return value is always list[RequestOutput], with one entry per input prompt, in the same order as the inputs.
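A minimal sketch of that return shape (reusing the model from the examples below): outputs[i] corresponds to prompts[i], and each entry holds its completions under .outputs.

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
prompts = ["Hello", "Goodbye"]
outputs = llm.generate(prompts, SamplingParams(max_tokens=16))

# outputs preserves prompt order, so zipping with the prompts is safe.
for prompt, request_output in zip(prompts, outputs):
    print(prompt, "->", request_output.outputs[0].text)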
Code Reference
Source Location
- Repository: vllm
- File: vllm/outputs.py
- Lines: 22-193
Signature
@dataclass
class CompletionOutput:
    index: int
    text: str
    token_ids: Sequence[int]
    cumulative_logprob: float | None
    logprobs: SampleLogprobs | None
    routed_experts: np.ndarray | None = None
    finish_reason: str | None = None
    stop_reason: int | str | None = None
    lora_request: LoRARequest | None = None

class RequestOutput:
    def __init__(
        self,
        request_id: str,
        prompt: str | None,
        prompt_token_ids: list[int] | None,
        prompt_logprobs: PromptLogprobs | None,
        outputs: list[CompletionOutput],
        finished: bool,
        metrics: RequestStateStats | None = None,
        lora_request: LoRARequest | None = None,
        encoder_prompt: str | None = None,
        encoder_prompt_token_ids: list[int] | None = None,
        num_cached_tokens: int | None = None,
        *,
        multi_modal_placeholders: MultiModalPlaceholderDict | None = None,
        kv_transfer_params: dict[str, Any] | None = None,
        **kwargs: Any,
    ) -> None
Import
from vllm.outputs import RequestOutput, CompletionOutput
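User code normally reads these objects rather than constructing them, so the import is mainly useful for type annotations. A small sketch (the helper function below is ours, not part of vLLM):

from vllm.outputs import CompletionOutput, RequestOutput

def first_text(request_output: RequestOutput) -> str:
    # Hypothetical helper: return the text of the first completion of a request.
    completion: CompletionOutput = request_output.outputs[0]
    return completion.text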
I/O Contract
Inputs
The output objects are returned by LLM.generate() and LLM.chat(); they are not directly constructed by the user. The relevant attributes for reading are:
RequestOutput Attributes
| Name | Type | Description |
|---|---|---|
| request_id | str | Unique identifier for this request |
| prompt | str or None | The original prompt string |
| prompt_token_ids | list[int] or None | Token IDs of the prompt |
| prompt_logprobs | PromptLogprobs or None | Log probabilities for prompt tokens (if requested) |
| outputs | list[CompletionOutput] | List of generated completions (length equals the n parameter) |
| finished | bool | Whether the request has fully completed |
| metrics | RequestStateStats or None | Performance metrics for the request |
| num_cached_tokens | int or None | Number of tokens served from the prefix cache |
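A short sketch of reading these request-level fields after a completed generate call; whether metrics and num_cached_tokens are populated depends on engine configuration (e.g. prefix caching), so treat those two lines as illustrative:

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
request_output = llm.generate(["What is Python?"], SamplingParams(max_tokens=32))[0]

print(f"Request ID:    {request_output.request_id}")
print(f"Finished:      {request_output.finished}")
print(f"Prompt tokens: {len(request_output.prompt_token_ids or [])}")
print(f"Completions:   {len(request_output.outputs)}")
# These two may be None depending on engine settings.
print(f"Cached tokens: {request_output.num_cached_tokens}")
print(f"Metrics:       {request_output.metrics}")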
CompletionOutput Attributes
| Name | Type | Description |
|---|---|---|
| index | int | Index of this completion among the n outputs for its request |
| text | str | The generated text |
| token_ids | Sequence[int] | Token IDs of the generated text |
| cumulative_logprob | float or None | Sum of log probabilities of all generated tokens |
| logprobs | SampleLogprobs or None | Per-token log probabilities (if requested via logprobs parameter) |
| finish_reason | str or None | Why generation stopped: "stop" (natural end or stop sequence) or "length" (hit max_tokens) |
| stop_reason | int, str, or None | The specific stop string or token ID that caused stopping, or None |
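As an illustration of the last two fields, a sketch that distinguishes the finish_reason values and inspects stop_reason (the prompt and stop strings are arbitrary examples):

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(max_tokens=64, stop=["\n\n", "###"])
completion = llm.generate(["List three colors:"], params)[0].outputs[0]

if completion.finish_reason == "length":
    print("Hit the max_tokens budget")
elif completion.finish_reason == "stop":
    # stop_reason holds the matched stop string, a stop token ID, or None for a natural EOS.
    print(f"Stopped by: {completion.stop_reason!r}")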
Outputs
| Name | Type | Description |
|---|---|---|
| (attribute access) | various | The individual fields described above, accessed via standard Python attribute notation |
Usage Examples
Basic Text Extraction
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["What is Python?"], params)
# Access the first (and only) request's first completion
request_output = outputs[0]
completion = request_output.outputs[0]
print(completion.text)
print(f"Finish reason: {completion.finish_reason}")
Processing Batch Results
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)
prompts = [
"Define machine learning:",
"Explain neural networks:",
"What is deep learning?",
]
outputs = llm.generate(prompts, params)
for i, output in enumerate(outputs):
    prompt = output.prompt
    generated_text = output.outputs[0].text
    finish_reason = output.outputs[0].finish_reason
    num_tokens = len(output.outputs[0].token_ids)
    print(f"Prompt: {prompt}")
    print(f"Response ({num_tokens} tokens, {finish_reason}): {generated_text}\n")
Multiple Completions per Prompt
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(n=3, temperature=0.9, max_tokens=64)
outputs = llm.generate(["Write a tagline for a coffee shop:"], params)
request_output = outputs[0]
for completion in request_output.outputs:
    print(f"  Completion {completion.index}: {completion.text}")
    print(f"    Cumulative logprob: {completion.cumulative_logprob}")
Extracting Log Probabilities
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(
    temperature=0,
    max_tokens=32,
    logprobs=5,  # Return top-5 log probs per token
)
outputs = llm.generate(["The capital of France is"], params)
completion = outputs[0].outputs[0]
print(f"Generated: {completion.text}")
print(f"Cumulative logprob: {completion.cumulative_logprob}")
# Inspect per-token log probabilities
if completion.logprobs:
    for step, token_logprobs in enumerate(completion.logprobs):
        print(f"  Step {step}: {token_logprobs}")
Checking Finish Reasons for Quality Control
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=100, stop=["\n\n"])
outputs = llm.generate(["Summarize the benefits of exercise:"], params)
completion = outputs[0].outputs[0]
if completion.finish_reason == "stop":
    print("Complete response:", completion.text)
elif completion.finish_reason == "length":
    print("Truncated response (consider increasing max_tokens):", completion.text)