Implementation: vLLM Project - vLLM RequestOutput Access
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Natural Language Processing, Software Engineering |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
Concrete tool for accessing and interpreting generation results provided by vLLM's output data classes.
Description
vLLM represents generation results using two primary dataclasses:
- RequestOutput: The top-level container for a single request's results. It holds the original prompt, prompt token IDs, prompt log probabilities, a list of CompletionOutput objects, and metadata such as the request ID, finish status, and performance metrics.
- CompletionOutput: Represents a single generated completion within a request. When n=1 (the default), there is one CompletionOutput per request. When n > 1, each request contains multiple CompletionOutput objects indexed by their position.
CompletionOutput is a Python @dataclass, while RequestOutput is a plain class with an explicit __init__; both expose their fields as ordinary attributes, so access is straightforward.
Usage
Access these objects after every call to LLM.generate() or LLM.chat(). The return value is always list[RequestOutput], with one entry per input prompt, in the same order as the inputs.
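A minimal sketch of that return shape (reusing the model from the examples below): outputs[i] corresponds to prompts[i], and each entry holds its completions under .outputs.

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
prompts = ["Hello", "Goodbye"]
outputs = llm.generate(prompts, SamplingParams(max_tokens=16))

# outputs preserves prompt order, so zipping with the prompts is safe.
for prompt, request_output in zip(prompts, outputs):
    print(prompt, "->", request_output.outputs[0].text)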
Code Reference
Source Location
- Repository: vllm
- File: vllm/outputs.py
- Lines: 22-193
Signature
@dataclass
class CompletionOutput:
    index: int
    text: str
    token_ids: Sequence[int]
    cumulative_logprob: float | None
    logprobs: SampleLogprobs | None
    routed_experts: np.ndarray | None = None
    finish_reason: str | None = None
    stop_reason: int | str | None = None
    lora_request: LoRARequest | None = None

class RequestOutput:
    def __init__(
        self,
        request_id: str,
        prompt: str | None,
        prompt_token_ids: list[int] | None,
        prompt_logprobs: PromptLogprobs | None,
        outputs: list[CompletionOutput],
        finished: bool,
        metrics: RequestStateStats | None = None,
        lora_request: LoRARequest | None = None,
        encoder_prompt: str | None = None,
        encoder_prompt_token_ids: list[int] | None = None,
        num_cached_tokens: int | None = None,
        *,
        multi_modal_placeholders: MultiModalPlaceholderDict | None = None,
        kv_transfer_params: dict[str, Any] | None = None,
        **kwargs: Any,
    ) -> None
Import
from vllm.outputs import RequestOutput, CompletionOutput
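User code normally reads these objects rather than constructing them, so the import is mainly useful for type annotations. A small sketch (the helper function below is ours, not part of vLLM):

from vllm.outputs import CompletionOutput, RequestOutput

def first_text(request_output: RequestOutput) -> str:
    # Hypothetical helper: return the text of the first completion of a request.
    completion: CompletionOutput = request_output.outputs[0]
    return completion.text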
I/O Contract
Inputs
The output objects are returned by LLM.generate() and LLM.chat(); they are not directly constructed by the user. The relevant attributes for reading are:
RequestOutput Attributes
| Name | Type | Description |
|---|---|---|
| request_id | str | Unique identifier for this request |
| prompt | str or None | The original prompt string |
| prompt_token_ids | list[int] or None | Token IDs of the prompt |
| prompt_logprobs | PromptLogprobs or None | Log probabilities for prompt tokens (if requested) |
| outputs | list[CompletionOutput] | List of generated completions (length equals the n parameter) |
| finished | bool | Whether the request has fully completed |
| metrics | RequestStateStats or None | Performance metrics for the request |
| num_cached_tokens | int or None | Number of tokens served from the prefix cache |
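A short sketch of reading these request-level fields after a completed generate call; whether metrics and num_cached_tokens are populated depends on engine configuration (e.g. prefix caching), so treat those two lines as illustrative:

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
request_output = llm.generate(["What is Python?"], SamplingParams(max_tokens=32))[0]

print(f"Request ID:    {request_output.request_id}")
print(f"Finished:      {request_output.finished}")
print(f"Prompt tokens: {len(request_output.prompt_token_ids or [])}")
print(f"Completions:   {len(request_output.outputs)}")
# These two may be None depending on engine settings.
print(f"Cached tokens: {request_output.num_cached_tokens}")
print(f"Metrics:       {request_output.metrics}")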
CompletionOutput Attributes
| Name | Type | Description |
|---|---|---|
| index | int | Index of this completion among the n outputs for its request |
| text | str | The generated text |
| token_ids | Sequence[int] | Token IDs of the generated text |
| cumulative_logprob | float or None | Sum of log probabilities of all generated tokens |
| logprobs | SampleLogprobs or None | Per-token log probabilities (if requested via logprobs parameter) |
| finish_reason | str or None | Why generation stopped: "stop" (natural end or stop sequence) or "length" (hit max_tokens) |
| stop_reason | int, str, or None | The specific stop string or token ID that caused stopping, or None |
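As an illustration of the last two fields, a sketch that distinguishes the finish_reason values and inspects stop_reason (the prompt and stop strings are arbitrary examples):

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(max_tokens=64, stop=["\n\n", "###"])
completion = llm.generate(["List three colors:"], params)[0].outputs[0]

if completion.finish_reason == "length":
    print("Hit the max_tokens budget")
elif completion.finish_reason == "stop":
    # stop_reason holds the matched stop string, a stop token ID, or None for a natural EOS.
    print(f"Stopped by: {completion.stop_reason!r}")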
Outputs
| Name | Type | Description |
|---|---|---|
| (attribute access) | various | The individual fields described above, accessed via standard Python attribute notation |
Usage Examples
Basic Text Extraction
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["What is Python?"], params)
# Access the first (and only) request's first completion
request_output = outputs[0]
completion = request_output.outputs[0]
print(completion.text)
print(f"Finish reason: {completion.finish_reason}")
Processing Batch Results
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)
prompts = [
"Define machine learning:",
"Explain neural networks:",
"What is deep learning?",
]
outputs = llm.generate(prompts, params)
for i, output in enumerate(outputs):
    prompt = output.prompt
    generated_text = output.outputs[0].text
    finish_reason = output.outputs[0].finish_reason
    num_tokens = len(output.outputs[0].token_ids)
    print(f"Prompt: {prompt}")
    print(f"Response ({num_tokens} tokens, {finish_reason}): {generated_text}\n")
Multiple Completions per Prompt
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(n=3, temperature=0.9, max_tokens=64)
outputs = llm.generate(["Write a tagline for a coffee shop:"], params)
request_output = outputs[0]
for completion in request_output.outputs:
    print(f"  Completion {completion.index}: {completion.text}")
    print(f"    Cumulative logprob: {completion.cumulative_logprob}")
Extracting Log Probabilities
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(
    temperature=0,
    max_tokens=32,
    logprobs=5,  # Return top-5 log probs per token
)
outputs = llm.generate(["The capital of France is"], params)
completion = outputs[0].outputs[0]
print(f"Generated: {completion.text}")
print(f"Cumulative logprob: {completion.cumulative_logprob}")
# Inspect per-token log probabilities
if completion.logprobs:
    for step, token_logprobs in enumerate(completion.logprobs):
        print(f"  Step {step}: {token_logprobs}")
Checking Finish Reasons for Quality Control
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=100, stop=["\n\n"])
outputs = llm.generate(["Summarize the benefits of exercise:"], params)
completion = outputs[0].outputs[0]
if completion.finish_reason == "stop":
    print("Complete response:", completion.text)
elif completion.finish_reason == "length":
    print("Truncated response (consider increasing max_tokens):", completion.text)