Implementation:Vllm project Vllm RequestOutput LoRA Access
| Knowledge Sources | |
|---|---|
| Domains | LLM Serving, Model Adaptation, Output Processing |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
Concrete tool, provided by vLLM, for accessing inference outputs together with per-request LoRA adapter attribution.
Description
The RequestOutput class (defined at lines 86-193 of vllm/outputs.py) represents the output of a single completion request to the vLLM engine. It contains the request ID, original prompt, prompt token IDs, prompt log probabilities, a list of CompletionOutput objects, a finished flag, and a lora_request attribute identifying which LoRA adapter was used.
The CompletionOutput class (defined at lines 22-65 of vllm/outputs.py) represents one generated sequence within a request. It contains the generated text, token IDs, cumulative log probability, per-token log probabilities, finish reason, stop reason, and its own lora_request attribute for per-sequence adapter attribution.
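CompletionOutput is declared as a dataclass, while RequestOutput is a regular class whose attributes are set by its constructor (see the signatures below). RequestOutput objects, each carrying its list of CompletionOutput objects, are returned by LLMEngine.step() during the inference loop. The RequestOutput.add() method supports merging incremental outputs for streaming scenarios.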
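The idea behind merging incremental outputs can be illustrated with a small stand-in. This is a sketch under stated assumptions, not vLLM's implementation of RequestOutput.add(): the `StreamedCompletion` class and `merge_chunks` function are hypothetical names, and the sketch assumes each streamed chunk carries only the newly generated text and token IDs for a given completion index.

```python
from dataclasses import dataclass, field


@dataclass
class StreamedCompletion:
    """Hypothetical stand-in for one streamed chunk (not a vLLM class)."""
    index: int
    text: str
    token_ids: list[int] = field(default_factory=list)


def merge_chunks(accumulated: list[StreamedCompletion],
                 incoming: list[StreamedCompletion]) -> None:
    """Append each incoming delta to the accumulated entry with the same index."""
    by_index = {c.index: c for c in accumulated}
    for delta in incoming:
        target = by_index.get(delta.index)
        if target is None:
            # First chunk seen for this index: start a new accumulated entry.
            new = StreamedCompletion(delta.index, delta.text, list(delta.token_ids))
            accumulated.append(new)
            by_index[delta.index] = new
        else:
            # Later chunk: extend the text and token IDs collected so far.
            target.text += delta.text
            target.token_ids.extend(delta.token_ids)


# Usage: two streaming steps for a request that samples two sequences.
acc: list[StreamedCompletion] = []
merge_chunks(acc, [StreamedCompletion(0, "Hel", [1])])
merge_chunks(acc, [StreamedCompletion(0, "lo", [2]), StreamedCompletion(1, "Hi", [3])])
```

Matching on the completion index rather than on list position keeps the merge correct when a step reports only some of the sequences.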
Usage
Use these classes to inspect and process outputs from the vLLM engine. Check output.finished to determine if a request is complete. Access output.outputs to get the list of generated sequences. Check output.lora_request or completion.lora_request to identify which LoRA adapter produced the result.
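The checks described above can be wrapped in small helpers. The following is a minimal sketch: `adapter_label` and `summarize` are hypothetical helper names, and `SimpleNamespace` objects stand in for real engine outputs so that the snippet runs without a model. Only the attributes named in this page (request_id, finished, outputs, lora_request, lora_name) are used.

```python
from types import SimpleNamespace


def adapter_label(output, default="base_model"):
    """Return the LoRA adapter name attached to an output, or a default label."""
    lora = getattr(output, "lora_request", None)
    return lora.lora_name if lora is not None else default


def summarize(output):
    """Build a one-line status string from the attributes described above."""
    state = "finished" if output.finished else "running"
    count = len(output.outputs)
    return f"{output.request_id} [{adapter_label(output)}] {state}: {count} sequence(s)"


# Stand-ins for engine outputs (assumption: real objects expose the same attributes).
base = SimpleNamespace(request_id="0", finished=True, outputs=["a"], lora_request=None)
sql = SimpleNamespace(
    request_id="1",
    finished=False,
    outputs=["a", "b"],
    lora_request=SimpleNamespace(lora_name="sql-lora"),
)
print(summarize(base))
print(summarize(sql))
```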
Code Reference
Source Location
- Repository: vllm
- File: vllm/outputs.py (lines 86-193 for RequestOutput, lines 22-65 for CompletionOutput)
Signature
# RequestOutput (lines 86-193)
class RequestOutput:
    request_id: str
    prompt: str | None
    prompt_token_ids: list[int] | None
    prompt_logprobs: PromptLogprobs | None
    outputs: list[CompletionOutput]
    finished: bool
    metrics: RequestStateStats | None = None
    lora_request: LoRARequest | None = None
    encoder_prompt: str | None = None
    encoder_prompt_token_ids: list[int] | None = None
    num_cached_tokens: int | None = None

# CompletionOutput (lines 22-65)
@dataclass
class CompletionOutput:
    index: int
    text: str
    token_ids: Sequence[int]
    cumulative_logprob: float | None
    logprobs: SampleLogprobs | None
    routed_experts: np.ndarray | None = None
    finish_reason: str | None = None
    stop_reason: int | str | None = None
    lora_request: LoRARequest | None = None
Import
from vllm.outputs import RequestOutput, CompletionOutput
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| (returned by engine) | -- | -- | RequestOutput and CompletionOutput objects are produced by LLMEngine.step(), not constructed directly by the user. |
Outputs (RequestOutput attributes)
| Name | Type | Description |
|---|---|---|
| request_id | str | The unique string identifier of the request |
| prompt | str or None | The original prompt string submitted with the request |
| prompt_token_ids | list[int] or None | Token IDs of the prompt after tokenization |
| prompt_logprobs | PromptLogprobs or None | Log probabilities for each prompt token (if requested) |
| outputs | list[CompletionOutput] | List of generated completion sequences (one per sample requested through the n sampling parameter) |
| finished | bool | True when all output sequences have reached a stopping condition |
| metrics | RequestStateStats or None | Performance metrics for the request (time to first token, etc.) |
| lora_request | LoRARequest or None | The LoRA adapter used for this request, or None for base model |
| num_cached_tokens | int or None | Number of prompt tokens served from the prefix cache |
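As an illustration of how num_cached_tokens can be combined with prompt_token_ids, the sketch below computes the fraction of a prompt that was served from the prefix cache. The helper name `prefix_cache_hit_ratio` is hypothetical, and treating the prompt length as the denominator is an assumption of this example rather than something the source defines. `SimpleNamespace` objects stand in for real outputs.

```python
from types import SimpleNamespace


def prefix_cache_hit_ratio(output):
    """Return cached prompt tokens / total prompt tokens, or None if unknown."""
    cached = output.num_cached_tokens
    prompt_ids = output.prompt_token_ids
    # Either field may be None, and an empty prompt has no meaningful ratio.
    if cached is None or not prompt_ids:
        return None
    return cached / len(prompt_ids)


# Stand-ins for engine outputs (assumption: real objects expose the same attributes).
warm = SimpleNamespace(num_cached_tokens=3, prompt_token_ids=[11, 12, 13, 14])
cold = SimpleNamespace(num_cached_tokens=0, prompt_token_ids=[11, 12, 13, 14])
unknown = SimpleNamespace(num_cached_tokens=None, prompt_token_ids=[11, 12])
```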
Outputs (CompletionOutput attributes)
| Name | Type | Description |
|---|---|---|
| index | int | Index of this completion within the request (0-based) |
| text | str | The generated output text |
| token_ids | Sequence[int] | Token IDs of the generated output |
| cumulative_logprob | float or None | Sum of log probabilities of all generated tokens |
| logprobs | SampleLogprobs or None | Per-token log probabilities of top candidates (if requested) |
| finish_reason | str or None | Why generation stopped: "stop", "length", or None if still in progress |
| stop_reason | int, str, or None | The specific stop token ID or string that triggered completion |
| lora_request | LoRARequest or None | The LoRA adapter that produced this specific completion |
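When a request samples several sequences, the attributes above are enough to rank completions and to spot truncated ones. The sketch below is one possible approach, not a vLLM utility: `best_completion` and `truncated_indices` are hypothetical helper names, and `SimpleNamespace` objects stand in for CompletionOutput instances.

```python
from types import SimpleNamespace


def best_completion(completions):
    """Return the completion with the highest cumulative_logprob, if any is set."""
    scored = [c for c in completions if c.cumulative_logprob is not None]
    if not scored:
        return None
    return max(scored, key=lambda c: c.cumulative_logprob)


def truncated_indices(completions):
    """Return indices of completions that stopped because of the length limit."""
    return [c.index for c in completions if c.finish_reason == "length"]


# Stand-ins for CompletionOutput objects (assumption: same attribute names).
candidates = [
    SimpleNamespace(index=0, text="SELECT 1", cumulative_logprob=-4.2, finish_reason="stop"),
    SimpleNamespace(index=1, text="SELECT", cumulative_logprob=-1.5, finish_reason="length"),
    SimpleNamespace(index=2, text="", cumulative_logprob=None, finish_reason=None),
]
```

Note that cumulative_logprob is a sum over generated tokens, so shorter sequences tend to score higher; normalizing by length may be preferable depending on the use case.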
Usage Examples
Process Multi-LoRA Outputs
from vllm import LLMEngine, RequestOutput

# In the engine processing loop; `engine` is an already-constructed LLMEngine
request_outputs: list[RequestOutput] = engine.step()

for request_output in request_outputs:
    if request_output.finished:
        # Identify which adapter was used
        adapter_name = (
            request_output.lora_request.lora_name
            if request_output.lora_request
            else "base_model"
        )
        print(f"Request {request_output.request_id} [{adapter_name}]:")

        # Access each completion sequence
        for completion in request_output.outputs:
            print(f"  Output {completion.index}: {completion.text}")
            if completion.finish_reason:
                print(f"  Finish reason: {completion.finish_reason}")
Complete Multi-LoRA Serving Loop with Output Processing
from vllm import EngineArgs, LLMEngine, SamplingParams, RequestOutput
from vllm.lora.request import LoRARequest
from huggingface_hub import snapshot_download

# Setup
engine = LLMEngine.from_engine_args(EngineArgs(
    model="meta-llama/Llama-3.2-3B-Instruct",
    enable_lora=True, max_loras=1, max_lora_rank=8, max_cpu_loras=2,
))
lora_path = snapshot_download(repo_id="jeeejeee/llama32-3b-text2sql-spider")

# Submit requests
engine.add_request("0", "A robot may not injure a human being",
                   SamplingParams(temperature=0.0, max_tokens=128))
engine.add_request("1", "[user] Write a SQL query... [/user] [assistant]",
                   SamplingParams(temperature=0.0, max_tokens=128),
                   lora_request=LoRARequest("sql-lora", 1, lora_path))

# Process outputs
while engine.has_unfinished_requests():
    outputs = engine.step()
    for output in outputs:
        if output.finished:
            print(output)
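Instead of printing each finished output, a serving loop may want to collect results per adapter. The following is a minimal sketch of that post-processing step: `group_texts_by_adapter` is a hypothetical helper, and `SimpleNamespace` objects stand in for the outputs returned by engine.step() so the snippet runs without loading a model.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_texts_by_adapter(outputs, default="base_model"):
    """Map adapter name -> generated texts, considering finished requests only."""
    grouped = defaultdict(list)
    for output in outputs:
        if not output.finished:
            continue  # Skip requests that are still generating.
        lora = output.lora_request
        name = lora.lora_name if lora is not None else default
        grouped[name].extend(c.text for c in output.outputs)
    return dict(grouped)


# Stand-ins for one step's outputs (assumption: same attribute names as above).
step_outputs = [
    SimpleNamespace(finished=True, lora_request=None,
                    outputs=[SimpleNamespace(text="or, through inaction...")]),
    SimpleNamespace(finished=True, lora_request=SimpleNamespace(lora_name="sql-lora"),
                    outputs=[SimpleNamespace(text="SELECT name FROM singer")]),
    SimpleNamespace(finished=False, lora_request=None,
                    outputs=[SimpleNamespace(text="partial")]),
]
results = group_texts_by_adapter(step_outputs)
```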