Implementation:Vllm project Vllm RequestOutput LoRA Access

Knowledge Sources	vLLM vLLM Docs
Domains	LLM Serving, Model Adaptation, Output Processing
Last Updated	2026-02-08 13:00 GMT

Overview

A concrete reference for reading inference outputs produced by vllm and attributing each one to the LoRA adapter that served the request.

Description

The RequestOutput class (defined at lines 86-193 of vllm/outputs.py) represents the output of a single completion request to the vLLM engine. It contains the request ID, original prompt, prompt token IDs, prompt log probabilities, a list of CompletionOutput objects, a finished flag, and a lora_request attribute identifying which LoRA adapter was used.

The CompletionOutput class (defined at lines 22-65 of vllm/outputs.py) represents one generated sequence within a request. It contains the generated text, token IDs, cumulative log probability, per-token log probabilities, finish reason, stop reason, and its own lora_request attribute for per-sequence adapter attribution.

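CompletionOutput is declared as a dataclass, while RequestOutput is declared as a regular class (see the signatures in the Code Reference section). LLMEngine.step() returns a list of RequestOutput objects during the inference loop, and each one carries its CompletionOutput objects in its outputs attribute. The RequestOutput.add() method supports merging incremental outputs for streaming scenarios.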
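The merging idea behind streaming can be illustrated without the engine. The following is a minimal sketch only: Chunk and merge_chunks are hypothetical stand-ins invented for this example and are not vLLM's RequestOutput.add() implementation. It assumes each streamed update carries only the newly generated text and token IDs for a given completion index.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for one streamed completion delta (not a vLLM class)."""
    index: int
    text: str
    token_ids: list = field(default_factory=list)
    finish_reason: Optional[str] = None


def merge_chunks(accumulated: dict, delta: Chunk) -> None:
    """Fold a delta into the running per-index state, in place."""
    current = accumulated.get(delta.index)
    if current is None:
        # First update for this index: store a copy so the caller's delta is not aliased.
        accumulated[delta.index] = Chunk(
            delta.index, delta.text, list(delta.token_ids), delta.finish_reason
        )
        return
    current.text += delta.text
    current.token_ids.extend(delta.token_ids)
    # The most recent delta decides whether the sequence has finished.
    current.finish_reason = delta.finish_reason


state = {}
for delta in [
    Chunk(0, "SELECT ", [1, 2]),
    Chunk(0, "name FROM users", [3, 4, 5], "stop"),
]:
    merge_chunks(state, delta)
```

Keying the state by completion index keeps the sketch valid when a request produces several sequences.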

Usage

Use these classes to inspect and process outputs from the vLLM engine. Check output.finished to determine if a request is complete. Access output.outputs to get the list of generated sequences. Check output.lora_request or completion.lora_request to identify which LoRA adapter produced the result.
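As an illustration of these checks, the sketch below groups finished completions by adapter. The helper names adapter_label and group_finished_text are hypothetical, and the demo objects are SimpleNamespace stand-ins that expose only the attributes described here (finished, outputs, lora_request, lora_name, text), so the snippet runs without an engine. The "base_model" label is an arbitrary choice for requests with no adapter.

```python
from collections import defaultdict
from types import SimpleNamespace


def adapter_label(output) -> str:
    """Return the adapter name, or "base_model" when no LoRA was attached."""
    lora = getattr(output, "lora_request", None)
    return lora.lora_name if lora is not None else "base_model"


def group_finished_text(outputs) -> dict:
    """Collect generated text from finished requests, keyed by adapter label."""
    grouped = defaultdict(list)
    for output in outputs:
        if not output.finished:
            continue  # skip requests that are still generating
        for completion in output.outputs:
            grouped[adapter_label(output)].append(completion.text)
    return dict(grouped)


# Stand-in objects for demonstration; real objects would come from the engine.
demo_outputs = [
    SimpleNamespace(finished=True, lora_request=None,
                    outputs=[SimpleNamespace(text="hello")]),
    SimpleNamespace(finished=True,
                    lora_request=SimpleNamespace(lora_name="sql-lora"),
                    outputs=[SimpleNamespace(text="SELECT 1")]),
    SimpleNamespace(finished=False, lora_request=None,
                    outputs=[SimpleNamespace(text="partial")]),
]
grouped = group_finished_text(demo_outputs)
```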

Code Reference

Source Location

Repository: vllm
File: vllm/outputs.py (lines 86-193 for RequestOutput, lines 22-65 for CompletionOutput)

Signature

# RequestOutput (lines 86-193)
class RequestOutput:
    request_id: str
    prompt: str | None
    prompt_token_ids: list[int] | None
    prompt_logprobs: PromptLogprobs | None
    outputs: list[CompletionOutput]
    finished: bool
    metrics: RequestStateStats | None = None
    lora_request: LoRARequest | None = None
    encoder_prompt: str | None = None
    encoder_prompt_token_ids: list[int] | None = None
    num_cached_tokens: int | None = None

# CompletionOutput (lines 22-65)
@dataclass
class CompletionOutput:
    index: int
    text: str
    token_ids: Sequence[int]
    cumulative_logprob: float | None
    logprobs: SampleLogprobs | None
    routed_experts: np.ndarray | None = None
    finish_reason: str | None = None
    stop_reason: int | str | None = None
    lora_request: LoRARequest | None = None

Import

from vllm.outputs import RequestOutput, CompletionOutput

I/O Contract

Inputs

Name	Type	Required	Description
(returned by engine)	--	--	RequestOutput and CompletionOutput objects are produced by LLMEngine.step(), not constructed directly by the user.

Outputs (RequestOutput attributes)

Name	Type	Description
request_id	str	The unique string identifier of the request
prompt	str or None	The original prompt string submitted with the request
prompt_token_ids	list[int] or None	Token IDs of the prompt after tokenization
prompt_logprobs	PromptLogprobs or None	Log probabilities for each prompt token (if requested)
outputs	list[CompletionOutput]	List of generated completion sequences (one per sample requested through the n sampling parameter)
finished	bool	True when all output sequences have reached a stopping condition
metrics	RequestStateStats or None	Performance metrics for the request (time to first token, etc.)
lora_request	LoRARequest or None	The LoRA adapter used for this request, or None for base model
num_cached_tokens	int or None	Number of prompt tokens served from the prefix cache
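One way these attributes can be combined is to estimate how much of a prompt was served from the prefix cache. This is a sketch under assumptions: cached_prompt_fraction is a hypothetical helper, not part of vLLM, and it assumes num_cached_tokens counts tokens out of prompt_token_ids. The demo object is a stand-in exposing only those two attributes.

```python
from types import SimpleNamespace
from typing import Optional


def cached_prompt_fraction(output) -> Optional[float]:
    """Return num_cached_tokens / len(prompt_token_ids), or None if unknown."""
    if output.num_cached_tokens is None or not output.prompt_token_ids:
        # Either field may be None, and an empty prompt has no meaningful ratio.
        return None
    return output.num_cached_tokens / len(output.prompt_token_ids)


# Stand-in object with only the two attributes this helper reads.
demo_output = SimpleNamespace(prompt_token_ids=[11, 12, 13, 14], num_cached_tokens=3)
fraction = cached_prompt_fraction(demo_output)
```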

Outputs (CompletionOutput attributes)

Name	Type	Description
index	int	Index of this completion within the request (0-based)
text	str	The generated output text
token_ids	Sequence[int]	Token IDs of the generated output
cumulative_logprob	float or None	Sum of log probabilities of all generated tokens
logprobs	SampleLogprobs or None	Per-token log probabilities of top candidates (if requested)
finish_reason	str or None	Why generation stopped, for example "stop" or "length"; None while generation is still in progress
stop_reason	int, str, or None	The specific stop token ID or string that triggered completion
lora_request	LoRARequest or None	The LoRA adapter that produced this specific completion
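The attributes above are enough to build a small per-completion summary. The sketch below is illustrative only: summarize_completion is a hypothetical helper, and the demo object is a stand-in rather than a real CompletionOutput. It treats a finish_reason of "length" as a truncated generation and derives a mean per-token log probability from cumulative_logprob when that value is available.

```python
from types import SimpleNamespace


def summarize_completion(completion) -> dict:
    """Summarize one completion using only its documented attributes."""
    num_tokens = len(completion.token_ids)
    mean_logprob = None
    if completion.cumulative_logprob is not None and num_tokens > 0:
        # Average log probability per generated token.
        mean_logprob = completion.cumulative_logprob / num_tokens
    return {
        "index": completion.index,
        "num_tokens": num_tokens,
        "in_progress": completion.finish_reason is None,
        "truncated": completion.finish_reason == "length",
        "mean_logprob": mean_logprob,
    }


# Stand-in completion for demonstration purposes.
demo_completion = SimpleNamespace(
    index=0,
    text="SELECT 1",
    token_ids=[5, 6, 7],
    cumulative_logprob=-1.5,
    finish_reason="length",
)
summary = summarize_completion(demo_completion)
```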

Usage Examples

Process Multi-LoRA Outputs

from vllm import LLMEngine, RequestOutput

# In the engine processing loop (assumes `engine` is an already-constructed LLMEngine)
request_outputs: list[RequestOutput] = engine.step()

for request_output in request_outputs:
    if request_output.finished:
        # Identify which adapter was used
        adapter_name = (
            request_output.lora_request.lora_name
            if request_output.lora_request
            else "base_model"
        )
        print(f"Request {request_output.request_id} [{adapter_name}]:")

        # Access each completion sequence
        for completion in request_output.outputs:
            print(f"  Output {completion.index}: {completion.text}")
            if completion.finish_reason:
                print(f"  Finish reason: {completion.finish_reason}")

Complete Multi-LoRA Serving Loop with Output Processing

from vllm import EngineArgs, LLMEngine, SamplingParams, RequestOutput
from vllm.lora.request import LoRARequest
from huggingface_hub import snapshot_download

# Setup
engine = LLMEngine.from_engine_args(EngineArgs(
    model="meta-llama/Llama-3.2-3B-Instruct",
    enable_lora=True, max_loras=1, max_lora_rank=8, max_cpu_loras=2,
))
lora_path = snapshot_download(repo_id="jeeejeee/llama32-3b-text2sql-spider")

# Submit requests
engine.add_request("0", "A robot may not injure a human being",
                   SamplingParams(temperature=0.0, max_tokens=128))
engine.add_request("1", "[user] Write a SQL query... [/user] [assistant]",
                   SamplingParams(temperature=0.0, max_tokens=128),
                   lora_request=LoRARequest("sql-lora", 1, lora_path))

# Process outputs
while engine.has_unfinished_requests():
    outputs = engine.step()
    for output in outputs:
        if output.finished:
            print(output)

Related Pages

Implements Principle

Principle:Vllm_project_Vllm_LoRA_Output_Processing
