
Implementation:Vllm project Vllm RequestOutput LoRA Access

From Leeroopedia
Revision as of 17:06, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Vllm_project_Vllm_RequestOutput_LoRA_Access.md)


Knowledge Sources
Domains LLM Serving, Model Adaptation, Output Processing
Last Updated 2026-02-08 13:00 GMT

Overview

A concrete tool, provided by vLLM, for accessing inference outputs together with per-request LoRA adapter attribution.

Description

The RequestOutput class (defined at lines 86-193 of vllm/outputs.py) represents the output of a single completion request to the vLLM engine. It contains the request ID, original prompt, prompt token IDs, prompt log probabilities, a list of CompletionOutput objects, a finished flag, and a lora_request attribute identifying which LoRA adapter was used.

The CompletionOutput class (defined at lines 22-65 of vllm/outputs.py) represents one generated sequence within a request. It contains the generated text, token IDs, cumulative log probability, per-token log probabilities, finish reason, stop reason, and its own lora_request attribute for per-sequence adapter attribution.

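LLMEngine.step() returns RequestOutput objects during the inference loop, and each one carries its CompletionOutput objects in outputs. Of the two classes, only CompletionOutput carries the @dataclass decorator in the signature listed on this page. The RequestOutput.add() method supports merging incremental outputs for streaming scenarios.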
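The merging idea behind streaming can be sketched without the engine. The snippet below is a minimal illustration, not vLLM's implementation: CompletionChunk and merge_chunks are hypothetical stand-ins invented here, and the assumption is that incremental pieces are matched to earlier ones by their completion index.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CompletionChunk:
    """Hypothetical stand-in for one incremental completion (not a vLLM class)."""
    index: int
    text: str
    token_ids: list = field(default_factory=list)
    finish_reason: Optional[str] = None


def merge_chunks(merged: dict, incoming: list) -> None:
    """Fold incremental chunks into `merged`, keyed by completion index."""
    for chunk in incoming:
        current = merged.get(chunk.index)
        if current is None:
            # First chunk seen for this index: store a copy.
            merged[chunk.index] = CompletionChunk(
                chunk.index, chunk.text, list(chunk.token_ids), chunk.finish_reason
            )
            continue
        # Later chunk: append the new text and tokens, keep the latest status.
        current.text += chunk.text
        current.token_ids.extend(chunk.token_ids)
        current.finish_reason = chunk.finish_reason


merged = {}
merge_chunks(merged, [CompletionChunk(0, "SELECT ", [10])])
merge_chunks(merged, [CompletionChunk(0, "name", [11], "stop")])
print(merged[0].text)
```

Keying on the index rather than on list position keeps the merge correct when a request produces several sequences and their chunks arrive in different steps.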

Usage

Use these classes to inspect and process outputs from the vLLM engine. Check output.finished to determine if a request is complete. Access output.outputs to get the list of generated sequences. Check output.lora_request or completion.lora_request to identify which LoRA adapter produced the result.
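The checks described above can be combined into a small helper. This is a sketch under stated assumptions: adapter_label and finished_texts_by_adapter are hypothetical names introduced for illustration, the "base_model" label is a convention rather than a vLLM value, and the objects in the demo are SimpleNamespace stand-ins that only mimic the attributes named on this page.

```python
from collections import defaultdict
from types import SimpleNamespace


def adapter_label(output) -> str:
    """Return the LoRA adapter name, or "base_model" when no adapter was used."""
    lora = getattr(output, "lora_request", None)
    return lora.lora_name if lora is not None else "base_model"


def finished_texts_by_adapter(outputs) -> dict:
    """Group the generated texts of finished requests by adapter label."""
    grouped = defaultdict(list)
    for output in outputs:
        if not output.finished:
            continue  # skip requests that are still generating
        for completion in output.outputs:
            grouped[adapter_label(output)].append(completion.text)
    return dict(grouped)


# Stand-in objects for demonstration only; real ones come from the engine.
fake_outputs = [
    SimpleNamespace(
        finished=True,
        lora_request=SimpleNamespace(lora_name="sql-lora"),
        outputs=[SimpleNamespace(text="SELECT 1")],
    ),
    SimpleNamespace(finished=True, lora_request=None,
                    outputs=[SimpleNamespace(text="hello")]),
    SimpleNamespace(finished=False, lora_request=None,
                    outputs=[SimpleNamespace(text="partial")]),
]
print(finished_texts_by_adapter(fake_outputs))
```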

Code Reference

Source Location

  • Repository: vllm
  • File: vllm/outputs.py (lines 86-193 for RequestOutput, lines 22-65 for CompletionOutput)

Signature

# RequestOutput (lines 86-193)
class RequestOutput:
    request_id: str
    prompt: str | None
    prompt_token_ids: list[int] | None
    prompt_logprobs: PromptLogprobs | None
    outputs: list[CompletionOutput]
    finished: bool
    metrics: RequestStateStats | None = None
    lora_request: LoRARequest | None = None
    encoder_prompt: str | None = None
    encoder_prompt_token_ids: list[int] | None = None
    num_cached_tokens: int | None = None

# CompletionOutput (lines 22-65)
@dataclass
class CompletionOutput:
    index: int
    text: str
    token_ids: Sequence[int]
    cumulative_logprob: float | None
    logprobs: SampleLogprobs | None
    routed_experts: np.ndarray | None = None
    finish_reason: str | None = None
    stop_reason: int | str | None = None
    lora_request: LoRARequest | None = None

Import

from vllm.outputs import RequestOutput, CompletionOutput

I/O Contract

Inputs

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| (returned by engine) | -- | -- | RequestOutput and CompletionOutput objects are produced by LLMEngine.step(), not constructed directly by the user. |

Outputs (RequestOutput attributes)

| Name | Type | Description |
| --- | --- | --- |
| request_id | str | The unique string identifier of the request |
| prompt | str or None | The original prompt string submitted with the request |
| prompt_token_ids | list[int] or None | Token IDs of the prompt after tokenization |
| prompt_logprobs | PromptLogprobs or None | Log probabilities for each prompt token (if requested) |
| outputs | list[CompletionOutput] | List of generated completion sequences (one per sequence requested through the n sampling parameter) |
| finished | bool | True when all output sequences have reached a stopping condition |
| metrics | RequestStateStats or None | Performance metrics for the request (time to first token, etc.) |
| lora_request | LoRARequest or None | The LoRA adapter used for this request, or None for the base model |
| num_cached_tokens | int or None | Number of prompt tokens served from the prefix cache |
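As an illustration of how two of these attributes can be combined, the sketch below estimates what fraction of a prompt was served from the prefix cache. The helper name prefix_cache_hit_ratio is hypothetical, and the demo object is a SimpleNamespace stand-in rather than a real RequestOutput.

```python
from types import SimpleNamespace


def prefix_cache_hit_ratio(output):
    """Fraction of prompt tokens served from the prefix cache, or None if unknown."""
    if output.num_cached_tokens is None or not output.prompt_token_ids:
        return None
    return output.num_cached_tokens / len(output.prompt_token_ids)


# Stand-in object: 6 of 8 prompt tokens reported as cached.
fake_output = SimpleNamespace(num_cached_tokens=6, prompt_token_ids=list(range(8)))
print(prefix_cache_hit_ratio(fake_output))
```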

Outputs (CompletionOutput attributes)

| Name | Type | Description |
| --- | --- | --- |
| index | int | Index of this completion within the request (0-based) |
| text | str | The generated output text |
| token_ids | Sequence[int] | Token IDs of the generated output |
| cumulative_logprob | float or None | Sum of log probabilities of all generated tokens |
| logprobs | SampleLogprobs or None | Per-token log probabilities of top candidates (if requested) |
| finish_reason | str or None | Why generation stopped, for example "stop" or "length"; None while generation is still in progress |
| stop_reason | int, str, or None | The specific stop token ID or stop string that triggered completion |
| lora_request | LoRARequest or None | The LoRA adapter that produced this specific completion |
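A short sketch of how these attributes might be read together follows. The helpers mean_logprob and describe_stop are hypothetical names introduced here, and the completion object is a SimpleNamespace stand-in that only mimics the attributes in the table above.

```python
from types import SimpleNamespace


def mean_logprob(completion):
    """Average log probability per generated token, or None if unavailable."""
    if completion.cumulative_logprob is None or len(completion.token_ids) == 0:
        return None
    return completion.cumulative_logprob / len(completion.token_ids)


def describe_stop(completion) -> str:
    """Summarize why a completion ended, based on finish_reason and stop_reason."""
    if completion.finish_reason is None:
        return "in progress"
    if completion.finish_reason == "stop" and completion.stop_reason is not None:
        return f"stopped on {completion.stop_reason!r}"
    return completion.finish_reason


# Stand-in completion for demonstration only.
sample = SimpleNamespace(
    cumulative_logprob=-6.0,
    token_ids=[101, 102, 103],
    finish_reason="stop",
    stop_reason="END",
)
print(mean_logprob(sample), describe_stop(sample))
```

Dividing by the token count makes scores comparable across completions of different lengths, which the raw cumulative sum is not.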

Usage Examples

Process Multi-LoRA Outputs

from vllm import LLMEngine, RequestOutput

# In the engine processing loop (assumes `engine` is an existing LLMEngine instance)
request_outputs: list[RequestOutput] = engine.step()

for request_output in request_outputs:
    if request_output.finished:
        # Identify which adapter was used
        adapter_name = (
            request_output.lora_request.lora_name
            if request_output.lora_request
            else "base_model"
        )
        print(f"Request {request_output.request_id} [{adapter_name}]:")

        # Access each completion sequence
        for completion in request_output.outputs:
            print(f"  Output {completion.index}: {completion.text}")
            if completion.finish_reason:
                print(f"  Finish reason: {completion.finish_reason}")

Complete Multi-LoRA Serving Loop with Output Processing

from vllm import EngineArgs, LLMEngine, SamplingParams, RequestOutput
from vllm.lora.request import LoRARequest
from huggingface_hub import snapshot_download

# Setup
engine = LLMEngine.from_engine_args(EngineArgs(
    model="meta-llama/Llama-3.2-3B-Instruct",
    enable_lora=True, max_loras=1, max_lora_rank=8, max_cpu_loras=2,
))
lora_path = snapshot_download(repo_id="jeeejeee/llama32-3b-text2sql-spider")

# Submit requests
engine.add_request("0", "A robot may not injure a human being",
                   SamplingParams(temperature=0.0, max_tokens=128))
engine.add_request("1", "[user] Write a SQL query... [/user] [assistant]",
                   SamplingParams(temperature=0.0, max_tokens=128),
                   lora_request=LoRARequest("sql-lora", 1, lora_path))

# Process outputs
while engine.has_unfinished_requests():
    outputs = engine.step()
    for output in outputs:
        if output.finished:
            print(output)

Related Pages

Implements Principle
