Principle:Vllm project Vllm LoRA Output Processing

Knowledge Sources	vLLM, vLLM LoRA Docs
Domains	LLM Serving, Model Adaptation, Output Processing
Last Updated	2026-02-08 13:00 GMT

Overview

LoRA output processing is the mechanism by which inference results are structured to include both the generated text and the identity of the LoRA adapter that produced them.

Description

When a multi-LoRA engine processes requests, each output must carry not only the generated text, token IDs, and log probabilities, but also information about which LoRA adapter (if any) was used to produce the result. This adapter attribution is essential for correctly routing responses back to callers, for logging and debugging, and for systems that need to track which fine-tuned model variant generated a particular output.

The output processing layer produces a two-level structure: a request-level output object that contains metadata about the overall request (prompt, finish status, metrics, LoRA adapter), and one or more completion-level output objects that contain the generated sequences. Both levels carry a reference to the LoRA adapter, enabling adapter attribution at the granularity needed by the consuming application.
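The two-level structure can be pictured with a minimal sketch. The classes below (AdapterRef, CompletionSketch, RequestSketch) and the adapter_label helper are hypothetical stand-ins written for illustration; they are not the vLLM classes and carry only a subset of the fields described above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AdapterRef:
    """Hypothetical stand-in for the adapter reference carried by outputs."""
    lora_name: str
    lora_int_id: int


@dataclass
class CompletionSketch:
    """Sequence level: one generated sequence within a request."""
    index: int
    text: str
    token_ids: List[int]
    lora_request: Optional[AdapterRef] = None


@dataclass
class RequestSketch:
    """Request level: request metadata plus one or more completions."""
    request_id: str
    prompt: str
    outputs: List[CompletionSketch]
    finished: bool = False
    lora_request: Optional[AdapterRef] = None


def adapter_label(obj) -> str:
    """Return the adapter name, or 'base' when no adapter was used."""
    if obj.lora_request is None:
        return "base"
    return obj.lora_request.lora_name


adapter = AdapterRef(lora_name="sql-adapter", lora_int_id=1)
request = RequestSketch(
    request_id="req-0",
    prompt="Translate to SQL: list all users",
    outputs=[CompletionSketch(0, "SELECT * FROM users;", [11, 12, 13], adapter)],
    finished=True,
    lora_request=adapter,
)

print(adapter_label(request))             # request-level attribution
print(adapter_label(request.outputs[0]))  # sequence-level attribution
```

Because both levels expose the same attribute name, a single helper can serve callers that work per request and callers that work per sequence.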

Usage

Use LoRA output processing when:

Consuming outputs from a multi-LoRA engine and identifying which adapter produced each result
Building response routing logic that maps outputs back to adapter-specific handlers
Logging or auditing which fine-tuned model variant generated a particular response
Implementing a continuous batching loop that must distinguish between base-model and adapter-augmented outputs
Aggregating streaming outputs that arrive incrementally across multiple engine steps
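As one illustration of the routing and auditing cases above, the sketch below groups finished outputs by adapter name. OutputSketch, AdapterRef, BASE_MODEL_KEY, and group_by_adapter are hypothetical names introduced here; the only assumption about real outputs is that they expose a lora_request attribute that is None for base-model requests.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class AdapterRef:
    """Hypothetical adapter reference; only the name is needed here."""
    lora_name: str


@dataclass
class OutputSketch:
    """Hypothetical request-level output with adapter attribution."""
    request_id: str
    text: str
    lora_request: Optional[AdapterRef] = None


# Hypothetical label for outputs produced without an adapter.
BASE_MODEL_KEY = "<base>"


def group_by_adapter(outputs: Iterable[OutputSketch]) -> Dict[str, List[str]]:
    """Map each adapter name to the request IDs it served."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for out in outputs:
        if out.lora_request is None:
            key = BASE_MODEL_KEY
        else:
            key = out.lora_request.lora_name
        groups[key].append(out.request_id)
    return dict(groups)


sql = AdapterRef("sql-adapter")
batch = [
    OutputSketch("req-0", "SELECT 1;", sql),
    OutputSketch("req-1", "Hello there."),  # base model, no adapter
    OutputSketch("req-2", "SELECT 2;", sql),
]
routed = group_by_adapter(batch)
print(routed)
```

The same grouping key could index a table of adapter-specific handlers or feed an audit log.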

Theoretical Basis

The output processing design follows a hierarchical output model:

RequestOutput (Request Level): Each RequestOutput represents the complete or in-progress output of a single inference request. It contains the request ID, the original prompt, a finished flag, and a lora_request attribute that identifies the adapter used. The finished attribute is a boolean indicating whether all output sequences have reached a stopping condition (EOS token, max tokens, or stop string).

CompletionOutput (Sequence Level): Each CompletionOutput represents one generated sequence within a request (there may be multiple when n > 1). It contains the generated text, token IDs, cumulative log probability, per-token log probabilities, finish reason, and its own lora_request reference for per-sequence adapter attribution.

Incremental Updates: The RequestOutput.add() method merges incremental outputs from successive engine steps. In streaming mode, each step produces a partial output that is folded into the accumulated result. The merge logic supports two modes: aggregation, which appends newly generated tokens to the existing sequence, and replacement, which overwrites the existing sequence with the latest state.
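The difference between the two merge modes can be sketched as follows. This is a simplified illustration under stated assumptions, not the vLLM implementation: the merge function, its aggregate flag, and the two sketch classes are hypothetical, and only text, token IDs, and finish reason are merged.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CompletionSketch:
    """Hypothetical sequence-level output."""
    index: int
    text: str
    token_ids: List[int]
    finish_reason: Optional[str] = None


@dataclass
class RequestSketch:
    """Hypothetical request-level output."""
    request_id: str
    outputs: List[CompletionSketch] = field(default_factory=list)
    finished: bool = False


def merge(current: RequestSketch, update: RequestSketch, aggregate: bool) -> None:
    """Fold `update` into `current`, matching completions by index."""
    current.finished = current.finished or update.finished
    position = {c.index: i for i, c in enumerate(current.outputs)}
    for new in update.outputs:
        i = position.get(new.index)
        if i is None:
            # First time this sequence index appears: adopt it as-is.
            position[new.index] = len(current.outputs)
            current.outputs.append(new)
        elif aggregate:
            # Aggregation: the update holds only the new delta, so append it.
            old = current.outputs[i]
            old.text += new.text
            old.token_ids.extend(new.token_ids)
            old.finish_reason = new.finish_reason
        else:
            # Replacement: the update holds the full latest state.
            current.outputs[i] = new


# Aggregation mode: each step carries a delta.
acc = RequestSketch("req-0", [CompletionSketch(0, "Hel", [1, 2])])
step = RequestSketch("req-0", [CompletionSketch(0, "lo", [3], "stop")], finished=True)
merge(acc, step, aggregate=True)

# Replacement mode: each step carries the cumulative state.
latest = RequestSketch("req-1", [CompletionSketch(0, "Hel", [1, 2])])
full = RequestSketch("req-1", [CompletionSketch(0, "Hello", [1, 2, 3])])
merge(latest, full, aggregate=False)

print(acc.outputs[0].text, latest.outputs[0].text)
```

Matching by sequence index rather than list position keeps the merge correct when a step reports only a subset of the sequences in a request.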

Finish Semantics: A request is finished when all its completion sequences have a non-None finish_reason. The finish reason can be "stop" (stop string or EOS), "length" (max tokens reached), or other engine-specific reasons.
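The finish rule above reduces to a check over the completion sequences. In this sketch, is_finished and finish_reason_counts are hypothetical helpers, and SimpleNamespace objects stand in for completion outputs; the only attribute assumed is finish_reason.

```python
from collections import Counter
from types import SimpleNamespace


def is_finished(completions) -> bool:
    """True when there is at least one sequence and all have a finish reason."""
    return bool(completions) and all(
        c.finish_reason is not None for c in completions
    )


def finish_reason_counts(completions) -> Counter:
    """Tally finish reasons across the sequences that have stopped."""
    return Counter(
        c.finish_reason for c in completions if c.finish_reason is not None
    )


# Two sequences for one request (n = 2); the second is still generating.
completions = [
    SimpleNamespace(index=0, finish_reason="stop"),
    SimpleNamespace(index=1, finish_reason=None),
]
still_running = is_finished(completions)

# The second sequence hits the token limit on a later step.
completions[1].finish_reason = "length"
now_done = is_finished(completions)

print(still_running, now_done, dict(finish_reason_counts(completions)))
```

The explicit emptiness check is a design choice in this sketch: all() returns True for an empty list, which would otherwise report a request with no sequences as finished.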

Related Pages

Implemented By

Implementation:Vllm_project_Vllm_RequestOutput_LoRA_Access
