Principle:Vllm project Vllm LoRA Output Processing

From Leeroopedia


Knowledge Sources
Domains: LLM Serving, Model Adaptation, Output Processing
Last Updated: 2026-02-08 13:00 GMT

Overview

LoRA output processing is the mechanism by which inference results are structured to include both the generated text and the identity of the LoRA adapter that produced them.

Description

When a multi-LoRA engine processes requests, each output must carry not only the generated text, token IDs, and log probabilities, but also information about which LoRA adapter (if any) was used to produce the result. This adapter attribution is essential for correctly routing responses back to callers, for logging and debugging, and for systems that need to track which fine-tuned model variant generated a particular output.

The output processing layer produces a two-level structure: a request-level output object that contains metadata about the overall request (prompt, finish status, metrics, LoRA adapter), and one or more completion-level output objects that contain the generated sequences. Both levels carry a reference to the LoRA adapter, enabling adapter attribution at the granularity needed by the consuming application.

Usage

Use LoRA output processing when:

  • Consuming outputs from a multi-LoRA engine and needing to identify which adapter produced each result
  • Building response routing logic that maps outputs back to adapter-specific handlers
  • Logging or auditing which fine-tuned model variant generated a particular response
  • Implementing a continuous batching loop that must distinguish between base-model and adapter-augmented outputs
  • Aggregating streaming outputs that arrive incrementally across multiple engine steps
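The routing and auditing use cases above can be sketched as follows. This is a minimal illustration, not the engine's own code: FakeLoRARequest and FakeRequestOutput are hypothetical stand-ins that carry only the fields used here, and the lora_name field and the "base" label for adapter-free outputs are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FakeLoRARequest:
    # Hypothetical stand-in for the engine's LoRA request object.
    lora_name: str


@dataclass
class FakeRequestOutput:
    # Hypothetical stand-in carrying only the fields this sketch needs.
    request_id: str
    text: str
    lora_request: Optional[FakeLoRARequest] = None


def adapter_key(output: FakeRequestOutput) -> str:
    """Return the adapter name, or 'base' when no adapter was attached."""
    return output.lora_request.lora_name if output.lora_request else "base"


def group_by_adapter(outputs: List[FakeRequestOutput]) -> Dict[str, List[str]]:
    """Bucket request IDs by the adapter that produced each output."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for out in outputs:
        groups[adapter_key(out)].append(out.request_id)
    return dict(groups)


outputs = [
    FakeRequestOutput("r1", "SELECT 1", FakeLoRARequest("sql-adapter")),
    FakeRequestOutput("r2", "Hello"),  # no adapter: base-model output
    FakeRequestOutput("r3", "SELECT 2", FakeLoRARequest("sql-adapter")),
]
grouped = group_by_adapter(outputs)
```

Treating a missing lora_request as the base model keeps the routing table total, so every output lands in exactly one bucket.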

Theoretical Basis

The output processing design follows a hierarchical output model:

RequestOutput (Request Level): Each RequestOutput represents the complete or in-progress output of a single inference request. It contains the request ID, the original prompt, a finished flag, and a lora_request attribute that identifies the adapter used. The finished attribute is a boolean indicating whether all output sequences have reached a stopping condition (EOS token, max tokens, or stop string).

CompletionOutput (Sequence Level): Each CompletionOutput represents one generated sequence within a request (there may be multiple when n > 1). It contains the generated text, token IDs, cumulative log probability, per-token log probabilities, finish reason, and its own lora_request reference for per-sequence adapter attribution.
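The two levels described above can be pictured with a pair of dataclasses. This is a simplified sketch under stated assumptions: LoRARef, CompletionSketch, and RequestSketch are hypothetical names chosen for illustration, and they mirror only a subset of the attributes listed in the text rather than the engine's actual class definitions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LoRARef:
    # Hypothetical adapter reference; the field name is an assumption.
    lora_name: str


@dataclass
class CompletionSketch:
    # Sequence level: one generated sequence within a request.
    index: int
    text: str
    token_ids: List[int]
    cumulative_logprob: Optional[float] = None
    finish_reason: Optional[str] = None
    lora_request: Optional[LoRARef] = None


@dataclass
class RequestSketch:
    # Request level: metadata plus one or more completions.
    request_id: str
    prompt: str
    outputs: List[CompletionSketch]
    finished: bool = False
    lora_request: Optional[LoRARef] = None


def describe(req: RequestSketch) -> List[str]:
    """Render one line per completion, attributed to its adapter."""
    lines = []
    for comp in req.outputs:
        ref = comp.lora_request or req.lora_request
        adapter = ref.lora_name if ref else "base model"
        lines.append(f"{req.request_id}[{comp.index}] via {adapter}: {comp.text}")
    return lines


adapter = LoRARef("sql-adapter")
request = RequestSketch(
    request_id="r1",
    prompt="Count the users",
    outputs=[  # two sequences, as when n = 2
        CompletionSketch(0, "SELECT COUNT(*) FROM users", [11, 12], lora_request=adapter),
        CompletionSketch(1, "SELECT COUNT(id) FROM users", [11, 13], lora_request=adapter),
    ],
    lora_request=adapter,
)
lines = describe(request)
```

Falling back from the completion-level reference to the request-level one is a choice made for this sketch; a consumer that only needs per-request attribution can read the request-level attribute alone.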

Incremental Updates: The RequestOutput.add() method merges incremental outputs from successive engine steps. In streaming mode, each step produces a partial output that is folded into the accumulated result. The merge logic supports two modes: aggregation, which appends the newly generated text and tokens to the existing sequence, and replacement, which overwrites the sequence with the latest state.
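The two merge modes can be illustrated with a small standalone function. This is a sketch of the idea only, not the body of RequestOutput.add(): the Seq and Req classes, the merge function, and its aggregate parameter are hypothetical names, and matching sequences by their index is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Seq:
    # Hypothetical stand-in for one completion sequence.
    index: int
    text: str = ""
    token_ids: List[int] = field(default_factory=list)
    finish_reason: Optional[str] = None


@dataclass
class Req:
    # Hypothetical stand-in for a request-level output.
    request_id: str
    outputs: List[Seq]
    finished: bool = False


def merge(current: Req, update: Req, aggregate: bool) -> None:
    """Fold `update` into `current`, matching sequences by index."""
    current.finished = current.finished or update.finished
    positions = {seq.index: pos for pos, seq in enumerate(current.outputs)}
    for nxt in update.outputs:
        pos = positions.get(nxt.index)
        if pos is None:
            # First time this sequence index is seen: adopt it as-is.
            positions[nxt.index] = len(current.outputs)
            current.outputs.append(nxt)
        elif aggregate:
            # Aggregation: the update is a delta, so append it.
            cur = current.outputs[pos]
            cur.text += nxt.text
            cur.token_ids.extend(nxt.token_ids)
            cur.finish_reason = nxt.finish_reason
        else:
            # Replacement: the update is the full latest state.
            current.outputs[pos] = nxt


delta_acc = Req("r1", [Seq(0, "Hel", [1, 2])])
merge(delta_acc, Req("r1", [Seq(0, "lo", [3], "stop")], finished=True), aggregate=True)

full_acc = Req("r2", [Seq(0, "Hel", [1, 2])])
merge(full_acc, Req("r2", [Seq(0, "Hello", [1, 2, 3])]), aggregate=False)
```

Both accumulators end up holding the text "Hello"; the difference is whether each step delivered only the new tokens or the whole sequence so far.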

Finish Semantics: A request is finished when all its completion sequences have a non-None finish_reason. The finish reason can be "stop" (stop string or EOS), "length" (max tokens reached), or other engine-specific reasons.
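The finish rule above reduces to a single predicate over the per-sequence finish reasons. The helper below is a hypothetical illustration of that rule, not an engine function; treating a request with no sequences as unfinished is an assumption added for the sketch.

```python
from typing import List, Optional


def is_finished(finish_reasons: List[Optional[str]]) -> bool:
    """True when every sequence has a non-None finish reason.

    An empty list is treated as unfinished (an assumption of this sketch),
    because all() over an empty iterable would otherwise return True.
    """
    return bool(finish_reasons) and all(r is not None for r in finish_reasons)


still_running = is_finished(["stop", None])   # one sequence still generating
all_done = is_finished(["stop", "length"])    # both sequences have stopped
no_sequences = is_finished([])                # nothing generated yet
```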

Related Pages

Implemented By
