Workflow:Vllm project Vllm Multi LoRA Serving

Knowledge Sources	vLLM vLLM Docs PEFT Documentation
Domains	LLMs, Inference, LoRA, Fine_Tuning
Last Updated	2026-02-08 13:00 GMT

Overview

End-to-end process for serving multiple LoRA adapters concurrently on a single base model using vLLM's multi-LoRA inference engine.

Description

This workflow covers loading a base LLM and dynamically applying multiple Low-Rank Adaptation (LoRA) adapters at inference time. vLLM supports serving requests with different LoRA adapters in the same batch, efficiently managing adapter weights in GPU and CPU memory with an LRU cache. This enables cost-effective deployment of many fine-tuned model variants without duplicating the full base model weights for each.
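The LRU behaviour mentioned above can be illustrated with a toy model. This is only a sketch of least-recently-used eviction in general, not vLLM's internal cache; the class name ToyAdapterCache and its methods are hypothetical.

```python
from collections import OrderedDict
from typing import Optional


class ToyAdapterCache:
    """Toy LRU cache keyed by adapter ID (illustrative, not a vLLM class)."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._slots = OrderedDict()  # adapter_id -> adapter name, oldest first

    def touch(self, adapter_id: int, name: str) -> Optional[int]:
        """Mark an adapter as used; return the evicted adapter ID, if any."""
        if adapter_id in self._slots:
            self._slots.move_to_end(adapter_id)  # now the most recently used
            return None
        evicted = None
        if len(self._slots) >= self.capacity:
            # Cache is full: drop the least recently used entry.
            evicted, _ = self._slots.popitem(last=False)
        self._slots[adapter_id] = name
        return evicted

    def resident(self) -> list:
        """Adapter IDs ordered from least to most recently used."""
        return list(self._slots)


cache = ToyAdapterCache(capacity=2)
cache.touch(1, "sql")
cache.touch(2, "code")
cache.touch(1, "sql")              # refresh adapter 1
evicted = cache.touch(3, "chat")   # capacity exceeded, adapter 2 is oldest
print(evicted, cache.resident())   # 2 [1, 3]
```

Adapter 2 is evicted rather than adapter 1 because adapter 1 was used more recently, which is the property that keeps frequently requested adapters resident.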

Usage

Execute this workflow when you have a base model with multiple LoRA adapters (e.g., task-specific fine-tunes) and need to serve them simultaneously. Typical scenarios include multi-tenant serving where different users have different fine-tuned models, A/B testing between adapter variants, and serving specialized adapters (SQL, code, chat) from a shared base.

Execution Steps

Step 1: Prepare LoRA Adapters

Obtain or train LoRA adapter checkpoints compatible with the base model. Each adapter consists of small low-rank weight matrices (typically <1% of the base model's size) stored as a PEFT checkpoint on local disk or on the HuggingFace Hub.
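As a rough check on the size claim: for one weight matrix of shape d_out x d_in, a rank-r LoRA adds two factors with r * (d_out + d_in) parameters in total. The 4096 x 4096 shape and rank 16 below are illustrative values, not taken from any particular model.

```python
def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


full = 4096 * 4096                          # parameters in the full matrix
lora = lora_param_count(4096, 4096, rank=16)
ratio = lora / full
print(lora, f"{ratio:.2%}")                 # 131072 0.78%
```

The ratio grows linearly with rank, which is why rank drives adapter memory usage.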

Key considerations:

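Adapters must be trained against the same base model that will be served; a matching architecture lets the weights load, but the adapter is only meaningful relative to the base weights it was trained on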
Adapter rank affects both quality and memory usage
Adapters can be hosted on HuggingFace Hub or stored locally
Verify adapter compatibility with the base model's layer structure
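Some of these checks can be scripted for adapters stored locally. The sketch below reads the adapter_config.json that PEFT writes next to the adapter weights and compares its peft_type, r, and base_model_name_or_path fields against what you intend to serve. The function names are hypothetical, and the base-model comparison is a plain string match, so a mismatch is a hint to investigate rather than proof of incompatibility.

```python
import json
from pathlib import Path


def read_adapter_config(adapter_dir: str) -> dict:
    """Load adapter_config.json from a local PEFT adapter directory."""
    config_path = Path(adapter_dir) / "adapter_config.json"
    with config_path.open(encoding="utf-8") as f:
        return json.load(f)


def check_adapter(adapter_dir: str, base_model: str, max_lora_rank: int) -> list:
    """Return human-readable problems; an empty list means none were found."""
    cfg = read_adapter_config(adapter_dir)
    problems = []

    if cfg.get("peft_type") != "LORA":
        problems.append(f"peft_type is {cfg.get('peft_type')!r}, expected 'LORA'")

    rank = cfg.get("r")  # PEFT stores the LoRA rank under the key "r"
    if not isinstance(rank, int):
        problems.append("missing integer field 'r' (the LoRA rank)")
    elif rank > max_lora_rank:
        problems.append(f"rank {rank} exceeds max_lora_rank {max_lora_rank}")

    trained_on = cfg.get("base_model_name_or_path")
    if trained_on != base_model:
        problems.append(f"trained on {trained_on!r}, serving {base_model!r}")

    return problems
```

Calling check_adapter on each adapter directory before starting the engine surfaces rank or base-model mismatches early instead of at first request.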

Step 2: Configure the Engine for LoRA

Initialize the LLM engine with LoRA support enabled. Configure the maximum number of concurrent LoRA adapters, the maximum supported rank, and the CPU cache size for adapter hot-swapping.

Key considerations:

enable_lora=True activates multi-LoRA support
max_loras controls how many adapters can be active in a single batch
max_lora_rank must be >= the highest rank among the adapters you plan to use
max_cpu_loras controls the CPU-side LRU cache for adapter swapping
Higher max_loras increases GPU memory usage due to preallocated slots
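The settings above can be assembled as in the following sketch. The helper names and the model ID "your-org/base-model" are placeholders. The guard that max_cpu_loras is not smaller than max_loras is a local sanity check written on the assumption that the CPU cache should hold at least every adapter that can be active on the GPU. Some vLLM versions accept only certain max_lora_rank values, so check the documentation for the version in use.

```python
def build_lora_engine_kwargs(model, adapter_ranks, max_loras, max_cpu_loras=None):
    """Derive LoRA-related engine arguments from the adapters to be served."""
    if not adapter_ranks:
        raise ValueError("adapter_ranks must not be empty")
    if max_loras < 1:
        raise ValueError("max_loras must be at least 1")
    if max_cpu_loras is None:
        max_cpu_loras = max_loras
    if max_cpu_loras < max_loras:
        # Local guard (assumption): the CPU cache should cover all GPU slots.
        raise ValueError("max_cpu_loras must be >= max_loras")
    return {
        "model": model,
        "enable_lora": True,
        "max_loras": max_loras,
        "max_lora_rank": max(adapter_ranks),  # must cover the largest adapter
        "max_cpu_loras": max_cpu_loras,
    }


def create_engine(kwargs):
    """Instantiate vLLM's LLM class; imported lazily because it needs vLLM and a GPU."""
    from vllm import LLM
    return LLM(**kwargs)


kwargs = build_lora_engine_kwargs(
    "your-org/base-model",      # placeholder model ID
    adapter_ranks=[8, 16, 64],
    max_loras=2,
    max_cpu_loras=8,
)
print(kwargs["max_lora_rank"], kwargs["max_cpu_loras"])  # 64 8
# llm = create_engine(kwargs)  # uncomment on a machine with vLLM installed
```

Keeping max_loras small and max_cpu_loras larger trades a little swap latency for lower preallocated GPU memory.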

Step 3: Create LoRA Requests

For each inference request, create a LoRARequest object specifying the adapter name, a unique numeric ID, and the path to the adapter weights. Requests without a LoRA adapter use the base model directly.

Key considerations:

Each unique LoRA adapter needs a unique positive integer ID (IDs start at 1)
The adapter path can be a local directory or HuggingFace repo ID
Requests can freely mix base-model and LoRA-adapted inferences
The same adapter ID must consistently map to the same adapter weights
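One way to keep IDs unique and stable is a small registry that hands out one ID per adapter name and refuses to rebind a name to a different path. AdapterRegistry and the /adapters/... paths are hypothetical; LoRARequest is imported lazily from vllm.lora.request and constructed positionally as (name, integer ID, path).

```python
class AdapterRegistry:
    """Assigns each adapter name one stable positive integer ID."""

    def __init__(self):
        self._entries = {}  # name -> (int_id, path)

    def register(self, name: str, path: str) -> int:
        """Return the ID for `name`, creating it on first use."""
        if name in self._entries:
            int_id, known_path = self._entries[name]
            if known_path != path:
                # The same ID must always map to the same weights.
                raise ValueError(f"{name!r} is already bound to {known_path!r}")
            return int_id
        int_id = len(self._entries) + 1  # IDs start at 1
        self._entries[name] = (int_id, path)
        return int_id

    def to_lora_request(self, name: str):
        """Build a vLLM LoRARequest for a registered adapter (requires vLLM)."""
        from vllm.lora.request import LoRARequest
        int_id, path = self._entries[name]
        return LoRARequest(name, int_id, path)


registry = AdapterRegistry()
sql_id = registry.register("sql", "/adapters/sql")     # hypothetical path
code_id = registry.register("code", "/adapters/code")  # hypothetical path
again = registry.register("sql", "/adapters/sql")      # same name, same ID
print(sql_id, code_id, again)                          # 1 2 1
```

Requests for the base model simply pass no LoRARequest, so they never touch the registry.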

Step 4: Submit Mixed Requests

Submit inference requests to the engine, each optionally tagged with a LoRARequest. The scheduler can place requests for different adapters in the same batch, up to max_loras distinct adapters at a time, and manages adapter loading/unloading transparently.

Key considerations:

Requests with different adapters can be submitted in any order
The engine handles adapter hot-swapping via the LRU cache
When more adapters are needed than max_loras allows, the least recently used adapter gives up its GPU slot and remains in the CPU cache, subject to max_cpu_loras
Throughput may vary because adapter switching adds overhead
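A minimal driver loop for mixed submission might look like the sketch below. It assumes an engine object exposing add_request, has_unfinished_requests, and step in the style of vLLM's LLMEngine; run_mixed_requests is a hypothetical helper, and a lora_request of None means the base model.

```python
def run_mixed_requests(engine, requests, sampling_params):
    """Submit (prompt, lora_request_or_None) pairs and collect generated text.

    Returns a dict mapping request ID (the submission index as a string)
    to the text of the first completion.
    """
    for idx, (prompt, lora_request) in enumerate(requests):
        engine.add_request(
            str(idx),
            prompt,
            sampling_params,
            lora_request=lora_request,  # None routes to the base model
        )

    results = {}
    while engine.has_unfinished_requests():
        for output in engine.step():
            if output.finished:
                results[output.request_id] = output.outputs[0].text
    return results
```

Because results are keyed by request ID, the order in which requests finish does not matter to the caller.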

Step 5: Process Outputs

Collect and process outputs from each request. Each output is associated with its original LoRA adapter (or base model), enabling routing of results back to the appropriate context.

Key considerations:

LLM.generate returns outputs in the same order as the submitted prompts; when stepping the engine manually, match outputs to requests by request ID
The LoRA adapter used for each output can be tracked via request metadata
Quality of outputs depends on the adapter training quality
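Routing results back to their context can be done with a mapping recorded at submission time. The sketch below assumes each output exposes request_id and outputs[0].text, and that the caller kept a dict from request ID to adapter name (None for the base model); group_by_adapter is a hypothetical helper.

```python
from collections import defaultdict


def group_by_adapter(outputs, adapter_for_request):
    """Group generated texts by the adapter that served each request.

    outputs: iterable of objects with .request_id and .outputs[0].text
    adapter_for_request: dict of request_id -> adapter name, or None for base
    """
    grouped = defaultdict(list)
    for output in outputs:
        adapter = adapter_for_request.get(output.request_id) or "base"
        grouped[adapter].append(output.outputs[0].text)
    return dict(grouped)
```

Keeping the mapping on the caller's side avoids relying on any particular metadata field being populated on the output object.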
