Workflow:Vllm project Vllm Multi LoRA Serving
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Inference, LoRA, Fine_Tuning |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
End-to-end process for serving multiple LoRA adapters concurrently on a single base model using vLLM's multi-LoRA inference engine.
Description
This workflow covers loading a base LLM and dynamically applying multiple Low-Rank Adaptation (LoRA) adapters at inference time. vLLM supports serving requests with different LoRA adapters in the same batch, efficiently managing adapter weights in GPU and CPU memory with an LRU cache. This enables cost-effective deployment of many fine-tuned model variants without duplicating the full base model weights for each.
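The LRU behaviour mentioned above can be pictured with a toy model. This is only an illustration of least-recently-used eviction, not vLLM's actual cache; the class name `ToyAdapterCache` and the adapter names are hypothetical.

```python
from collections import OrderedDict


class ToyAdapterCache:
    """Toy LRU cache: holds at most `capacity` adapters, evicting the least recently used."""

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._slots = OrderedDict()  # adapter name -> placeholder for weights
        self.evicted = []            # eviction history, oldest first

    def activate(self, name: str) -> bool:
        """Mark `name` as used. Return True on a cache hit, False if it had to be loaded."""
        if name in self._slots:
            self._slots.move_to_end(name)  # now the most recently used entry
            return True
        if len(self._slots) >= self.capacity:
            victim, _ = self._slots.popitem(last=False)  # drop the least recently used
            self.evicted.append(victim)
        self._slots[name] = object()
        return False

    def resident(self):
        """Adapter names currently cached, least recently used first."""
        return list(self._slots)


cache = ToyAdapterCache(capacity=2)
hits = [cache.activate(name) for name in ["sql", "code", "sql", "chat", "code"]]
```

With a capacity of two, the third request hits the cache, while the last two each force an eviction. Real deployments see the same pattern when more adapters are in rotation than there are slots.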
Usage
Execute this workflow when you have a base model with multiple LoRA adapters (e.g., task-specific fine-tunes) and need to serve them simultaneously. Typical scenarios include multi-tenant serving where different users have different fine-tuned models, A/B testing between adapter variants, and serving specialized adapters (SQL, code, chat) from a shared base.
Execution Steps
Step 1: Prepare LoRA Adapters
Obtain or train LoRA adapter checkpoints compatible with the base model. Each adapter consists of small weight matrices (typically <1% of base model size) stored as a PEFT checkpoint on disk or HuggingFace Hub.
Key considerations:
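- Adapters must be trained against the same base model that the engine loads; a matching architecture alone is not enough, because LoRA weights are deltas relative to specific base weights
- Adapter rank affects both quality and memory usage: the adapter's parameter count grows linearly with rank
- Adapters can be hosted on HuggingFace Hub or stored locally
- Verify that the modules the adapter targets (e.g., attention projections) exist in the base model's layer structure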
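The checks above can be partly automated for a local PEFT checkpoint by reading its `adapter_config.json`. The following is a minimal sketch: the keys `peft_type`, `base_model_name_or_path`, and `r` are the ones PEFT writes for LoRA adapters, while the function names and the `expected_base` argument are hypothetical.

```python
import json
from pathlib import Path


def read_adapter_config(adapter_dir):
    """Load adapter_config.json from a local PEFT adapter directory."""
    config_path = Path(adapter_dir) / "adapter_config.json"
    with config_path.open(encoding="utf-8") as f:
        return json.load(f)


def check_adapter(adapter_dir, expected_base, max_lora_rank):
    """Return a list of problems found; an empty list means none were detected."""
    config = read_adapter_config(adapter_dir)
    problems = []

    peft_type = config.get("peft_type")
    if peft_type != "LORA":
        problems.append(f"peft_type is {peft_type!r}, expected 'LORA'")

    # A mismatch is a warning sign rather than proof: the recorded value may be
    # a local path that points at the same base model.
    base = config.get("base_model_name_or_path")
    if base != expected_base:
        problems.append(f"adapter was trained on {base!r}, not {expected_base!r}")

    rank = config.get("r")
    if not isinstance(rank, int) or rank > max_lora_rank:
        problems.append(f"rank {rank!r} exceeds max_lora_rank={max_lora_rank}")

    return problems
```

Running `check_adapter` over every adapter directory before starting the engine surfaces rank and base-model mismatches early, instead of at the first request that uses the adapter.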
Step 2: Configure the Engine for LoRA
Initialize the LLM engine with LoRA support enabled. Configure the maximum number of concurrent LoRA adapters, the maximum supported rank, and the CPU cache size for adapter hot-swapping.
Key considerations:
- enable_lora=True activates multi-LoRA support
- max_loras controls how many adapters can be active in a single batch
- max_lora_rank must be >= the largest rank among the adapters you plan to use
- max_cpu_loras sets the size of the CPU-side LRU cache for adapter swapping and must be >= max_loras
- Higher max_loras increases GPU memory usage due to preallocated slots
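The settings above could be derived from the adapters' ranks roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, and the `SUPPORTED_MAX_RANKS` tuple reflects values that vLLM releases have accepted for `max_lora_rank`, which may differ in your version. vLLM is imported lazily so the helper can be used without a GPU environment.

```python
# Assumption: accepted max_lora_rank values; check the vLLM version you run.
SUPPORTED_MAX_RANKS = (8, 16, 32, 64, 128, 256, 320, 512)


def lora_engine_kwargs(adapter_ranks, max_loras, max_cpu_loras=None):
    """Build LoRA-related engine keyword arguments from the adapters' ranks."""
    if not adapter_ranks:
        raise ValueError("at least one adapter rank is required")
    if max_loras < 1:
        raise ValueError("max_loras must be at least 1")

    # Pick the smallest supported rank that covers the largest adapter.
    largest = max(adapter_ranks)
    candidates = [r for r in SUPPORTED_MAX_RANKS if r >= largest]
    if not candidates:
        raise ValueError(f"adapter rank {largest} exceeds every supported rank")

    # The CPU cache must be able to hold at least the adapters active on the GPU.
    if max_cpu_loras is None:
        max_cpu_loras = max_loras
    if max_cpu_loras < max_loras:
        raise ValueError("max_cpu_loras must be >= max_loras")

    return {
        "enable_lora": True,
        "max_loras": max_loras,
        "max_lora_rank": candidates[0],
        "max_cpu_loras": max_cpu_loras,
    }


def build_llm(model, **lora_kwargs):
    """Create a vLLM LLM with the given LoRA settings (requires vLLM to be installed)."""
    from vllm import LLM  # imported here so the helper above works without vLLM
    return LLM(model=model, **lora_kwargs)


kwargs = lora_engine_kwargs(adapter_ranks=[8, 24], max_loras=2, max_cpu_loras=4)
```

Here the largest adapter rank is 24, so the sketch rounds `max_lora_rank` up to 32. Keeping this value as small as the adapters allow limits the GPU memory preallocated per slot.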
Step 3: Create LoRA Requests
For each inference request, create a LoRARequest object specifying the adapter name, a unique positive integer ID, and the path to the adapter weights. Requests without a LoRARequest use the base model directly.
Key considerations:
- Each distinct LoRA adapter needs its own positive integer ID (1 or greater)
- The adapter path can be a local directory or HuggingFace repo ID
- Requests can freely mix base-model and LoRA-adapted inferences
- The same adapter ID must consistently map to the same adapter weights
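One way to keep IDs unique and stable is a small registry that hands out IDs and builds the request objects. This is a sketch: `AdapterRegistry` and the adapter names and paths are hypothetical, while `LoRARequest` and its `lora_name`, `lora_int_id`, and `lora_path` fields come from vLLM.

```python
class AdapterRegistry:
    """Assigns each adapter name a stable positive integer ID."""

    def __init__(self):
        self._entries = {}  # adapter name -> (integer ID, path)

    def register(self, name, path):
        """Register an adapter and return its ID; re-registering the same pair is a no-op."""
        if name in self._entries:
            existing_id, existing_path = self._entries[name]
            if existing_path != path:
                raise ValueError(f"{name!r} is already registered with a different path")
            return existing_id
        new_id = len(self._entries) + 1  # IDs start at 1
        self._entries[name] = (new_id, path)
        return new_id

    def lora_request(self, name):
        """Build a vLLM LoRARequest for a registered adapter (requires vLLM)."""
        from vllm.lora.request import LoRARequest  # lazy import
        lora_id, path = self._entries[name]
        return LoRARequest(lora_name=name, lora_int_id=lora_id, lora_path=path)


registry = AdapterRegistry()
sql_id = registry.register("sql-lora", "/adapters/sql")    # hypothetical path
code_id = registry.register("code-lora", "/adapters/code")  # hypothetical path
```

Refusing to rebind a name to a different path guards against the same ID silently pointing at different weights.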
Step 4: Submit Mixed Requests
Submit inference requests to the engine, each optionally tagged with a LoRARequest. The engine scheduler can place requests that use different adapters in the same batch, up to max_loras distinct adapters at a time, and manages adapter loading/unloading transparently.
Key considerations:
- Requests with different adapters can be submitted in any order
- The engine handles adapter hot-swapping via the LRU cache
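- When a batch needs an adapter and all max_loras GPU slots are occupied, the least recently used adapter is evicted from its slot; it can stay in the CPU cache (up to max_cpu_loras adapters) for faster reloading
- Throughput can drop when more adapters are in rotation than max_loras allows, because swapping adapters between CPU and GPU adds overhead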
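A submission loop in the style of vLLM's engine API might look like the sketch below. It relies on `add_request`, `step`, and `has_unfinished_requests` as found on `LLMEngine`; the function names and the shape of `requests` are assumptions made for illustration.

```python
def build_engine(model, **lora_kwargs):
    """Create an LLMEngine with LoRA settings (requires vLLM to be installed)."""
    from vllm import EngineArgs, LLMEngine  # lazy import
    return LLMEngine.from_engine_args(EngineArgs(model=model, **lora_kwargs))


def run_to_completion(engine, requests, sampling_params):
    """Submit (prompt, lora_request_or_None) pairs and step the engine until all finish.

    Returns the finished outputs in completion order, plus a map from
    request ID to adapter name (None for base-model requests).
    """
    adapter_by_request_id = {}
    for index, (prompt, lora_request) in enumerate(requests):
        request_id = str(index)
        engine.add_request(request_id, prompt, sampling_params,
                           lora_request=lora_request)
        adapter_by_request_id[request_id] = (
            lora_request.lora_name if lora_request is not None else None
        )

    finished = []
    while engine.has_unfinished_requests():
        for output in engine.step():
            if output.finished:
                finished.append(output)
    return finished, adapter_by_request_id
```

Because outputs are collected as they finish, completion order can differ from submission order. The returned map is what ties each output back to its adapter.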
Step 5: Process Outputs
Collect and process outputs from each request. Each output is associated with its original LoRA adapter (or base model), enabling routing of results back to the appropriate context.
Key considerations:
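- With the offline LLM.generate call, outputs are returned in the same order as the submitted prompts; when stepping the engine directly, outputs arrive as requests finish and should be matched by request ID
- The LoRA adapter used for each output can be tracked by recording a request ID to adapter mapping at submission time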
- Quality of outputs depends on the adapter training quality
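Routing results back by adapter can be sketched as follows, assuming the caller recorded a map from request ID to adapter name at submission time and used numeric strings as request IDs. The function name and the `"base"` label are hypothetical; `request_id` and `outputs[0].text` are fields of vLLM's request outputs.

```python
from collections import defaultdict


def group_by_adapter(outputs, adapter_by_request_id, base_label="base"):
    """Group generated texts by adapter name, restoring submission order within each group."""
    grouped = defaultdict(list)
    # Assumption: request IDs are numeric strings assigned in submission order.
    for output in sorted(outputs, key=lambda o: int(o.request_id)):
        adapter = adapter_by_request_id.get(output.request_id) or base_label
        grouped[adapter].append(output.outputs[0].text)
    return dict(grouped)
```

Requests that ran without an adapter fall under the `base_label` key, so base-model and adapter results can be handled by the same downstream code.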