Workflow:Vllm project Vllm Multi LoRA Serving

From Leeroopedia


Knowledge Sources
Domains: LLMs, Inference, LoRA, Fine_Tuning
Last Updated: 2026-02-08 13:00 GMT

Overview

End-to-end process for serving multiple LoRA adapters concurrently on a single base model using vLLM's multi-LoRA inference engine.

Description

This workflow covers loading a base LLM and dynamically applying multiple Low-Rank Adaptation (LoRA) adapters at inference time. vLLM supports serving requests with different LoRA adapters in the same batch, efficiently managing adapter weights in GPU and CPU memory with an LRU cache. This enables cost-effective deployment of many fine-tuned model variants without duplicating the full base model weights for each.

Usage

Execute this workflow when you have a base model with multiple LoRA adapters (e.g., task-specific fine-tunes) and need to serve them simultaneously. Typical scenarios include multi-tenant serving where different users have different fine-tuned models, A/B testing between adapter variants, and serving specialized adapters (SQL, code, chat) from a shared base.

Execution Steps

Step 1: Prepare LoRA Adapters

Obtain or train LoRA adapter checkpoints compatible with the base model. Each adapter consists of small weight matrices (typically <1% of base model size) stored as a PEFT checkpoint on disk or HuggingFace Hub.
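To make the size claim concrete, the following sketch counts the parameters a LoRA adapter adds. The model shape (hidden size, layer count, two adapted square matrices per layer, 7B base parameters) is a hypothetical example chosen for illustration, not a property of any specific model.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

# Hypothetical model shape, used only for illustration.
HIDDEN = 4096
NUM_LAYERS = 32
ADAPTED_PER_LAYER = 2          # e.g. two square attention projections per layer
BASE_PARAMS = 7_000_000_000

def adapter_params(rank: int) -> int:
    """Total adapter parameters for the hypothetical shape above."""
    per_matrix = lora_param_count(HIDDEN, HIDDEN, rank)
    return per_matrix * ADAPTED_PER_LAYER * NUM_LAYERS

for rank in (8, 16, 64):
    n = adapter_params(rank)
    print(f"rank={rank:>3}: {n:,} params ({100 * n / BASE_PARAMS:.3f}% of base)")
```

Under these assumptions the adapter size grows linearly with rank, which is why rank drives adapter memory usage.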

Key considerations:

  • Adapters must be trained from the same base model (same architecture and layer shapes) that the engine loads
  • Adapter rank affects both quality and memory usage
  • Adapters can be hosted on HuggingFace Hub or stored locally
  • Verify adapter compatibility with the base model's layer structure
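The checks above can be partly automated for local PEFT checkpoints, which store their settings in adapter_config.json. This is a minimal sketch: the helper names are hypothetical, and comparing base_model_name_or_path as a string is only a heuristic, since the same model can be referenced by different IDs or paths.

```python
import json
from pathlib import Path

def read_adapter_info(adapter_dir: str) -> dict:
    """Read the fields of a local PEFT LoRA checkpoint that matter for serving."""
    config = json.loads((Path(adapter_dir) / "adapter_config.json").read_text())
    return {
        "base_model": config.get("base_model_name_or_path"),
        "rank": config.get("r"),
        "target_modules": config.get("target_modules"),
    }

def check_adapter(adapter_dir: str, expected_base: str, max_rank: int) -> list:
    """Return a list of problems found; an empty list means none were detected."""
    info = read_adapter_info(adapter_dir)
    problems = []
    if info["base_model"] != expected_base:
        problems.append(f"trained on {info['base_model']!r}, expected {expected_base!r}")
    if info["rank"] is None:
        problems.append("adapter_config.json has no rank field 'r'")
    elif info["rank"] > max_rank:
        problems.append(f"rank {info['rank']} exceeds limit {max_rank}")
    return problems
```

An empty result does not prove compatibility; it only rules out the two mismatches checked here.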

Step 2: Configure the Engine for LoRA

Initialize the LLM engine with LoRA support enabled. Configure the maximum number of concurrent LoRA adapters, the maximum supported rank, and the CPU cache size for adapter hot-swapping.

Key considerations:

  • enable_lora=True activates multi-LoRA support
  • max_loras controls how many adapters can be active in a single batch
  • max_lora_rank must be >= the highest rank among the adapters you plan to serve
  • max_cpu_loras sets the size of the CPU-side LRU cache used for adapter swapping and must be >= max_loras
  • Higher max_loras increases GPU memory usage due to preallocated slots
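A configuration along these lines might look as follows. The four LoRA arguments are the ones named above; the helper functions, the example ranks, and the placeholder model ID are assumptions for illustration. vLLM is imported lazily so the argument builder can be checked without a GPU.

```python
def lora_engine_kwargs(adapter_ranks: list, max_loras: int, max_cpu_loras: int) -> dict:
    """Build the LoRA-related keyword arguments from the adapters to be served."""
    if max_loras < 1:
        raise ValueError("max_loras must be at least 1")
    if max_cpu_loras < max_loras:
        # The CPU cache has to hold at least every adapter that is resident on the GPU.
        raise ValueError("max_cpu_loras must be >= max_loras")
    return {
        "enable_lora": True,
        "max_loras": max_loras,
        "max_lora_rank": max(adapter_ranks),   # covers the largest adapter
        "max_cpu_loras": max_cpu_loras,
    }

def build_llm(base_model: str, **lora_kwargs):
    """Create the engine; requires vLLM to be installed and a supported GPU."""
    from vllm import LLM
    return LLM(model=base_model, **lora_kwargs)

if __name__ == "__main__":
    kwargs = lora_engine_kwargs(adapter_ranks=[8, 16, 16], max_loras=2, max_cpu_loras=8)
    print(kwargs)
    # llm = build_llm("<base-model-id>", **kwargs)   # placeholder model ID
```

Some vLLM versions accept only particular values for max_lora_rank, so the computed maximum may need rounding up to a supported value; check the documentation for the installed version.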

Step 3: Create LoRA Requests

For each inference request, create a LoRARequest object specifying the adapter name, a unique numeric ID, and the path to the adapter weights. Requests without a LoRA adapter use the base model directly.

Key considerations:

  • Each unique LoRA adapter needs a unique positive integer ID (IDs start at 1)
  • The adapter path can be a local directory or HuggingFace repo ID
  • Requests can freely mix base-model and LoRA-adapted inferences
  • The same adapter ID must consistently map to the same adapter weights
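One way to keep IDs unique and stable is a small registry that hands out IDs and builds requests on demand. AdapterRegistry is a hypothetical helper, not part of vLLM; only the LoRARequest(name, id, path) call comes from the library, and it is imported lazily so the bookkeeping works without vLLM installed.

```python
from typing import Optional

class AdapterRegistry:
    """Assigns each adapter name a stable positive integer ID and remembers its path."""

    def __init__(self) -> None:
        self._entries = {}   # name -> (adapter_id, path)

    def register(self, name: str, path: str) -> int:
        """Register an adapter and return its ID; re-registering is idempotent."""
        if name in self._entries:
            existing_id, existing_path = self._entries[name]
            if existing_path != path:
                raise ValueError(f"{name!r} is already registered with a different path")
            return existing_id
        new_id = len(self._entries) + 1   # IDs start at 1
        self._entries[name] = (new_id, path)
        return new_id

    def request_for(self, name: Optional[str]):
        """Return a LoRARequest for `name`, or None to run on the base model."""
        if name is None:
            return None
        from vllm.lora.request import LoRARequest   # needs vLLM installed
        adapter_id, path = self._entries[name]
        return LoRARequest(name, adapter_id, path)
```

Rejecting a second path for an existing name enforces the rule that one ID always maps to the same weights.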

Step 4: Submit Mixed Requests

Submit inference requests to the engine, each optionally tagged with a LoRARequest. The scheduler can place requests that use different adapters (up to max_loras distinct adapters) in the same batch, and it loads and unloads adapter weights transparently.
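The submission loop can be sketched as below, assuming an engine object that exposes add_request, step, and has_unfinished_requests in the style of vLLM's LLMEngine. The run_mixed helper and the job tuple layout are illustrative choices, not library API.

```python
def run_mixed(engine, jobs):
    """Feed (prompt, sampling_params, lora_request_or_None) jobs to the engine and drain it."""
    pending = list(jobs)
    finished = []
    request_id = 0
    while pending or engine.has_unfinished_requests():
        if pending:
            prompt, sampling_params, lora_request = pending.pop(0)
            # lora_request=None means the request runs on the base model.
            engine.add_request(str(request_id), prompt, sampling_params,
                               lora_request=lora_request)
            request_id += 1
        # Each step advances all scheduled requests by one iteration.
        for output in engine.step():
            if output.finished:
                finished.append(output)
    return finished
```

Adding one request per step lets new work join while earlier requests are still decoding; results are collected in completion order, so callers should match them by request ID.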

Key considerations:

  • Requests with different adapters can be submitted in any order
  • The engine handles adapter hot-swapping via the LRU cache
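  • When more than max_loras adapters are needed, the least recently used adapters are evicted from their GPU slots; copies remain in the CPU cache, up to max_cpu_loras
  • Throughput can drop when the set of adapters in use exceeds max_loras, because each swap has to load adapter weights into a GPU slot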
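The eviction behaviour can be illustrated with a toy model of the GPU slots. This is a simplified sketch of least-recently-used eviction in general, not vLLM's internal cache; the class and method names are hypothetical.

```python
from collections import OrderedDict

class GpuSlotLRU:
    """Toy model of a fixed number of adapter slots with least-recently-used eviction."""

    def __init__(self, max_loras: int) -> None:
        self.max_loras = max_loras
        self._slots = OrderedDict()   # adapter_id -> None, oldest first

    def activate(self, adapter_id: int):
        """Mark an adapter as used; return the evicted adapter ID, or None."""
        if adapter_id in self._slots:
            self._slots.move_to_end(adapter_id)   # already resident: refresh recency
            return None
        evicted = None
        if len(self._slots) >= self.max_loras:
            evicted, _ = self._slots.popitem(last=False)   # drop the oldest entry
        self._slots[adapter_id] = None
        return evicted

    def resident(self) -> list:
        """Adapter IDs currently holding a slot, least recently used first."""
        return list(self._slots)

slots = GpuSlotLRU(max_loras=2)
for adapter_id in (1, 2, 1, 3):
    evicted = slots.activate(adapter_id)
    print(f"use {adapter_id}: evicted={evicted}, resident={slots.resident()}")
```

With two slots, reusing adapter 1 before adapter 3 arrives protects it, so adapter 2 is the one evicted.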

Step 5: Process Outputs

Collect and process outputs from each request. Each output is associated with its original LoRA adapter (or base model), enabling routing of results back to the appropriate context.

Key considerations:

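  • With the offline LLM.generate API, outputs are returned in the same order as the submitted prompts; when stepping the engine directly, requests finish at different times, so match outputs by request ID
  • The LoRA adapter used for each output can be tracked via the lora_request field carried on each request output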
  • Quality of outputs depends on the adapter training quality
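Routing results back by adapter can be sketched as follows, assuming each output object carries a lora_request attribute (None for base-model requests) and a list of completions with a text field, as vLLM's RequestOutput does. The group_by_adapter helper and the "base" label are illustrative.

```python
from collections import defaultdict

def group_by_adapter(outputs):
    """Map adapter name (or "base") to the generated texts produced with it."""
    grouped = defaultdict(list)
    for output in outputs:
        lora = getattr(output, "lora_request", None)
        key = lora.lora_name if lora is not None else "base"
        for completion in output.outputs:
            grouped[key].append(completion.text)
    return dict(grouped)
```

Grouping by adapter name rather than by position keeps the routing correct even when outputs arrive out of submission order.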
