Implementation:Vllm project Vllm EngineArgs LoRA Config
| Knowledge Sources | |
|---|---|
| Domains | LLM Serving, Model Adaptation, Engine Configuration |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
Concrete tool, provided by vLLM, for configuring a vLLM inference engine with LoRA adapter support.
Description
The EngineArgs dataclass collects all configuration parameters for the vLLM engine, including LoRA-specific fields defined at lines 481-490 of vllm/engine/arg_utils.py. Setting enable_lora=True activates LoRA support, which causes the engine to allocate adapter slots and prepare the LoRA weight management infrastructure. The configured EngineArgs instance is then passed to LLMEngine.from_engine_args() (defined at lines 155-181 of vllm/v1/engine/llm_engine.py) to construct the engine with LoRA capabilities.
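The slot allocation described above can be pictured as a two-tier cache. At most max_loras adapters are resident in GPU slots, backed by a CPU cache that holds at most max_cpu_loras adapters. The sketch below is a simplified toy model that assumes LRU eviction in both tiers. The class `TwoTierAdapterCache` and its `activate` method are hypothetical and are not vLLM's internal LoRA manager classes.

```python
from __future__ import annotations

from collections import OrderedDict


class TwoTierAdapterCache:
    """Toy model of GPU adapter slots backed by a larger CPU cache (hypothetical)."""

    def __init__(self, max_loras: int, max_cpu_loras: int) -> None:
        if max_cpu_loras < max_loras:
            raise ValueError("max_cpu_loras must be >= max_loras")
        self.max_loras = max_loras
        self.max_cpu_loras = max_cpu_loras
        # OrderedDicts used as LRU lists: least recently used entry first.
        self.gpu: OrderedDict[str, None] = OrderedDict()
        self.cpu: OrderedDict[str, None] = OrderedDict()
        self.disk_loads = 0  # adapters that had to be loaded from storage

    def activate(self, name: str) -> None:
        """Make `name` resident in a GPU slot, loading or evicting as needed."""
        # CPU tier: load from storage on a miss, evicting the LRU adapter if full.
        if name in self.cpu:
            self.cpu.move_to_end(name)
        else:
            self.disk_loads += 1
            if len(self.cpu) >= self.max_cpu_loras:
                evicted, _ = self.cpu.popitem(last=False)
                self.gpu.pop(evicted, None)  # keep GPU slots a subset of the CPU cache
            self.cpu[name] = None
        # GPU tier: copy into a free slot, evicting the LRU slot if all are taken.
        if name in self.gpu:
            self.gpu.move_to_end(name)
        else:
            if len(self.gpu) >= self.max_loras:
                self.gpu.popitem(last=False)
            self.gpu[name] = None


# Mirrors the first usage example: one GPU slot, two CPU-cached adapters.
cache = TwoTierAdapterCache(max_loras=1, max_cpu_loras=2)
for adapter in ["sql", "chat", "sql", "chat"]:
    cache.activate(adapter)
print(list(cache.gpu), list(cache.cpu), cache.disk_loads)
# ['chat'] ['sql', 'chat'] 2
```

In this model, alternating between two adapters with max_cpu_loras=2 costs two storage loads, whereas max_cpu_loras=1 would reload from storage on every switch (four loads).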
The LoRA-specific parameters in EngineArgs draw their defaults from LoRAConfig (defined in vllm/config/lora.py), which validates constraints such as max_cpu_loras >= max_loras and restricts max_lora_rank to specific allowed values.
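The defaulting and validation rules can be restated as a small standalone function. This is a minimal sketch that covers only the constraints named above (max_cpu_loras defaults to max_loras and must be at least max_loras; max_lora_rank must come from the allowed set listed in the Inputs table). The helper `resolve_lora_params` is hypothetical and is not part of vLLM's API.

```python
from __future__ import annotations

# Allowed max_lora_rank values as listed in the Inputs table.
ALLOWED_MAX_LORA_RANKS = (1, 8, 16, 32, 64, 128, 256, 320, 512)


def resolve_lora_params(
    max_loras: int = 1,
    max_lora_rank: int = 16,
    max_cpu_loras: int | None = None,
) -> dict[str, int]:
    """Apply the documented defaults and constraints; raise ValueError on bad input."""
    if max_lora_rank not in ALLOWED_MAX_LORA_RANKS:
        raise ValueError(
            f"max_lora_rank ({max_lora_rank}) must be one of {ALLOWED_MAX_LORA_RANKS}"
        )
    # None means "cache exactly as many adapters on CPU as there are GPU slots".
    if max_cpu_loras is None:
        max_cpu_loras = max_loras
    elif max_cpu_loras < max_loras:
        raise ValueError(
            f"max_cpu_loras ({max_cpu_loras}) must be >= max_loras ({max_loras})"
        )
    return {
        "max_loras": max_loras,
        "max_lora_rank": max_lora_rank,
        "max_cpu_loras": max_cpu_loras,
    }


print(resolve_lora_params(max_loras=4, max_lora_rank=32))
# {'max_loras': 4, 'max_lora_rank': 32, 'max_cpu_loras': 4}
```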
Usage
Use this API when initializing a vLLM engine for multi-LoRA serving. Create an EngineArgs instance with enable_lora=True and the desired LoRA parameters, then call LLMEngine.from_engine_args() to construct the engine.
Code Reference
Source Location
- Repository: vllm
- File: vllm/engine/arg_utils.py (lines 481-490 for LoRA params)
- File: vllm/v1/engine/llm_engine.py (lines 155-181 for from_engine_args)
Signature
```python
# EngineArgs construction with LoRA-related parameters
EngineArgs(
    model: str,
    enable_lora: bool = False,
    max_loras: int = 1,
    max_lora_rank: int = 16,
    max_cpu_loras: int | None = None,
    lora_dtype: str | torch.dtype | None = "auto",
    fully_sharded_loras: bool = False,
    max_num_seqs: int = 256,
    # ... additional non-LoRA parameters omitted
)

# Engine construction from args
LLMEngine.from_engine_args(
    engine_args: EngineArgs,
    usage_context: UsageContext = UsageContext.ENGINE_CONTEXT,
    stat_loggers: list[StatLoggerFactory] | None = None,
    enable_multiprocessing: bool = False,
) -> LLMEngine
```
Import
```python
from vllm.engine.arg_utils import EngineArgs
from vllm.v1.engine.llm_engine import LLMEngine
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | str | Yes | HuggingFace model ID or local path to the base model (e.g., "meta-llama/Llama-3.2-3B-Instruct") |
| enable_lora | bool | Yes (must be set to True) | Activates LoRA adapter support in the engine. Defaults to False, so it must be set to True explicitly for multi-LoRA serving. |
| max_loras | int | No | Maximum number of LoRA adapters that can be active in a single batch. Default: 1. Higher values increase GPU memory usage. |
| max_lora_rank | int | No | Maximum supported rank for all LoRA adapters. Allowed values: 1, 8, 16, 32, 64, 128, 256, 320, 512. Default: 16. |
| max_cpu_loras | int or None | No | Maximum number of LoRA adapters cached in CPU memory. Must be >= max_loras. Default: None (set equal to max_loras). |
| lora_dtype | str, torch.dtype, or None | No | Data type for LoRA computations. "auto" uses the base model dtype. Default: "auto". |
| fully_sharded_loras | bool | No | Enable fully sharded LoRA computation across tensor-parallel ranks. Default: False. |
| max_num_seqs | int | No | Maximum number of sequences per iteration. Default: 256. |
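The note that higher max_loras values increase GPU memory usage can be made concrete with a back-of-envelope estimate. This sketch assumes that, for every LoRA-enabled linear layer, the engine reserves one A matrix (rank × in_features) and one B matrix (out_features × rank) per GPU slot, sized at max_lora_rank; under that assumption the reservation grows linearly in both max_loras and max_lora_rank. The helper `estimate_lora_slot_bytes` and the layer shapes are hypothetical illustrations, not values read from vLLM or from a specific model.

```python
from __future__ import annotations


def estimate_lora_slot_bytes(
    max_loras: int,
    max_lora_rank: int,
    layer_shapes: list[tuple[int, int]],
    dtype_bytes: int = 2,
) -> int:
    """Rough bytes reserved for LoRA slots under the assumption in the lead-in.

    layer_shapes holds (in_features, out_features) per LoRA-enabled linear layer.
    """
    # Elements of A (rank x in) plus B (out x rank), summed over all layers.
    per_slot = sum(max_lora_rank * (in_f + out_f) for in_f, out_f in layer_shapes)
    return max_loras * per_slot * dtype_bytes


# Illustrative shapes only: 28 layers, each with two 3072x3072 projections.
shapes = [(3072, 3072)] * (2 * 28)
small = estimate_lora_slot_bytes(max_loras=1, max_lora_rank=8, layer_shapes=shapes)
large = estimate_lora_slot_bytes(max_loras=4, max_lora_rank=32, layer_shapes=shapes)
# 4x the slots times 4x the rank gives 16x the reserved bytes.
print(small, large, large // small)
```

Comparing the two usage examples below under this model, moving from max_loras=1, max_lora_rank=8 to max_loras=4, max_lora_rank=32 multiplies the reservation by 16, independent of the layer shapes chosen.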
Outputs
| Name | Type | Description |
|---|---|---|
| engine | LLMEngine | A fully initialized LLM engine with LoRA support enabled, ready to accept requests with per-request LoRA adapters |
Usage Examples
Initialize Engine for Multi-LoRA Serving
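```python
from vllm import EngineArgs, LLMEngine

# One adapter slot on the GPU, two adapters cached in CPU memory.
engine_args = EngineArgs(
    model="meta-llama/Llama-3.2-3B-Instruct",
    enable_lora=True,
    max_loras=1,
    max_lora_rank=8,    # every adapter served must have rank <= 8
    max_cpu_loras=2,    # must be >= max_loras
    max_num_seqs=256,
)
engine = LLMEngine.from_engine_args(engine_args)
```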
High-Throughput Multi-Adapter Configuration
```python
from vllm import EngineArgs, LLMEngine

# Up to 4 distinct adapters per batch; up to 16 adapters cached in CPU memory.
engine_args = EngineArgs(
    model="meta-llama/Llama-3.2-3B-Instruct",
    enable_lora=True,
    max_loras=4,
    max_lora_rank=32,
    max_cpu_loras=16,
    # Fully shards LoRA computation across tensor-parallel ranks; this only
    # matters when the engine runs with tensor parallelism.
    fully_sharded_loras=True,
)
engine = LLMEngine.from_engine_args(engine_args)
```
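The max_loras=4 setting in the high-throughput example caps how many distinct adapters can share a single batch, while requests that reuse an adapter already in the batch, or that use only the base model, do not consume an extra slot. The sketch below illustrates that constraint with a simple greedy, first-come grouping. The function `pack_batches` is hypothetical and is not vLLM's scheduler, which also accounts for token budgets and other limits.

```python
from __future__ import annotations


def pack_batches(
    requests: list[tuple[str, str | None]],
    max_loras: int,
    max_num_seqs: int,
) -> list[list[str]]:
    """Greedily group (request_id, adapter_name) pairs into batches.

    Each batch holds at most max_num_seqs requests and at most max_loras
    distinct adapters; adapter_name None means "base model only".
    """
    if max_loras < 1 or max_num_seqs < 1:
        raise ValueError("max_loras and max_num_seqs must be >= 1")
    batches: list[list[str]] = []
    pending = list(requests)
    while pending:
        batch: list[str] = []
        adapters: set[str] = set()
        deferred: list[tuple[str, str | None]] = []
        for req_id, adapter in pending:
            needs_new_slot = adapter is not None and adapter not in adapters
            if len(batch) >= max_num_seqs or (needs_new_slot and len(adapters) >= max_loras):
                deferred.append((req_id, adapter))  # wait for a later batch
                continue
            batch.append(req_id)
            if adapter is not None:
                adapters.add(adapter)
        batches.append(batch)
        pending = deferred
    return batches


reqs = [("r0", "a"), ("r1", "b"), ("r2", "c"), ("r3", "d"),
        ("r4", "e"), ("r5", "a"), ("r6", None)]
print(pack_batches(reqs, max_loras=4, max_num_seqs=256))
# [['r0', 'r1', 'r2', 'r3', 'r5', 'r6'], ['r4']]
```

Here the fifth distinct adapter ("e") is deferred to a second batch, while r5 (reusing adapter "a") and r6 (base model) still fit in the first.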