Implementation:Vllm project Vllm EngineArgs LoRA Config

From Leeroopedia


Knowledge Sources
Domains: LLM Serving, Model Adaptation, Engine Configuration
Last Updated: 2026-02-08 13:00 GMT

Overview

Concrete tool, provided by vLLM, for configuring an inference engine with LoRA adapter support.

Description

The EngineArgs dataclass collects all configuration parameters for the vLLM engine, including LoRA-specific fields defined at lines 481-490 of vllm/engine/arg_utils.py. Setting enable_lora=True activates LoRA support, which causes the engine to allocate adapter slots and prepare the LoRA weight management infrastructure. The configured EngineArgs instance is then passed to LLMEngine.from_engine_args() (defined at lines 155-181 of vllm/v1/engine/llm_engine.py) to construct the engine with LoRA capabilities.

The LoRA-specific parameters in EngineArgs draw their defaults from LoRAConfig (defined in vllm/config/lora.py). LoRAConfig also validates the values: max_cpu_loras must be >= max_loras, and max_lora_rank must be one of 1, 8, 16, 32, 64, 128, 256, 320, or 512.
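The two constraints named above can be illustrated with a small stand-alone check. This is only a sketch: `LoRASettingsSketch` and `ALLOWED_LORA_RANKS` are hypothetical names rather than vLLM classes, the allowed-rank list is copied from the Inputs table on this page, and the real validation in vllm/config/lora.py may differ in detail.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed ranks as listed on this page; a hypothetical constant, not imported from vLLM.
ALLOWED_LORA_RANKS = (1, 8, 16, 32, 64, 128, 256, 320, 512)


@dataclass
class LoRASettingsSketch:
    """Hypothetical stand-in for the LoRA fields; not the real LoRAConfig."""

    max_loras: int = 1
    max_lora_rank: int = 16
    max_cpu_loras: Optional[int] = None

    def __post_init__(self) -> None:
        # Reject ranks outside the documented allowed set.
        if self.max_lora_rank not in ALLOWED_LORA_RANKS:
            raise ValueError(
                f"max_lora_rank={self.max_lora_rank} is not one of {ALLOWED_LORA_RANKS}"
            )
        # An unset CPU cache size falls back to max_loras.
        if self.max_cpu_loras is None:
            self.max_cpu_loras = self.max_loras
        # The CPU cache must hold at least as many adapters as one batch can use.
        elif self.max_cpu_loras < self.max_loras:
            raise ValueError(
                f"max_cpu_loras ({self.max_cpu_loras}) must be >= max_loras ({self.max_loras})"
            )


valid = LoRASettingsSketch(max_loras=4, max_lora_rank=32, max_cpu_loras=16)
defaulted = LoRASettingsSketch(max_loras=2)
```

Running a check of this kind before building EngineArgs surfaces a bad combination early, before any model weights are loaded.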

Usage

Use this API when initializing a vLLM engine for multi-LoRA serving. Create an EngineArgs instance with enable_lora=True and the desired LoRA parameters, then call LLMEngine.from_engine_args() to construct the engine.

Code Reference

Source Location

  • Repository: vllm
  • File: vllm/engine/arg_utils.py (lines 481-490 for LoRA params)
  • File: vllm/v1/engine/llm_engine.py (lines 155-181 for from_engine_args)

Signature

# EngineArgs construction with LoRA-related parameters
EngineArgs(
    model: str,
    enable_lora: bool = False,
    max_loras: int = 1,
    max_lora_rank: int = 16,
    max_cpu_loras: int | None = None,
    lora_dtype: str | torch.dtype | None = "auto",
    fully_sharded_loras: bool = False,
    max_num_seqs: int = 256,
    # ... additional non-LoRA parameters omitted
)

# Engine construction from args
LLMEngine.from_engine_args(
    engine_args: EngineArgs,
    usage_context: UsageContext = UsageContext.ENGINE_CONTEXT,
    stat_loggers: list[StatLoggerFactory] | None = None,
    enable_multiprocessing: bool = False,
) -> LLMEngine

Import

from vllm.engine.arg_utils import EngineArgs
from vllm.v1.engine.llm_engine import LLMEngine

I/O Contract

Inputs

Name | Type | Required | Description
model | str | Yes | HuggingFace model ID or local path to the base model (e.g., "meta-llama/Llama-3.2-3B-Instruct")
enable_lora | bool | Yes (must be True) | Activates LoRA adapter support in the engine. Must be set to True for multi-LoRA serving.
max_loras | int | No | Maximum number of LoRA adapters that can be active in a single batch. Default: 1. Higher values increase GPU memory usage.
max_lora_rank | int | No | Maximum supported rank for all LoRA adapters. Allowed values: 1, 8, 16, 32, 64, 128, 256, 320, 512. Default: 16.
max_cpu_loras | int or None | No | Maximum number of LoRA adapters cached in CPU memory. Must be >= max_loras. Default: None (set equal to max_loras).
lora_dtype | str, torch.dtype, or None | No | Data type for LoRA computations. "auto" uses the base model dtype. Default: "auto".
fully_sharded_loras | bool | No | Enable fully sharded LoRA computation across tensor-parallel ranks. Default: False.
max_num_seqs | int | No | Maximum number of sequences per iteration. Default: 256.
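Because max_lora_rank accepts only a fixed set of values, a deployment serving adapters of mixed ranks has to round up to the next allowed value. The helper below is a minimal sketch of that choice. `smallest_allowed_rank` is a hypothetical function, not part of vLLM, and it assumes that no adapter's rank may exceed max_lora_rank, as the description "maximum supported rank for all LoRA adapters" suggests.

```python
from typing import Iterable

# Allowed values copied from the Inputs table; hypothetical constant, not a vLLM import.
ALLOWED_LORA_RANKS = (1, 8, 16, 32, 64, 128, 256, 320, 512)


def smallest_allowed_rank(adapter_ranks: Iterable[int]) -> int:
    """Return the smallest allowed max_lora_rank that covers every adapter rank."""
    ranks = list(adapter_ranks)
    if not ranks:
        raise ValueError("at least one adapter rank is required")
    needed = max(ranks)
    # ALLOWED_LORA_RANKS is sorted ascending, so the first match is the smallest.
    for allowed in ALLOWED_LORA_RANKS:
        if allowed >= needed:
            return allowed
    raise ValueError(f"adapter rank {needed} exceeds the largest allowed value")


# Adapters trained at ranks 8 and 24 need max_lora_rank=32.
chosen = smallest_allowed_rank([8, 24])
```

The result can be passed as max_lora_rank when constructing EngineArgs. Choosing the smallest value that fits likely avoids reserving capacity for ranks that no adapter uses.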

Outputs

Name | Type | Description
engine | LLMEngine | A fully initialized LLM engine with LoRA support enabled, ready to accept requests with per-request LoRA adapters
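To illustrate what "per-request LoRA adapters" can look like in practice, the sketch below attaches an adapter to a single request and then steps the engine until it is idle. It assumes the engine exposes `add_request`, `step`, and `has_unfinished_requests`, and that `LoRARequest` and `SamplingParams` are importable as shown. None of these calls are described on this page, so treat them as assumptions to check against the installed vLLM version. `submit_with_adapter` and `drain` are hypothetical helper names.

```python
def submit_with_adapter(engine, request_id, prompt, lora_name, lora_int_id,
                        lora_path, max_tokens=64):
    """Queue one prompt on `engine`, bound to a single LoRA adapter."""
    # Imported lazily so this module can be loaded without vLLM installed.
    from vllm import SamplingParams
    from vllm.lora.request import LoRARequest

    engine.add_request(
        request_id,
        prompt,
        SamplingParams(max_tokens=max_tokens),
        # The adapter is chosen per request; the integer ID identifies it to the engine.
        lora_request=LoRARequest(
            lora_name=lora_name,
            lora_int_id=lora_int_id,
            lora_path=lora_path,
        ),
    )


def drain(engine):
    """Step the engine until no requests remain; return the finished outputs."""
    finished = []
    while engine.has_unfinished_requests():
        for output in engine.step():
            if output.finished:
                finished.append(output)
    return finished
```

A caller would invoke `submit_with_adapter` once per prompt, passing the path of a locally available adapter, and then call `drain(engine)` to collect results. Requests that name different adapters can share a batch only up to the max_loras limit.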

Usage Examples

Initialize Engine for Multi-LoRA Serving

from vllm import EngineArgs, LLMEngine

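# One adapter active per batch, with room for two adapters in the CPU cache
engine_args = EngineArgs(
    model="meta-llama/Llama-3.2-3B-Instruct",
    enable_lora=True,   # required for LoRA support
    max_loras=1,        # adapters active in a single batch
    max_lora_rank=8,    # must be one of the allowed rank values
    max_cpu_loras=2,    # must be >= max_loras
    max_num_seqs=256,   # sequences per iteration (the default)
)
engine = LLMEngine.from_engine_args(engine_args)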

High-Throughput Multi-Adapter Configuration

from vllm import EngineArgs, LLMEngine

# Up to 4 adapters active per batch, with up to 16 adapters cached in CPU memory
engine_args = EngineArgs(
    model="meta-llama/Llama-3.2-3B-Instruct",
    enable_lora=True,
    max_loras=4,
    max_lora_rank=32,
    max_cpu_loras=16,          # must be >= max_loras
    fully_sharded_loras=True,  # shards LoRA computation across tensor-parallel ranks
)
engine = LLMEngine.from_engine_args(engine_args)

Related Pages

Implements Principle
