Implementation: vLLM EngineArgs Init (vllm project)
| Knowledge Sources | |
|---|---|
| Domains | LLM Serving, Model Configuration, GPU Computing |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
A concrete configuration entry point for the vLLM inference engine, provided by the vllm library.
Description
EngineArgs is a Python dataclass that defines every tunable parameter for the vLLM serving engine. It aggregates configuration for model loading, parallelism strategy, memory management, quantization, LoRA adapters, speculative decoding, scheduling, and observability into a single flat structure. Each field has a sensible default derived from the corresponding sub-config class (e.g., ModelConfig, ParallelConfig, CacheConfig).
At startup, EngineArgs is either constructed directly in Python or populated from CLI arguments via EngineArgs.add_cli_args() and AsyncEngineArgs.from_cli_args(). The create_engine_config() method then validates and transforms the flat arguments into a structured VllmConfig object consumed by the engine internals.
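The flat-to-structured transformation can be illustrated with a simplified, self-contained sketch. The classes below are illustrative stand-ins, not vLLM's actual internals; only the overall pattern (flat dataclass validated and regrouped into sub-configs) mirrors the real code:

```python
from dataclasses import dataclass

# Simplified stand-ins for vLLM's sub-config classes (illustrative only).
@dataclass
class ModelConfig:
    model: str
    dtype: str = "auto"

@dataclass
class ParallelConfig:
    tensor_parallel_size: int = 1

@dataclass
class VllmConfig:
    model_config: ModelConfig
    parallel_config: ParallelConfig

@dataclass
class EngineArgsSketch:
    """Flat argument bag, mirroring how EngineArgs aggregates sub-configs."""
    model: str
    dtype: str = "auto"
    tensor_parallel_size: int = 1

    def create_engine_config(self) -> VllmConfig:
        # Validate, then group the flat fields into structured sub-configs.
        if self.tensor_parallel_size < 1:
            raise ValueError("tensor_parallel_size must be >= 1")
        return VllmConfig(
            model_config=ModelConfig(model=self.model, dtype=self.dtype),
            parallel_config=ParallelConfig(self.tensor_parallel_size),
        )

config = EngineArgsSketch(model="my-model", tensor_parallel_size=2).create_engine_config()
```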
Usage
Use EngineArgs when you need to programmatically configure a vLLM engine instance. For CLI-based deployment, the same parameters are exposed as command-line flags by vllm serve. The most commonly tuned parameters are model, tensor_parallel_size, dtype, quantization, gpu_memory_utilization, and max_model_len.
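The CLI flags exposed by vllm serve follow the usual argparse convention of mapping field names to hyphenated flags (e.g. tensor_parallel_size becomes --tensor-parallel-size). A minimal sketch of that field-to-flag derivation, using an illustrative dataclass rather than the real EngineArgs:

```python
import argparse
from dataclasses import dataclass, fields

# Illustrative subset of engine arguments (not the real EngineArgs).
@dataclass
class MiniArgs:
    model: str = ""
    tensor_parallel_size: int = 1
    gpu_memory_utilization: float = 0.9

def add_cli_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    # Derive one flag per dataclass field: underscores become hyphens
    # in the flag name; argparse maps them back for attribute access.
    for f in fields(MiniArgs):
        parser.add_argument(f"--{f.name.replace('_', '-')}",
                            type=f.type, default=f.default)
    return parser

parser = add_cli_args(argparse.ArgumentParser())
ns = parser.parse_args(["--model", "meta-llama/Llama-2-7b-hf",
                        "--tensor-parallel-size", "2"])
args = MiniArgs(**vars(ns))
```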
Code Reference
Source Location
- Repository: vllm
- File: vllm/engine/arg_utils.py (lines 351-590+)
Signature
```python
@dataclass
class EngineArgs:
    """Arguments for vLLM engine."""
    model: str = ModelConfig.model
    tensor_parallel_size: int = ParallelConfig.tensor_parallel_size
    dtype: ModelDType = ModelConfig.dtype  # default "auto"
    quantization: QuantizationMethods | None = ModelConfig.quantization
    gpu_memory_utilization: float = CacheConfig.gpu_memory_utilization  # default 0.9
    max_model_len: int | None = ModelConfig.max_model_len
    enable_lora: bool = False
    speculative_config: dict[str, Any] | None = None
    pipeline_parallel_size: int = ParallelConfig.pipeline_parallel_size
    data_parallel_size: int = ParallelConfig.data_parallel_size
    trust_remote_code: bool = ModelConfig.trust_remote_code
    tokenizer: str | None = ModelConfig.tokenizer
    seed: int = ModelConfig.seed
    max_num_seqs: int | None = None
    enforce_eager: bool = ModelConfig.enforce_eager
    ...
```
Import
```python
from vllm.engine.arg_utils import EngineArgs

# Or for async serving:
from vllm.engine.arg_utils import AsyncEngineArgs
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | str | Yes | HuggingFace model ID or local path to the model checkpoint. |
| tensor_parallel_size | int | No | Number of GPUs for tensor parallelism. Default: 1. |
| dtype | str | No | Data type for model weights: "auto", "float16", "bfloat16", "float32". Default: "auto" (inferred from model config). |
| quantization | str \| None | No | Quantization method: "awq", "gptq", "squeezellm", "fp8", etc. Default: None (no quantization). |
| gpu_memory_utilization | float | No | Fraction of GPU memory to use for the engine (0.0-1.0). Default: 0.9. |
| max_model_len | int \| None | No | Maximum sequence length (prompt + generation). Default: None (derived from model config). |
| enable_lora | bool | No | Enable LoRA adapter support. Default: False. |
| speculative_config | dict \| None | No | Configuration for speculative decoding. Default: None (disabled). |
| pipeline_parallel_size | int | No | Number of pipeline parallel stages. Default: 1. |
| data_parallel_size | int | No | Number of data parallel replicas. Default: 1. |
| trust_remote_code | bool | No | Allow execution of remote code in model repositories. Default: False. |
| seed | int | No | Random seed for reproducibility. Default: 0. |
| enforce_eager | bool | No | Disable CUDA graph capture; use eager mode. Default: False. |
Outputs
| Name | Type | Description |
|---|---|---|
| EngineArgs instance | EngineArgs | A configured dataclass instance. Call create_engine_config() to produce a validated VllmConfig. |
Usage Examples
Basic Engine Configuration
```python
from vllm.engine.arg_utils import EngineArgs

# Configure for a 7B model on 2 GPUs with half precision
engine_args = EngineArgs(
    model="meta-llama/Llama-2-7b-chat-hf",
    tensor_parallel_size=2,
    dtype="float16",
    gpu_memory_utilization=0.9,
    max_model_len=4096,
)

# Convert to engine config for internal use
vllm_config = engine_args.create_engine_config()
```
Quantized Model Configuration
```python
from vllm.engine.arg_utils import EngineArgs

# Serve a 70B model with AWQ quantization on 4 GPUs
engine_args = EngineArgs(
    model="TheBloke/Llama-2-70B-Chat-AWQ",
    tensor_parallel_size=4,
    quantization="awq",
    gpu_memory_utilization=0.85,
    max_model_len=4096,
)
```
Configuration with LoRA Support
```python
from vllm.engine.arg_utils import EngineArgs

engine_args = EngineArgs(
    model="meta-llama/Llama-2-7b-hf",
    enable_lora=True,
    max_loras=4,
    max_lora_rank=16,
)
```