
Implementation:Vllm project Vllm EngineArgs Init

From Leeroopedia


Knowledge Sources
Domains LLM Serving, Model Configuration, GPU Computing
Last Updated 2026-02-08 13:00 GMT

Overview

A concrete tool, provided by the vllm library, for configuring the vLLM inference engine.

Description

EngineArgs is a Python dataclass that defines every tunable parameter for the vLLM serving engine. It aggregates configuration for model loading, parallelism strategy, memory management, quantization, LoRA adapters, speculative decoding, scheduling, and observability into a single flat structure. Each field has a sensible default derived from the corresponding sub-config class (e.g., ModelConfig, ParallelConfig, CacheConfig).

At startup, EngineArgs is either constructed directly in Python or populated from CLI arguments via EngineArgs.add_cli_args() and AsyncEngineArgs.from_cli_args(). The create_engine_config() method then validates and transforms the flat arguments into a structured VllmConfig object consumed by the engine internals.
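The CLI round-trip (add_cli_args() to register flags, from_cli_args() to rebuild the dataclass) can be illustrated with a stdlib-only sketch. MiniEngineArgs is a hypothetical stand-in with a tiny field subset; the real methods live on EngineArgs and handle far more cases:

```python
import argparse
from dataclasses import dataclass, fields

@dataclass
class MiniEngineArgs:
    # Tiny illustrative subset of EngineArgs fields.
    model: str = "facebook/opt-125m"
    tensor_parallel_size: int = 1
    gpu_memory_utilization: float = 0.9

    @staticmethod
    def add_cli_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
        # Expose each dataclass field as a --flag (underscores become dashes),
        # mirroring the pattern of EngineArgs.add_cli_args().
        for f in fields(MiniEngineArgs):
            parser.add_argument(f"--{f.name.replace('_', '-')}",
                                type=f.type, default=f.default)
        return parser

    @classmethod
    def from_cli_args(cls, args: argparse.Namespace) -> "MiniEngineArgs":
        # Rebuild the dataclass from only the attributes that match fields.
        return cls(**{f.name: getattr(args, f.name) for f in fields(cls)})

parser = MiniEngineArgs.add_cli_args(argparse.ArgumentParser())
ns = parser.parse_args(["--model", "meta-llama/Llama-2-7b-hf",
                        "--tensor-parallel-size", "2"])
engine_args = MiniEngineArgs.from_cli_args(ns)
```

After parsing, engine_args carries the overridden model and tensor_parallel_size alongside the untouched default for gpu_memory_utilization.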

Usage

Use EngineArgs when you need to programmatically configure a vLLM engine instance. For CLI-based deployment, the same parameters are exposed as command-line flags by vllm serve. The most commonly tuned parameters are model, tensor_parallel_size, dtype, quantization, gpu_memory_utilization, and max_model_len.
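For CLI deployment, each EngineArgs field maps to a flag with underscores replaced by dashes. A sketch of such an invocation (requires vLLM installed and GPUs available; exact flag availability depends on the installed version):

```shell
# Serve a 7B chat model on 2 GPUs in half precision.
vllm serve meta-llama/Llama-2-7b-chat-hf \
    --tensor-parallel-size 2 \
    --dtype float16 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 4096
```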

Code Reference

Source Location

  • Repository: vllm
  • File: vllm/engine/arg_utils.py (Lines 351-590+)

Signature

@dataclass
class EngineArgs:
    """Arguments for vLLM engine."""
    model: str = ModelConfig.model
    tensor_parallel_size: int = ParallelConfig.tensor_parallel_size
    dtype: ModelDType = ModelConfig.dtype  # default "auto"
    quantization: QuantizationMethods | None = ModelConfig.quantization
    gpu_memory_utilization: float = CacheConfig.gpu_memory_utilization  # default 0.9
    max_model_len: int | None = ModelConfig.max_model_len
    enable_lora: bool = False
    speculative_config: dict[str, Any] | None = None
    pipeline_parallel_size: int = ParallelConfig.pipeline_parallel_size
    data_parallel_size: int = ParallelConfig.data_parallel_size
    trust_remote_code: bool = ModelConfig.trust_remote_code
    tokenizer: str | None = ModelConfig.tokenizer
    seed: int = ModelConfig.seed
    max_num_seqs: int | None = None
    enforce_eager: bool = ModelConfig.enforce_eager
    ...

Import

from vllm.engine.arg_utils import EngineArgs
# Or for async serving:
from vllm.engine.arg_utils import AsyncEngineArgs

I/O Contract

Inputs

Name Type Required Description
model str Yes HuggingFace model ID or local path to the model checkpoint.
tensor_parallel_size int No Number of GPUs for tensor parallelism. Default: 1.
dtype str No Data type for model weights: "auto", "float16", "bfloat16", "float32". Default: "auto" (inferred from model config).
quantization str | None No Quantization method: "awq", "gptq", "squeezellm", "fp8", etc. Default: None (no quantization).
gpu_memory_utilization float No Fraction of GPU memory to use for the engine (0.0-1.0). Default: 0.9.
max_model_len int | None No Maximum sequence length (prompt + generation). Default: None (derived from model config).
enable_lora bool No Enable LoRA adapter support. Default: False.
speculative_config dict | None No Configuration for speculative decoding. Default: None (disabled).
pipeline_parallel_size int No Number of pipeline parallel stages. Default: 1.
data_parallel_size int No Number of data parallel replicas. Default: 1.
trust_remote_code bool No Allow execution of remote code in model repositories. Default: False.
seed int No Random seed for reproducibility. Default: 0.
enforce_eager bool No Disable CUDA graph capture; run in eager mode. Default: False.
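create_engine_config() validates ranges like these before building the VllmConfig. A hypothetical stdlib sketch of the kind of check involved (illustrative only, not vLLM's actual validation code):

```python
def check_gpu_memory_utilization(value: float) -> float:
    # gpu_memory_utilization must be a fraction in (0.0, 1.0];
    # values above 1.0 would oversubscribe GPU memory.
    if not 0.0 < value <= 1.0:
        raise ValueError(
            f"gpu_memory_utilization must be in (0, 1], got {value}")
    return value

check_gpu_memory_utilization(0.9)  # passes
```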

Outputs

Name Type Description
EngineArgs instance EngineArgs A configured dataclass instance. Call create_engine_config() to produce a validated VllmConfig.

Usage Examples

Basic Engine Configuration

from vllm.engine.arg_utils import EngineArgs

# Configure for a 7B model on 2 GPUs with half precision
engine_args = EngineArgs(
    model="meta-llama/Llama-2-7b-chat-hf",
    tensor_parallel_size=2,
    dtype="float16",
    gpu_memory_utilization=0.9,
    max_model_len=4096,
)

# Convert to engine config for internal use
vllm_config = engine_args.create_engine_config()

Quantized Model Configuration

from vllm.engine.arg_utils import EngineArgs

# Serve a 70B model with AWQ quantization on 4 GPUs
engine_args = EngineArgs(
    model="TheBloke/Llama-2-70B-Chat-AWQ",
    tensor_parallel_size=4,
    quantization="awq",
    gpu_memory_utilization=0.85,
    max_model_len=4096,
)

Configuration with LoRA Support

from vllm.engine.arg_utils import EngineArgs

engine_args = EngineArgs(
    model="meta-llama/Llama-2-7b-hf",
    enable_lora=True,
    max_loras=4,
    max_lora_rank=16,
)

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
