Implementation:Pytorch Serve vLLM Engine Config
| Field | Value |
|---|---|
| Page Type | Implementation |
| Implementation Type | Pattern Doc |
| Domains | LLM_Serving, Configuration |
| Knowledge Sources | TorchServe |
| Workflow | LLM_Deployment_vLLM |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
This page documents the concrete YAML configuration patterns used to configure vLLM engine parameters within TorchServe's model-config.yaml files. These configurations control how the vLLM engine loads models, manages batching, shards across GPUs, and serves LoRA adapters. Examples are drawn directly from the TorchServe repository's example configurations for Llama 3, Mistral, and LoRA-enabled serving.
Description
The model-config.yaml file is the declarative interface between TorchServe's Java frontend and the Python vLLM handler. It is structured into two logical blocks: frontend parameters (consumed by the TorchServe Java process) and handler parameters (consumed by VLLMHandler.initialize()).
The handler reads the YAML via ctx.model_yaml_config and extracts the vllm_engine_config dictionary, which is mapped onto vLLM's AsyncEngineArgs dataclass. Any parameter supported by AsyncEngineArgs can be set in this section.
Usage
Creating a Model Configuration
Create a model-config.yaml file and include it when packaging the model archive:
torch-model-archiver --model-name my_llm \
--version 1.0 \
--handler vllm_handler \
--config-file model-config.yaml \
--archive-format no-archive
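A minimal model-config.yaml for this command might look like the following; the model_path here is a placeholder, and the full examples below show real configurations:

```yaml
# Frontend parameters (consumed by the TorchServe Java process)
minWorkers: 1
maxWorkers: 1
asyncCommunication: true

# Handler parameters (consumed by VLLMHandler.initialize())
handler:
  model_path: "path/to/model"   # placeholder: weights path or HF model ID
  vllm_engine_config:
    max_num_seqs: 16
```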
Code Reference
Source Location
| File | Lines | Description |
|---|---|---|
| `examples/large_models/vllm/llama3/model-config.yaml` | L1-16 | Llama 3.1 8B configuration |
| `examples/large_models/vllm/mistral/model-config.yaml` | L1-16 | Mistral 7B with tensor parallelism |
| `examples/large_models/vllm/lora/model-config.yaml` | L1-23 | LoRA-enabled Llama 3.1 8B configuration |
| `ts/torch_handler/vllm_handler.py` | L163-182 | `_get_vllm_engine_config()` parses the YAML into `AsyncEngineArgs` |
Signature
The configuration is consumed by the VLLMHandler's internal method:
def _get_vllm_engine_config(self, handler_config: dict):
"""
Parse handler config dict into vLLM AsyncEngineArgs.
Parameters:
handler_config (dict): The 'handler' section of model-config.yaml.
Returns:
AsyncEngineArgs: Configured engine arguments for vLLM.
"""
Import
# The configuration is YAML-based and does not require Python imports.
# It is consumed internally by:
from vllm import AsyncEngineArgs
# within VLLMHandler._get_vllm_engine_config()
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | YAML file | `model-config.yaml` bundled in the model archive or referenced via `--config-file` |
| Output | Python object | `AsyncEngineArgs` dataclass populated with engine parameters |
| Precondition | File system | Model weights accessible at the path specified in `handler.model_path` |
| Postcondition | Engine state | vLLM `AsyncLLMEngine` initialized with the specified configuration |
Key Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `handler.model_path` | string | (required) | Path to model weights or HuggingFace model ID |
| `handler.vllm_engine_config.max_num_seqs` | int | 256 | Maximum concurrent sequences in the continuous batch |
| `handler.vllm_engine_config.max_model_len` | int | model default | Maximum context length in tokens |
| `handler.vllm_engine_config.tensor_parallel_size` | int | 1 | Number of GPUs for tensor parallelism |
| `handler.vllm_engine_config.served_model_name` | list[string] | [model name] | Model aliases for the OpenAI-compatible API |
| `handler.vllm_engine_config.enable_lora` | bool | false | Enable LoRA adapter support |
| `handler.vllm_engine_config.max_loras` | int | -- | Maximum simultaneously loaded LoRA adapters |
| `handler.vllm_engine_config.max_cpu_loras` | int | -- | Maximum LoRA adapters cached on CPU |
| `handler.vllm_engine_config.max_lora_rank` | int | -- | Maximum rank for LoRA adapters |
| `handler.adapters` | dict | {} | Mapping of adapter name to adapter weight path |
Usage Examples
Example 1: Llama 3.1 8B Basic Configuration
From examples/large_models/vllm/llama3/model-config.yaml:
# TorchServe frontend parameters
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
startupTimeout: 1200
deviceType: "gpu"
asyncCommunication: true
handler:
model_path: "model/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/8c22764a7e3675c50d4c7c9a4edb474456022b16"
vllm_engine_config:
max_num_seqs: 16
max_model_len: 250
served_model_name:
- "meta-llama/Meta-Llama-3.1-8B"
- "llama3-8b"
This configuration:
- Runs a single worker (vLLM handles concurrency internally)
- Limits the context window to 250 tokens (for testing; production would use higher values)
- Allows up to 16 concurrent sequences in the continuous batch
- Exposes the model under two aliases
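A client can address the model under either alias. The payload below is a hypothetical OpenAI-style completion request; the exact endpoint path depends on the TorchServe version and the registered model name:

```python
import json

# Either served_model_name alias is accepted in the "model" field.
aliases = ["meta-llama/Meta-Llama-3.1-8B", "llama3-8b"]

payload = {
    "model": aliases[1],      # the short alias
    "prompt": "Hello, world",
    "max_tokens": 32,
}
body = json.dumps(payload)
# POST this body to TorchServe's OpenAI-compatible completions route.
```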
Example 2: Mistral 7B with Tensor Parallelism
From examples/large_models/vllm/mistral/model-config.yaml:
# TorchServe frontend parameters
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
startupTimeout: 1200
deviceType: "gpu"
asyncCommunication: true
handler:
model_path: "model/models--mistralai--Mistral-7B-v0.1/snapshots/26bca36bde8333b5d7f72e9ed20ccda6a618af24"
vllm_engine_config:
max_model_len: 250
max_num_seqs: 16
tensor_parallel_size: 4
served_model_name:
- "mistral"
This configuration:
- Shards the model across 4 GPUs using tensor parallelism
- Each GPU holds approximately 1/4 of the model's weight matrices
- Requires a node with at least 4 NVIDIA GPUs and NCCL communication
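A back-of-envelope estimate shows why sharding matters. The parameter count below is approximate, and the calculation covers fp16 weights only, ignoring KV cache and activations:

```python
# Rough per-GPU weight memory under tensor parallelism.
params = 7.24e9            # Mistral 7B parameter count (approximate)
bytes_per_param = 2        # fp16/bf16 weights
tensor_parallel_size = 4   # matches the config above

weights_gb_total = params * bytes_per_param / 1024**3
weights_gb_per_gpu = weights_gb_total / tensor_parallel_size
# ~13.5 GB total shrinks to ~3.4 GB of weights per GPU, leaving
# headroom on each device for the KV cache and activations.
```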
Example 3: Llama 3.1 8B with LoRA Adapters
From examples/large_models/vllm/lora/model-config.yaml:
# TorchServe frontend parameters
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
startupTimeout: 1200
deviceType: "gpu"
asyncCommunication: true
handler:
model_path: "model/models--meta-llama--Meta-Llama-3.1-8B/snapshots/48d6d0fc4e02fb1269b36940650a1b7233035cbb"
vllm_engine_config:
enable_lora: true
max_loras: 4
max_cpu_loras: 4
max_lora_rank: 32
max_num_seqs: 16
max_model_len: 250
served_model_name:
- "meta-llama/Meta-Llama-3.1-8B"
- "llama-8b-lora"
adapters:
adapter_1: "adapters/model/models--llama-duo--llama3.1-8b-summarize-gpt4o-128k/snapshots/4ba83353f24fa38946625c8cc49bf21c80a22825"
This configuration:
- Enables LoRA support with up to 4 simultaneously loaded adapters
- Sets maximum LoRA rank to 32
- Registers one adapter (adapter_1) pointing to a summarization fine-tune
- The base model is shared; adapters add minimal memory overhead
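With vLLM's OpenAI-compatible serving, a registered LoRA adapter is typically selected per request by passing its name in the "model" field; whether this TorchServe setup follows that convention exactly is an assumption here. A hypothetical pair of payloads:

```python
import json

# Base-model request vs. adapter-targeted request: only the "model"
# field differs; vLLM applies the adapter weights on top of the
# shared base model at request time.
base_request = {"model": "llama-8b-lora", "prompt": "Summarize: ...", "max_tokens": 64}
lora_request = {"model": "adapter_1", "prompt": "Summarize: ...", "max_tokens": 64}

bodies = [json.dumps(r) for r in (base_request, lora_request)]
```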
Related Pages
- Principle:Pytorch_Serve_vLLM_Model_Configuration -- the theoretical basis for declarative model configuration in LLM serving
- Environment:Pytorch_Serve_vLLM_Engine_Environment -- the vLLM engine and its dependencies