Implementation:Pytorch Serve vLLM Engine Config
| Field | Value |
|---|---|
| Page Type | Implementation |
| Implementation Type | Pattern Doc |
| Domains | LLM_Serving, Configuration |
| Knowledge Sources | TorchServe |
| Workflow | LLM_Deployment_vLLM |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
This page documents the concrete YAML configuration patterns used to configure vLLM engine parameters within TorchServe's model-config.yaml files. These configurations control how the vLLM engine loads models, manages batching, shards across GPUs, and serves LoRA adapters. Examples are drawn directly from the TorchServe repository's example configurations for Llama 3, Mistral, and LoRA-enabled serving.
Description
The model-config.yaml file is the declarative interface between TorchServe's Java frontend and the Python vLLM handler. It is structured into two logical blocks: frontend parameters (consumed by the TorchServe Java process) and handler parameters (consumed by VLLMHandler.initialize()).
The handler reads the YAML via ctx.model_yaml_config and extracts the vllm_engine_config dictionary, which is mapped onto vLLM's AsyncEngineArgs dataclass. Any parameter supported by AsyncEngineArgs can be set in this section.
Usage
Creating a Model Configuration
Create a model-config.yaml file and include it when packaging the model archive:
torch-model-archiver --model-name my_llm \
--version 1.0 \
--handler vllm_handler \
--config-file model-config.yaml \
--archive-format no-archive
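A minimal model-config.yaml for this command might look like the following; the model_path here is a placeholder, and the full examples below show real configurations:

```yaml
# Frontend parameters (consumed by the TorchServe Java process)
minWorkers: 1
maxWorkers: 1
asyncCommunication: true

# Handler parameters (consumed by VLLMHandler.initialize())
handler:
  model_path: "path/to/model"   # placeholder: weights path or HF model ID
  vllm_engine_config:
    max_num_seqs: 16
```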
Code Reference
Source Location
| File | Lines | Description |
|---|---|---|
| `examples/large_models/vllm/llama3/model-config.yaml` | L1-16 | Llama 3.1 8B configuration |
| `examples/large_models/vllm/mistral/model-config.yaml` | L1-16 | Mistral 7B with tensor parallelism |
| `examples/large_models/vllm/lora/model-config.yaml` | L1-23 | LoRA-enabled Llama 3.1 8B configuration |
| `ts/torch_handler/vllm_handler.py` | L163-182 | `_get_vllm_engine_config()` parses the YAML into `AsyncEngineArgs` |
Signature
The configuration is consumed by the VLLMHandler's internal method:
def _get_vllm_engine_config(self, handler_config: dict):
"""
Parse handler config dict into vLLM AsyncEngineArgs.
Parameters:
handler_config (dict): The 'handler' section of model-config.yaml.
Returns:
AsyncEngineArgs: Configured engine arguments for vLLM.
"""
Import
# The configuration is YAML-based and does not require Python imports.
# It is consumed internally by:
from vllm import AsyncEngineArgs
# within VLLMHandler._get_vllm_engine_config()
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | YAML file | `model-config.yaml` bundled in the model archive or referenced via `--config-file` |
| Output | Python object | `AsyncEngineArgs` dataclass populated with engine parameters |
| Precondition | File system | Model weights accessible at the path specified in `handler.model_path` |
| Postcondition | Engine state | vLLM `AsyncLLMEngine` initialized with the specified configuration |
Key Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `handler.model_path` | string | (required) | Path to model weights or HuggingFace model ID |
| `handler.vllm_engine_config.max_num_seqs` | int | 256 | Maximum concurrent sequences in the continuous batch |
| `handler.vllm_engine_config.max_model_len` | int | model default | Maximum context length in tokens |
| `handler.vllm_engine_config.tensor_parallel_size` | int | 1 | Number of GPUs for tensor parallelism |
| `handler.vllm_engine_config.served_model_name` | list[string] | [model name] | Model aliases for the OpenAI-compatible API |
| `handler.vllm_engine_config.enable_lora` | bool | false | Enable LoRA adapter support |
| `handler.vllm_engine_config.max_loras` | int | -- | Maximum simultaneously loaded LoRA adapters |
| `handler.vllm_engine_config.max_cpu_loras` | int | -- | Maximum LoRA adapters cached on CPU |
| `handler.vllm_engine_config.max_lora_rank` | int | -- | Maximum rank for LoRA adapters |
| `handler.adapters` | dict | {} | Mapping of adapter name to adapter weight path |
Usage Examples
Example 1: Llama 3.1 8B Basic Configuration
From examples/large_models/vllm/llama3/model-config.yaml:
# TorchServe frontend parameters
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
startupTimeout: 1200
deviceType: "gpu"
asyncCommunication: true
handler:
model_path: "model/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/8c22764a7e3675c50d4c7c9a4edb474456022b16"
vllm_engine_config:
max_num_seqs: 16
max_model_len: 250
served_model_name:
- "meta-llama/Meta-Llama-3.1-8B"
- "llama3-8b"
This configuration:
- Runs a single worker (vLLM handles concurrency internally)
- Limits the context window to 250 tokens (for testing; production would use higher values)
- Allows up to 16 concurrent sequences in the continuous batch
- Exposes the model under two aliases
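A client can address the model under either alias. The payload below is a hypothetical OpenAI-style completion request; the exact endpoint path depends on the TorchServe version and the registered model name:

```python
import json

# Either served_model_name alias is accepted in the "model" field.
aliases = ["meta-llama/Meta-Llama-3.1-8B", "llama3-8b"]

payload = {
    "model": aliases[1],      # the short alias
    "prompt": "Hello, world",
    "max_tokens": 32,
}
body = json.dumps(payload)
# POST this body to TorchServe's OpenAI-compatible completions route.
```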
Example 2: Mistral 7B with Tensor Parallelism
From examples/large_models/vllm/mistral/model-config.yaml:
# TorchServe frontend parameters
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
startupTimeout: 1200
deviceType: "gpu"
asyncCommunication: true
handler:
model_path: "model/models--mistralai--Mistral-7B-v0.1/snapshots/26bca36bde8333b5d7f72e9ed20ccda6a618af24"
vllm_engine_config:
max_model_len: 250
max_num_seqs: 16
tensor_parallel_size: 4
served_model_name:
- "mistral"
This configuration:
- Shards the model across 4 GPUs using tensor parallelism
- Each GPU holds approximately 1/4 of the model's weight matrices
- Requires a node with at least 4 NVIDIA GPUs and NCCL communication
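A back-of-envelope estimate shows why sharding matters. The parameter count below is approximate, and the calculation covers fp16 weights only, ignoring KV cache and activations:

```python
# Rough per-GPU weight memory under tensor parallelism.
params = 7.24e9            # Mistral 7B parameter count (approximate)
bytes_per_param = 2        # fp16/bf16 weights
tensor_parallel_size = 4   # matches the config above

weights_gb_total = params * bytes_per_param / 1024**3
weights_gb_per_gpu = weights_gb_total / tensor_parallel_size
# ~13.5 GB total shrinks to ~3.4 GB of weights per GPU, leaving
# headroom on each device for the KV cache and activations.
```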
Example 3: Llama 3.1 8B with LoRA Adapters
From examples/large_models/vllm/lora/model-config.yaml:
# TorchServe frontend parameters
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
startupTimeout: 1200
deviceType: "gpu"
asyncCommunication: true
handler:
model_path: "model/models--meta-llama--Meta-Llama-3.1-8B/snapshots/48d6d0fc4e02fb1269b36940650a1b7233035cbb"
vllm_engine_config:
enable_lora: true
max_loras: 4
max_cpu_loras: 4
max_lora_rank: 32
max_num_seqs: 16
max_model_len: 250
served_model_name:
- "meta-llama/Meta-Llama-3.1-8B"
- "llama-8b-lora"
adapters:
adapter_1: "adapters/model/models--llama-duo--llama3.1-8b-summarize-gpt4o-128k/snapshots/4ba83353f24fa38946625c8cc49bf21c80a22825"
This configuration:
- Enables LoRA support with up to 4 simultaneously loaded adapters
- Sets maximum LoRA rank to 32
- Registers one adapter (adapter_1) pointing to a summarization fine-tune
- The base model is shared; adapters add minimal memory overhead
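With vLLM's OpenAI-compatible serving, a registered LoRA adapter is typically selected per request by passing its name in the "model" field; whether this TorchServe setup follows that convention exactly is an assumption here. A hypothetical pair of payloads:

```python
import json

# Base-model request vs. adapter-targeted request: only the "model"
# field differs; vLLM applies the adapter weights on top of the
# shared base model at request time.
base_request = {"model": "llama-8b-lora", "prompt": "Summarize: ...", "max_tokens": 64}
lora_request = {"model": "adapter_1", "prompt": "Summarize: ...", "max_tokens": 64}

bodies = [json.dumps(r) for r in (base_request, lora_request)]
```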
Related Pages
- Principle:Pytorch_Serve_vLLM_Model_Configuration -- the theoretical basis for declarative model configuration in LLM serving
- Environment:Pytorch_Serve_vLLM_Engine_Environment -- the vLLM engine and its dependencies