
Implementation:Pytorch Serve vLLM Engine Config

From Leeroopedia
Field | Value
Page Type | Implementation
Implementation Type | Pattern Doc
Domains | LLM_Serving, Configuration
Knowledge Sources | TorchServe
Workflow | LLM_Deployment_vLLM
Last Updated | 2026-02-13 00:00 GMT

Overview

This page documents the concrete YAML configuration patterns used to configure vLLM engine parameters within TorchServe's model-config.yaml files. These configurations control how the vLLM engine loads models, manages batching, shards across GPUs, and serves LoRA adapters. Examples are drawn directly from the TorchServe repository's example configurations for Llama 3, Mistral, and LoRA-enabled serving.

Description

The model-config.yaml file is the declarative interface between TorchServe's Java frontend and the Python vLLM handler. It is structured into two logical blocks: frontend parameters (consumed by the TorchServe Java process) and handler parameters (consumed by VLLMHandler.initialize()).

The handler reads the YAML via ctx.model_yaml_config and extracts the vllm_engine_config dictionary, which is mapped onto vLLM's AsyncEngineArgs dataclass. Any parameter supported by AsyncEngineArgs can be set in this section.
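This mapping can be sketched in plain Python. The dataclass below is a reduced stand-in for vLLM's `AsyncEngineArgs` (the real one defines many more fields); the copy loop mirrors the pattern of setting each `vllm_engine_config` key onto the dataclass, with `model_path` supplying the model source.

```python
from dataclasses import dataclass, fields
from typing import Optional

# Stand-in for vLLM's AsyncEngineArgs, reduced to a few illustrative
# fields for this sketch; the real dataclass has many more.
@dataclass
class EngineArgs:
    model: str = ""
    max_num_seqs: int = 256
    max_model_len: Optional[int] = None
    tensor_parallel_size: int = 1

def build_engine_args(handler_config: dict) -> EngineArgs:
    """Sketch of the handler's pattern: use model_path as the model
    source, then copy vllm_engine_config keys onto the dataclass."""
    args = EngineArgs(model=handler_config["model_path"])
    valid = {f.name for f in fields(EngineArgs)}
    for key, value in handler_config.get("vllm_engine_config", {}).items():
        if key in valid:  # ignore keys the dataclass does not define
            setattr(args, key, value)
    return args

config = {
    "model_path": "model/my-llm",
    "vllm_engine_config": {"max_num_seqs": 16, "max_model_len": 250},
}
args = build_engine_args(config)
```

Because the target is a dataclass, unknown keys can be detected and rejected early instead of failing deep inside engine start-up.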

Usage

Creating a Model Configuration

Create a model-config.yaml file and include it when packaging the model archive:

torch-model-archiver --model-name my_llm \
    --version 1.0 \
    --handler vllm_handler \
    --config-file model-config.yaml \
    --archive-format no-archive

Code Reference

Source Location

File | Lines | Description
examples/large_models/vllm/llama3/model-config.yaml | L1-16 | Llama 3.1 8B configuration
examples/large_models/vllm/mistral/model-config.yaml | L1-16 | Mistral 7B with tensor parallelism
examples/large_models/vllm/lora/model-config.yaml | L1-23 | LoRA-enabled Llama 3.1 8B configuration
ts/torch_handler/vllm_handler.py | L163-182 | _get_vllm_engine_config() parses the YAML into AsyncEngineArgs

Signature

The configuration is consumed by the VLLMHandler's internal method:

def _get_vllm_engine_config(self, handler_config: dict):
    """
    Parse handler config dict into vLLM AsyncEngineArgs.

    Parameters:
        handler_config (dict): The 'handler' section of model-config.yaml.

    Returns:
        AsyncEngineArgs: Configured engine arguments for vLLM.
    """

Import

# The configuration is YAML-based and does not require Python imports.
# It is consumed internally by:
from vllm import AsyncEngineArgs
# within VLLMHandler._get_vllm_engine_config()

I/O Contract

Direction | Type | Description
Input | YAML file | model-config.yaml bundled in the model archive or referenced via --config-file
Output | Python object | AsyncEngineArgs dataclass populated with engine parameters
Precondition | File system | Model weights accessible at the path specified in handler.model_path
Postcondition | Engine state | vLLM AsyncLLMEngine initialized with the specified configuration
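The precondition rows above suggest a simple validation pass before engine start-up. The sketch below checks the structure of a parsed `handler` section (shown as a plain dict; in the handler it arrives via `ctx.model_yaml_config["handler"]`). The specific checks are illustrative assumptions, not part of the handler itself.

```python
def check_handler_config(cfg: dict) -> list:
    """Return a list of structural problems found in the 'handler'
    section of model-config.yaml; an empty list means it looks sane."""
    problems = []
    if not cfg.get("model_path"):
        problems.append("handler.model_path is required")
    engine_cfg = cfg.get("vllm_engine_config", {})
    if not isinstance(engine_cfg, dict):
        problems.append("vllm_engine_config must be a mapping")
    elif engine_cfg.get("tensor_parallel_size", 1) < 1:
        problems.append("tensor_parallel_size must be >= 1")
    return problems

handler_section = {
    "model_path": "model/my-llm",
    "vllm_engine_config": {"max_num_seqs": 16},
}
issues = check_handler_config(handler_section)
```

Note that `model_path` may be a local snapshot directory or a HuggingFace model ID, so a filesystem-existence check only applies in the former case.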

Key Parameters

Parameter | Type | Default | Description
handler.model_path | string | (required) | Path to model weights or HuggingFace model ID
handler.vllm_engine_config.max_num_seqs | int | 256 | Maximum concurrent sequences in the continuous batch
handler.vllm_engine_config.max_model_len | int | model default | Maximum context length in tokens
handler.vllm_engine_config.tensor_parallel_size | int | 1 | Number of GPUs for tensor parallelism
handler.vllm_engine_config.served_model_name | list[string] | [model name] | Model aliases for the OpenAI-compatible API
handler.vllm_engine_config.enable_lora | bool | false | Enable LoRA adapter support
handler.vllm_engine_config.max_loras | int | -- | Maximum number of simultaneously loaded LoRA adapters
handler.vllm_engine_config.max_cpu_loras | int | -- | Maximum number of LoRA adapters cached on CPU
handler.vllm_engine_config.max_lora_rank | int | -- | Maximum rank for LoRA adapters
handler.adapters | dict | {} | Mapping of adapter name to adapter weight path
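Keys omitted from `vllm_engine_config` fall back to the engine defaults listed above. A minimal sketch of that overlay behavior (using a subset of the defaults for illustration):

```python
# A subset of the documented defaults, for illustration only.
ENGINE_DEFAULTS = {
    "max_num_seqs": 256,
    "tensor_parallel_size": 1,
    "enable_lora": False,
}

def effective_engine_config(user_cfg: dict) -> dict:
    """Overlay user-supplied vllm_engine_config keys on the defaults,
    so only the keys actually present in the YAML are overridden."""
    merged = dict(ENGINE_DEFAULTS)
    merged.update(user_cfg)
    return merged

cfg = effective_engine_config({"max_num_seqs": 16, "max_model_len": 250})
# max_num_seqs is overridden; tensor_parallel_size keeps its default of 1
```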

Usage Examples

Example 1: Llama 3.1 8B Basic Configuration

From examples/large_models/vllm/llama3/model-config.yaml:

# TorchServe frontend parameters
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
startupTimeout: 1200
deviceType: "gpu"
asyncCommunication: true

handler:
    model_path: "model/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/8c22764a7e3675c50d4c7c9a4edb474456022b16"
    vllm_engine_config:
        max_num_seqs: 16
        max_model_len: 250
        served_model_name:
            - "meta-llama/Meta-Llama-3.1-8B"
            - "llama3-8b"

This configuration:

  • Runs a single worker (vLLM handles concurrency internally)
  • Limits the context window to 250 tokens (for testing; production would use higher values)
  • Allows up to 16 concurrent sequences in the continuous batch
  • Exposes the model under two aliases
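Either alias from `served_model_name` is what a client passes in the `model` field of an OpenAI-style request. The sketch below only builds the request body; the endpoint URL and exact route depend on your TorchServe deployment and are not shown here.

```python
import json

# OpenAI-style completion request body using one of the aliases
# registered under served_model_name in the config above.
payload = {
    "model": "llama3-8b",         # alias from served_model_name
    "prompt": "Hello, my name is",
    "max_tokens": 32,             # must fit within max_model_len (250)
}
body = json.dumps(payload)
```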

Example 2: Mistral 7B with Tensor Parallelism

From examples/large_models/vllm/mistral/model-config.yaml:

# TorchServe frontend parameters
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
startupTimeout: 1200
deviceType: "gpu"
asyncCommunication: true

handler:
    model_path: "model/models--mistralai--Mistral-7B-v0.1/snapshots/26bca36bde8333b5d7f72e9ed20ccda6a618af24"
    vllm_engine_config:
        max_model_len: 250
        max_num_seqs: 16
        tensor_parallel_size: 4
        served_model_name:
            - "mistral"

This configuration:

  • Shards the model across 4 GPUs using tensor parallelism
  • Places roughly one quarter of the model's weight matrices on each GPU
  • Requires a node with at least 4 NVIDIA GPUs and NCCL communication between them
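A back-of-envelope estimate of the per-GPU weight footprint under this configuration (parameter count and dtype are assumptions for illustration, not measured values):

```python
# Rough per-GPU weight footprint under tensor parallelism.
params = 7.3e9          # approximate Mistral 7B parameter count
bytes_per_param = 2     # fp16/bf16 weights
tp = 4                  # tensor_parallel_size from the config above

total_gb = params * bytes_per_param / 1024**3
per_gpu_gb = total_gb / tp
print(f"total ≈ {total_gb:.1f} GiB, per GPU ≈ {per_gpu_gb:.1f} GiB")
```

This covers weights only; KV cache, activations, and CUDA overhead add to the per-GPU requirement.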

Example 3: Llama 3.1 8B with LoRA Adapters

From examples/large_models/vllm/lora/model-config.yaml:

# TorchServe frontend parameters
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
startupTimeout: 1200
deviceType: "gpu"
asyncCommunication: true

handler:
    model_path: "model/models--meta-llama--Meta-Llama-3.1-8B/snapshots/48d6d0fc4e02fb1269b36940650a1b7233035cbb"
    vllm_engine_config:
        enable_lora: true
        max_loras: 4
        max_cpu_loras: 4
        max_lora_rank: 32
        max_num_seqs: 16
        max_model_len: 250
        served_model_name:
            - "meta-llama/Meta-Llama-3.1-8B"
            - "llama-8b-lora"

    adapters:
        adapter_1: "adapters/model/models--llama-duo--llama3.1-8b-summarize-gpt4o-128k/snapshots/4ba83353f24fa38946625c8cc49bf21c80a22825"

This configuration:

  • Enables LoRA support with up to 4 simultaneously loaded adapters
  • Sets maximum LoRA rank to 32
  • Registers one adapter (adapter_1) pointing to a summarization fine-tune
  • Shares the base model weights across adapters, so each adapter adds only minimal memory overhead
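To see why the adapter overhead is small, consider a rough estimate for rank 32. LoRA adds two low-rank factors per adapted matrix, contributing r × (d_in + d_out) parameters each. The shapes below are illustrative assumptions (hidden size 4096, 32 layers, four square attention projections per layer); Llama 3.1 8B's actual k/v projections are smaller due to grouped-query attention, so the real figure is lower still.

```python
# Back-of-envelope LoRA overhead at max_lora_rank: 32 (assumed shapes).
r = 32
hidden = 4096
layers = 32
adapted_matrices_per_layer = 4      # q/k/v/o projections, assumed square

params_per_matrix = r * (hidden + hidden)   # A: r x d_in, B: d_out x r
lora_params = layers * adapted_matrices_per_layer * params_per_matrix
lora_mb = lora_params * 2 / 1024**2         # fp16 bytes -> MiB
print(f"≈ {lora_params/1e6:.1f}M params, ≈ {lora_mb:.0f} MiB per adapter")
```

Tens of MiB per adapter against a multi-GiB base model is why four resident adapters (`max_loras: 4`) are cheap to keep loaded.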
