
Principle:Pytorch Serve vLLM Model Configuration

From Leeroopedia
Page Type: Principle
Domains: LLM_Serving, Configuration
Knowledge Sources: TorchServe
Workflow: LLM_Deployment_vLLM
Last Updated: 2026-02-13 00:00 GMT

Overview

Configuring LLM serving parameters through declarative YAML is a foundational pattern in TorchServe's vLLM integration. The model configuration file controls model selection, context length, tensor parallelism, continuous batching limits, LoRA adapter mounting, and served model naming. This declarative approach separates model serving behavior from application code, enabling operators to tune inference characteristics without modifying handler logic.

Description

Declarative Configuration Pattern

TorchServe uses a model-config.yaml file as the single source of truth for how a model is loaded, parallelized, and served. This file is bundled into the model archive (MAR) and read at model registration time. The YAML structure has two logical sections:

Frontend Parameters control TorchServe's Java frontend behavior:

  • minWorkers / maxWorkers -- the number of worker processes allocated to this model. For LLMs with vLLM, this is typically set to 1 because vLLM handles concurrency internally via continuous batching.
  • maxBatchDelay -- maximum time in milliseconds to wait for a batch to fill before dispatching. With vLLM's internal batching, this is a secondary concern.
  • startupTimeout -- time allowed for the model to load into GPU memory. Large models (7B+ parameters) require extended timeouts (1200+ seconds).
  • deviceType -- set to "gpu" for LLM workloads.
  • asyncCommunication -- must be true for vLLM, which uses asynchronous inference.
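As a sketch, the frontend section of a model-config.yaml might look like the following; the specific values (a 100 ms batch delay, a 1200-second startup timeout) are illustrative choices, not defaults:

```yaml
# TorchServe frontend parameters (values are illustrative)
minWorkers: 1            # one worker: vLLM handles concurrency internally
maxWorkers: 1
maxBatchDelay: 100       # ms; secondary when vLLM does continuous batching
startupTimeout: 1200     # seconds; large models need time to load into GPU memory
deviceType: "gpu"
asyncCommunication: true # required for vLLM's asynchronous inference
```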

Handler Parameters control the Python handler and vLLM engine:

  • handler.model_path -- the path to model weights, either a local directory or a HuggingFace model identifier.
  • handler.vllm_engine_config -- a nested dictionary that maps directly to vLLM's AsyncEngineArgs, controlling the engine's behavior.

Key Engine Configuration Parameters

The vllm_engine_config section supports all parameters accepted by vLLM's AsyncEngineArgs:

  • max_num_seqs -- maximum number of sequences that can be processed simultaneously. This controls the degree of continuous batching. Higher values increase throughput but require more GPU memory.
  • max_model_len -- the maximum context length (in tokens) the model will accept. Setting this lower than the model's native context window reduces memory usage proportionally.
  • tensor_parallel_size -- number of GPUs across which the model is sharded. For multi-GPU setups, this enables serving models that do not fit on a single GPU.
  • served_model_name -- a list of aliases under which the model is exposed via the OpenAI-compatible API.
  • enable_lora -- when true, activates LoRA adapter support in the vLLM engine, allowing multiple fine-tuned variants to share a single base model.
  • max_loras / max_cpu_loras / max_lora_rank -- LoRA-specific parameters controlling the number of simultaneously loaded adapters and their maximum rank.
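A sketch of the handler section combining these keys is shown below; the model identifier and all numeric values are illustrative, and any key accepted by AsyncEngineArgs can appear under vllm_engine_config:

```yaml
handler:
    model_path: "mistralai/Mistral-7B-Instruct-v0.2"  # local dir or HF identifier (hypothetical choice)
    vllm_engine_config:
        max_num_seqs: 16          # continuous-batching width
        max_model_len: 8192       # cap context to bound KV-cache memory
        tensor_parallel_size: 1   # single-GPU serving
        served_model_name:        # aliases exposed via the OpenAI-compatible API
            - "mistral-7b"
            - "default"
```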

LoRA Adapter Configuration

When LoRA is enabled, the handler section includes an adapters mapping that associates adapter names with paths to LoRA weight directories:

handler:
    adapters:
        adapter_1: "path/to/lora/weights"

This allows requests to specify which adapter to apply at inference time, enabling multi-tenant fine-tuned serving from a single base model.
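A fuller LoRA-enabled configuration, combining the engine flags with the adapters mapping, might look like this sketch (the base model, adapter names, paths, and limits are all hypothetical):

```yaml
handler:
    model_path: "meta-llama/Llama-2-7b-hf"  # hypothetical shared base model
    vllm_engine_config:
        enable_lora: true
        max_loras: 4        # adapters resident on GPU simultaneously
        max_cpu_loras: 8    # adapters cached in CPU memory
        max_lora_rank: 32   # highest adapter rank that will be accepted
    adapters:
        adapter_1: "adapters/tenant-a"  # hypothetical weight directories
        adapter_2: "adapters/tenant-b"
```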

Usage

Model configuration is authored during the model packaging phase and consumed at model registration time. The typical workflow is:

  1. Author a model-config.yaml with the desired parameters
  2. Package it into a model archive using torch-model-archiver (or use the no-archive format)
  3. Register the model with TorchServe via the management API or CLI
  4. TorchServe reads the configuration and passes it to VLLMHandler.initialize(), which constructs the vLLM engine accordingly

Operators tune the configuration based on:

  • Available GPU memory -- reducing max_model_len and max_num_seqs for smaller GPUs
  • Latency vs. throughput tradeoff -- higher max_num_seqs increases throughput at the cost of per-request latency
  • Multi-GPU topology -- setting tensor_parallel_size to match the number of available GPUs
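Putting these tuning levers together, a complete model-config.yaml for a hypothetical two-GPU deployment might read as follows (model name and values are illustrative):

```yaml
minWorkers: 1
maxWorkers: 1
startupTimeout: 1200
deviceType: "gpu"
asyncCommunication: true

handler:
    model_path: "meta-llama/Meta-Llama-3-8B-Instruct"  # hypothetical model
    vllm_engine_config:
        max_num_seqs: 32          # throughput vs. per-request latency lever
        max_model_len: 4096       # reduced from the native window to save memory
        tensor_parallel_size: 2   # matches the two available GPUs
        served_model_name:
            - "llama3-8b"
```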

Theoretical Basis

The declarative configuration pattern follows the separation of concerns principle from software engineering. By externalizing serving parameters into a YAML file:

  • Handler code remains generic -- the same VLLMHandler serves any model without code changes
  • Configuration is auditable -- YAML files can be version-controlled and diffed
  • Deployment is reproducible -- the same config file produces identical serving behavior across environments

The continuous batching concept controlled by max_num_seqs is a key innovation in LLM serving. Unlike static batching (where a fixed batch must be assembled before inference begins), continuous batching allows new requests to join an in-progress batch as tokens are generated. This dramatically improves GPU utilization for autoregressive text generation, where different requests complete at different times.

Tensor parallelism splits model layers across multiple GPUs. Each GPU holds a shard of the weight matrices and performs partial computations, communicating intermediate results via NCCL all-reduce operations. The tensor_parallel_size parameter cannot exceed the number of physically available GPUs, and the model's attention heads and hidden dimensions must be evenly divisible by this value.

LoRA (Low-Rank Adaptation) enables parameter-efficient fine-tuning by injecting small trainable matrices into transformer layers. At serving time, multiple LoRA adapters can be loaded alongside a shared base model, with each request specifying which adapter to apply. This enables multi-tenant serving, with additional memory overhead proportional only to the adapter rank.
