
Implementation:Pytorch Serve VLLMHandler

From Leeroopedia
Field Value
Page Type Implementation
Implementation Type API Doc
Domains LLM_Serving, Inference
Knowledge Sources TorchServe
Workflow LLM_Deployment_vLLM
Last Updated 2026-02-13 00:00 GMT

Overview

VLLMHandler is the TorchServe handler that integrates vLLM's asynchronous inference engine with TorchServe's model serving framework. It extends BaseHandler and provides async implementations of preprocess, inference, and postprocess methods. During initialization, it creates an AsyncLLMEngine and wraps it with OpenAIServingChat and OpenAIServingCompletion services to expose OpenAI-compatible API endpoints for both chat and text completion.

Description

The handler is the core bridge between TorchServe's request routing and vLLM's inference engine. It is configured via the handler section of model-config.yaml and is typically specified as vllm_handler in the model archive.

Key responsibilities:

  • Initialize the vLLM AsyncLLMEngine from YAML configuration
  • Create OpenAI-compatible serving endpoints (chat and completion)
  • Load LoRA adapters if configured
  • Route incoming requests to the correct service based on URL path
  • Stream partial results to clients via TorchServe's intermediate prediction response mechanism

Usage

The handler is not typically imported directly. It is specified in the model configuration:

# In model-config.yaml, the handler is set automatically when using
# the vllm engine. For manual archive creation:
handler:
    model_path: "meta-llama/Meta-Llama-3.1-8B-Instruct"
    vllm_engine_config:
        max_num_seqs: 16
        max_model_len: 4096
# When creating a model archive manually:
torch-model-archiver --model-name my_model \
    --handler vllm_handler \
    --config-file model-config.yaml \
    --archive-format no-archive

Code Reference

Source Location

File Lines Description
ts/torch_handler/vllm_handler.py L25-189 Full VLLMHandler class
ts/torch_handler/vllm_handler.py L26-37 __init__() -- instance variable initialization
ts/torch_handler/vllm_handler.py L39-91 initialize(ctx) -- engine and service creation
ts/torch_handler/vllm_handler.py L93-106 handle(data, context) -- async request dispatch
ts/torch_handler/vllm_handler.py L108-115 preprocess(requests, context) -- request extraction
ts/torch_handler/vllm_handler.py L117-158 inference(input_batch, context) -- API routing and execution
ts/torch_handler/vllm_handler.py L160-161 postprocess(inference_outputs) -- identity passthrough
ts/torch_handler/vllm_handler.py L163-182 _get_vllm_engine_config(handler_config) -- YAML to AsyncEngineArgs

Signature

class VLLMHandler(BaseHandler):
    def __init__(self):
        """
        Initialize instance variables.

        Attributes:
            vllm_engine (AsyncLLMEngine|None): The vLLM async engine instance.
            model_name (str|None): Name of the served model.
            model_dir (str|None): Directory containing model artifacts.
            lora_ids (dict): Mapping of LoRA adapter identifiers.
            adapters (dict|None): Adapter name-to-path mapping from config.
            chat_completion_service (OpenAIServingChat|None): Chat completion endpoint.
            completion_service (OpenAIServingCompletion|None): Text completion endpoint.
            raw_request (MagicMock|None): Mock HTTP request for vLLM service compatibility.
            initialized (bool): Whether the handler has been initialized.
        """

    def initialize(self, ctx):
        """
        Initialize the vLLM engine and OpenAI-compatible services.

        1. Reads handler config from ctx.model_yaml_config
        2. Creates AsyncEngineArgs from vllm_engine_config
        3. Sets VLLM_WORKER_MULTIPROC_METHOD="spawn"
        4. Creates AsyncLLMEngine from engine args
        5. Loads LoRA adapters from handler.adapters config
        6. Creates OpenAIServingCompletion and OpenAIServingChat services
        7. Creates mock raw_request for vLLM service interface

        Parameters:
            ctx: TorchServe context object with system_properties and model_yaml_config.
        """

    async def handle(self, data, context):
        """
        Main entry point for request processing.

        Orchestrates preprocess -> inference -> postprocess pipeline.
        Records HandlerTime metric.

        Parameters:
            data (list): List of request data dictionaries.
            context: TorchServe context with metrics and request metadata.

        Returns:
            list: Processed inference results.
        """

    async def preprocess(self, requests, context):
        """
        Extract request body from TorchServe request envelope.

        Expects batch_size=1 (vLLM handles internal batching).
        Extracts data from 'data' or 'body' key and decodes bytes to string.

        Parameters:
            requests (list): List of request dictionaries.
            context: TorchServe context.

        Returns:
            list: Single-element list containing the parsed request data.
        """

    async def inference(self, input_batch, context):
        """
        Route request to appropriate vLLM service and execute inference.

        Routes based on url_path request header:
        - "v1/models" -> show_available_models()
        - "v1/completions" -> CompletionRequest -> create_completion()
        - "v1/chat/completions" -> ChatCompletionRequest -> create_chat_completion()

        For streaming requests (stream=true), sends intermediate responses
        via send_intermediate_predict_response().

        Parameters:
            input_batch (list): Single-element list with parsed request data.
            context: TorchServe context with request headers.

        Returns:
            list: Single-element list with response dict or final stream chunk.

        Raises:
            PredictionException: If url_path does not match any known endpoint (404).
        """

    async def postprocess(self, inference_outputs):
        """
        Identity passthrough - returns inference outputs unchanged.

        Parameters:
            inference_outputs (list): Results from inference().

        Returns:
            list: Same as input.
        """

Import

# Direct import (rarely used; handler is configured via model-config.yaml):
from ts.torch_handler.vllm_handler import VLLMHandler

# External dependencies imported by the handler:
from vllm import AsyncEngineArgs, AsyncLLMEngine
from vllm.entrypoints.openai.protocol import (
    ChatCompletionRequest,
    CompletionRequest,
    ErrorResponse,
)
from vllm.entrypoints.openai.serving_chat import OpenAIServingChat
from vllm.entrypoints.openai.serving_completion import OpenAIServingCompletion
from vllm.entrypoints.openai.serving_engine import LoRAModulePath

# Internal TorchServe imports:
from ts.handler_utils.utils import send_intermediate_predict_response
from ts.service import PredictionException
from ts.torch_handler.base_handler import BaseHandler

I/O Contract

Direction Type Description
Input (handle) list[dict] List of request dicts, each with "data" or "body" key containing JSON string or bytes
Output (handle) list Single-element list with response dict (OpenAI format) or final stream chunk string
Input (initialize) Context TorchServe context with system_properties["model_dir"] and model_yaml_config
URL Routing Header url_path header determines endpoint: "v1/chat/completions", "v1/completions", or "v1/models"
Streaming SSE When stream=true in request, intermediate chunks sent via send_intermediate_predict_response()
Error PredictionException Raised with HTTP 404 for unknown API endpoints

Request Format (Chat Completions)

{
    "model": "meta-llama/Meta-Llama-3.1-8B",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is PagedAttention?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7,
    "stream": false
}

Request Format (Text Completions)

{
    "model": "meta-llama/Meta-Llama-3.1-8B",
    "prompt": "Explain continuous batching in LLM serving:",
    "max_tokens": 256,
    "temperature": 0.7
}

Response Format (Non-Streaming)

{
    "id": "cmpl-abc123",
    "object": "chat.completion",
    "created": 1700000000,
    "model": "meta-llama/Meta-Llama-3.1-8B",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "PagedAttention is a memory management technique..."
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 15,
        "completion_tokens": 100,
        "total_tokens": 115
    }
}
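A client can pull the generated text and token usage out of this response shape. The sketch below is a hypothetical client-side helper; the field names follow the OpenAI chat-completion format shown above.

```python
def parse_chat_response(response: dict):
    """Extract the assistant reply and total token count from a
    non-streaming OpenAI-format chat completion response."""
    choice = response["choices"][0]
    content = choice["message"]["content"]
    total_tokens = response.get("usage", {}).get("total_tokens", 0)
    return content, total_tokens

response = {
    "choices": [
        {"index": 0,
         "message": {"role": "assistant", "content": "PagedAttention is..."},
         "finish_reason": "stop"},
    ],
    "usage": {"prompt_tokens": 15, "completion_tokens": 100, "total_tokens": 115},
}
text, tokens = parse_chat_response(response)  # ("PagedAttention is...", 115)
```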

Usage Examples

Example 1: Initialization Flow

The initialize() method constructs the full inference stack:

# Actual code from vllm_handler.py L39-91
def initialize(self, ctx):
    self.model_dir = ctx.system_properties.get("model_dir")
    vllm_engine_config = self._get_vllm_engine_config(
        ctx.model_yaml_config.get("handler", {})
    )

    os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

    self.vllm_engine = AsyncLLMEngine.from_engine_args(vllm_engine_config)

    self.adapters = ctx.model_yaml_config.get("handler", {}).get("adapters", {})
    lora_modules = [LoRAModulePath(n, p) for n, p in self.adapters.items()]

    if vllm_engine_config.served_model_name:
        served_model_names = vllm_engine_config.served_model_name
    else:
        served_model_names = [vllm_engine_config.model]

    # ... creates OpenAIServingCompletion and OpenAIServingChat services
    self.initialized = True

Example 2: Request Routing in inference()

The inference method routes requests through a dispatch dictionary (named directory in the source):

# Actual code from vllm_handler.py L117-158
async def inference(self, input_batch, context):
    url_path = context.get_request_header(0, "url_path")

    if url_path == "v1/models":
        models = await self.chat_completion_service.show_available_models()
        return [models.model_dump()]

    directory = {
        "v1/completions": (
            CompletionRequest,
            self.completion_service,
            "create_completion",
        ),
        "v1/chat/completions": (
            ChatCompletionRequest,
            self.chat_completion_service,
            "create_chat_completion",
        ),
    }

    RequestType, service, func = directory.get(url_path, (None, None, None))

    if RequestType is None:
        raise PredictionException(f"Unknown API endpoint: {url_path}", 404)

    request = RequestType.model_validate(input_batch[0])
    g = await getattr(service, func)(request, self.raw_request)

    if isinstance(g, ErrorResponse):
        return [g.model_dump()]
    if request.stream:
        async for response in g:
            if response != "data: [DONE]\n\n":
                send_intermediate_predict_response(
                    [response], context.request_ids, "Result", 200, context
                )
        return [response]
    else:
        return [g.model_dump()]
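The dispatch-dictionary pattern above can be demonstrated in isolation. This standalone sketch replaces the vLLM serving objects with mock services and the PredictionException with a plain KeyError; everything here is illustrative scaffolding, not the handler's actual classes.

```python
class MockService:
    """Stand-in for the OpenAIServingChat/Completion objects (illustrative)."""
    def __init__(self, name):
        self.name = name
    def create_completion(self, request):
        return {"object": "text_completion", "handled_by": self.name}
    def create_chat_completion(self, request):
        return {"object": "chat.completion", "handled_by": self.name}

completion_service = MockService("completion")
chat_service = MockService("chat")

# Same dispatch-dictionary shape as the handler: url_path -> (service, method)
directory = {
    "v1/completions": (completion_service, "create_completion"),
    "v1/chat/completions": (chat_service, "create_chat_completion"),
}

def route(url_path, request):
    service, func = directory.get(url_path, (None, None))
    if service is None:
        # The real handler raises PredictionException(..., 404) here.
        raise KeyError(f"Unknown API endpoint: {url_path}")
    return getattr(service, func)(request)

result = route("v1/chat/completions", {})  # handled by the chat service
```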

Example 3: Sending a Chat Completion Request

# Non-streaming chat completion
curl -X POST http://localhost:8080/predictions/model/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B",
        "messages": [{"role": "user", "content": "Hello, how are you?"}],
        "max_tokens": 100
    }'

# Streaming chat completion
curl -X POST http://localhost:8080/predictions/model/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B",
        "messages": [{"role": "user", "content": "Write a short poem."}],
        "max_tokens": 200,
        "stream": true
    }'

# List available models
curl -X POST http://localhost:8080/predictions/model/v1/models
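Streaming responses arrive as SSE-style "data:" chunks, terminated by a "data: [DONE]" marker as shown in the inference() code above. A small client-side parser sketch for pulling the delta text out of each chunk (the chunk payload format is assumed to follow the OpenAI streaming convention):

```python
import json

def parse_stream_chunk(chunk: str):
    """Return the delta text from one SSE chunk, or None for
    non-content lines such as the terminal 'data: [DONE]' marker."""
    line = chunk.strip()
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):]
    if payload == "[DONE]":
        return None
    event = json.loads(payload)
    delta = event["choices"][0].get("delta", {})
    return delta.get("content")

chunk = 'data: {"choices": [{"index": 0, "delta": {"content": "Hello"}}]}\n\n'
parse_stream_chunk(chunk)               # "Hello"
parse_stream_chunk("data: [DONE]\n\n")  # None
```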
