
Implementation:Pytorch Serve VLLMHandler

From Leeroopedia
Field Value
Page Type Implementation
Implementation Type API Doc
Domains LLM_Serving, Inference
Knowledge Sources TorchServe
Workflow LLM_Deployment_vLLM
Last Updated 2026-02-13 00:00 GMT

Overview

VLLMHandler is the TorchServe handler that integrates vLLM's asynchronous inference engine with TorchServe's model serving framework. It extends BaseHandler and provides async implementations of preprocess, inference, and postprocess methods. During initialization, it creates an AsyncLLMEngine and wraps it with OpenAIServingChat and OpenAIServingCompletion services to expose OpenAI-compatible API endpoints for both chat and text completion.

Description

The handler is the core bridge between TorchServe's request routing and vLLM's inference engine. It is configured via the handler section of model-config.yaml and is typically specified as vllm_handler in the model archive.

Key responsibilities:

  • Initialize the vLLM AsyncLLMEngine from YAML configuration
  • Create OpenAI-compatible serving endpoints (chat and completion)
  • Load LoRA adapters if configured
  • Route incoming requests to the correct service based on URL path
  • Stream partial results to clients via TorchServe's intermediate prediction response mechanism

Usage

The handler is not typically imported directly. It is specified in the model configuration:

# In model-config.yaml, the handler is set automatically when using
# the vllm engine. For manual archive creation:
handler:
    model_path: "meta-llama/Meta-Llama-3.1-8B-Instruct"
    vllm_engine_config:
        max_num_seqs: 16
        max_model_len: 4096
# When creating a model archive manually:
torch-model-archiver --model-name my_model \
    --handler vllm_handler \
    --config-file model-config.yaml \
    --archive-format no-archive

Code Reference

Source Location

File Lines Description
ts/torch_handler/vllm_handler.py L25-189 Full VLLMHandler class
ts/torch_handler/vllm_handler.py L26-37 __init__() -- instance variable initialization
ts/torch_handler/vllm_handler.py L39-91 initialize(ctx) -- engine and service creation
ts/torch_handler/vllm_handler.py L93-106 handle(data, context) -- async request dispatch
ts/torch_handler/vllm_handler.py L108-115 preprocess(requests, context) -- request extraction
ts/torch_handler/vllm_handler.py L117-158 inference(input_batch, context) -- API routing and execution
ts/torch_handler/vllm_handler.py L160-161 postprocess(inference_outputs) -- identity passthrough
ts/torch_handler/vllm_handler.py L163-182 _get_vllm_engine_config(handler_config) -- YAML to AsyncEngineArgs

Signature

class VLLMHandler(BaseHandler):
    def __init__(self):
        """
        Initialize instance variables.

        Attributes:
            vllm_engine (AsyncLLMEngine|None): The vLLM async engine instance.
            model_name (str|None): Name of the served model.
            model_dir (str|None): Directory containing model artifacts.
            lora_ids (dict): Mapping of LoRA adapter identifiers.
            adapters (dict|None): Adapter name-to-path mapping from config.
            chat_completion_service (OpenAIServingChat|None): Chat completion endpoint.
            completion_service (OpenAIServingCompletion|None): Text completion endpoint.
            raw_request (MagicMock|None): Mock HTTP request for vLLM service compatibility.
            initialized (bool): Whether the handler has been initialized.
        """

    def initialize(self, ctx):
        """
        Initialize the vLLM engine and OpenAI-compatible services.

        1. Reads handler config from ctx.model_yaml_config
        2. Creates AsyncEngineArgs from vllm_engine_config
        3. Sets VLLM_WORKER_MULTIPROC_METHOD="spawn"
        4. Creates AsyncLLMEngine from engine args
        5. Loads LoRA adapters from handler.adapters config
        6. Creates OpenAIServingCompletion and OpenAIServingChat services
        7. Creates mock raw_request for vLLM service interface

        Parameters:
            ctx: TorchServe context object with system_properties and model_yaml_config.
        """

    async def handle(self, data, context):
        """
        Main entry point for request processing.

        Orchestrates preprocess -> inference -> postprocess pipeline.
        Records HandlerTime metric.

        Parameters:
            data (list): List of request data dictionaries.
            context: TorchServe context with metrics and request metadata.

        Returns:
            list: Processed inference results.
        """

    async def preprocess(self, requests, context):
        """
        Extract request body from TorchServe request envelope.

        Expects batch_size=1 (vLLM handles internal batching).
        Extracts data from 'data' or 'body' key and decodes bytes to string.

        Parameters:
            requests (list): List of request dictionaries.
            context: TorchServe context.

        Returns:
            list: Single-element list containing the parsed request data.
        """

    async def inference(self, input_batch, context):
        """
        Route request to appropriate vLLM service and execute inference.

        Routes based on url_path request header:
        - "v1/models" -> show_available_models()
        - "v1/completions" -> CompletionRequest -> create_completion()
        - "v1/chat/completions" -> ChatCompletionRequest -> create_chat_completion()

        For streaming requests (stream=true), sends intermediate responses
        via send_intermediate_predict_response().

        Parameters:
            input_batch (list): Single-element list with parsed request data.
            context: TorchServe context with request headers.

        Returns:
            list: Single-element list with response dict or final stream chunk.

        Raises:
            PredictionException: If url_path does not match any known endpoint (404).
        """

    async def postprocess(self, inference_outputs):
        """
        Identity passthrough - returns inference outputs unchanged.

        Parameters:
            inference_outputs (list): Results from inference().

        Returns:
            list: Same as input.
        """

Import

# Direct import (rarely used; handler is configured via model-config.yaml):
from ts.torch_handler.vllm_handler import VLLMHandler

# External dependencies imported by the handler:
from vllm import AsyncEngineArgs, AsyncLLMEngine
from vllm.entrypoints.openai.protocol import (
    ChatCompletionRequest,
    CompletionRequest,
    ErrorResponse,
)
from vllm.entrypoints.openai.serving_chat import OpenAIServingChat
from vllm.entrypoints.openai.serving_completion import OpenAIServingCompletion
from vllm.entrypoints.openai.serving_engine import LoRAModulePath

# Internal TorchServe imports:
from ts.handler_utils.utils import send_intermediate_predict_response
from ts.service import PredictionException
from ts.torch_handler.base_handler import BaseHandler

I/O Contract

Direction Type Description
Input (handle) list[dict] List of request dicts, each with "data" or "body" key containing JSON string or bytes
Output (handle) list Single-element list with response dict (OpenAI format) or final stream chunk string
Input (initialize) Context TorchServe context with system_properties["model_dir"] and model_yaml_config
URL Routing Header url_path header determines endpoint: "v1/chat/completions", "v1/completions", or "v1/models"
Streaming SSE When stream=true in request, intermediate chunks sent via send_intermediate_predict_response()
Error PredictionException Raised with HTTP 404 for unknown API endpoints

Request Format (Chat Completions)

{
    "model": "meta-llama/Meta-Llama-3.1-8B",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is PagedAttention?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7,
    "stream": false
}

Request Format (Text Completions)

{
    "model": "meta-llama/Meta-Llama-3.1-8B",
    "prompt": "Explain continuous batching in LLM serving:",
    "max_tokens": 256,
    "temperature": 0.7
}

Response Format (Non-Streaming)

{
    "id": "cmpl-abc123",
    "object": "chat.completion",
    "created": 1700000000,
    "model": "meta-llama/Meta-Llama-3.1-8B",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "PagedAttention is a memory management technique..."
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 15,
        "completion_tokens": 100,
        "total_tokens": 115
    }
}
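A client can pull the generated text and token usage out of this response shape. The sketch below is a hypothetical client-side helper; the field names follow the OpenAI chat-completion format shown above.

```python
def parse_chat_response(response: dict):
    """Extract the assistant reply and total token count from a
    non-streaming OpenAI-format chat completion response."""
    choice = response["choices"][0]
    content = choice["message"]["content"]
    total_tokens = response.get("usage", {}).get("total_tokens", 0)
    return content, total_tokens

response = {
    "choices": [
        {"index": 0,
         "message": {"role": "assistant", "content": "PagedAttention is..."},
         "finish_reason": "stop"},
    ],
    "usage": {"prompt_tokens": 15, "completion_tokens": 100, "total_tokens": 115},
}
text, tokens = parse_chat_response(response)  # ("PagedAttention is...", 115)
```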

Usage Examples

Example 1: Initialization Flow

The initialize() method constructs the full inference stack:

# Actual code from vllm_handler.py L39-91
def initialize(self, ctx):
    self.model_dir = ctx.system_properties.get("model_dir")
    vllm_engine_config = self._get_vllm_engine_config(
        ctx.model_yaml_config.get("handler", {})
    )

    os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

    self.vllm_engine = AsyncLLMEngine.from_engine_args(vllm_engine_config)

    self.adapters = ctx.model_yaml_config.get("handler", {}).get("adapters", {})
    lora_modules = [LoRAModulePath(n, p) for n, p in self.adapters.items()]

    if vllm_engine_config.served_model_name:
        served_model_names = vllm_engine_config.served_model_name
    else:
        served_model_names = [vllm_engine_config.model]

    # ... creates OpenAIServingCompletion and OpenAIServingChat services
    self.initialized = True

Example 2: Request Routing in inference()

The inference method routes requests through a dispatch dictionary (named directory in the source):

# Actual code from vllm_handler.py L117-158
async def inference(self, input_batch, context):
    url_path = context.get_request_header(0, "url_path")

    if url_path == "v1/models":
        models = await self.chat_completion_service.show_available_models()
        return [models.model_dump()]

    directory = {
        "v1/completions": (
            CompletionRequest,
            self.completion_service,
            "create_completion",
        ),
        "v1/chat/completions": (
            ChatCompletionRequest,
            self.chat_completion_service,
            "create_chat_completion",
        ),
    }

    RequestType, service, func = directory.get(url_path, (None, None, None))

    if RequestType is None:
        raise PredictionException(f"Unknown API endpoint: {url_path}", 404)

    request = RequestType.model_validate(input_batch[0])
    g = await getattr(service, func)(request, self.raw_request)

    if isinstance(g, ErrorResponse):
        return [g.model_dump()]
    if request.stream:
        async for response in g:
            if response != "data: [DONE]\n\n":
                send_intermediate_predict_response(
                    [response], context.request_ids, "Result", 200, context
                )
        return [response]
    else:
        return [g.model_dump()]
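The dispatch-dictionary pattern above can be demonstrated in isolation. This standalone sketch replaces the vLLM serving objects with mock services and the PredictionException with a plain KeyError; everything here is illustrative scaffolding, not the handler's actual classes.

```python
class MockService:
    """Stand-in for the OpenAIServingChat/Completion objects (illustrative)."""
    def __init__(self, name):
        self.name = name
    def create_completion(self, request):
        return {"object": "text_completion", "handled_by": self.name}
    def create_chat_completion(self, request):
        return {"object": "chat.completion", "handled_by": self.name}

completion_service = MockService("completion")
chat_service = MockService("chat")

# Same dispatch-dictionary shape as the handler: url_path -> (service, method)
directory = {
    "v1/completions": (completion_service, "create_completion"),
    "v1/chat/completions": (chat_service, "create_chat_completion"),
}

def route(url_path, request):
    service, func = directory.get(url_path, (None, None))
    if service is None:
        # The real handler raises PredictionException(..., 404) here.
        raise KeyError(f"Unknown API endpoint: {url_path}")
    return getattr(service, func)(request)

result = route("v1/chat/completions", {})  # handled by the chat service
```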

Example 3: Sending a Chat Completion Request

# Non-streaming chat completion
curl -X POST http://localhost:8080/predictions/model/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B",
        "messages": [{"role": "user", "content": "Hello, how are you?"}],
        "max_tokens": 100
    }'

# Streaming chat completion
curl -X POST http://localhost:8080/predictions/model/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B",
        "messages": [{"role": "user", "content": "Write a short poem."}],
        "max_tokens": 200,
        "stream": true
    }'

# List available models
curl -X POST http://localhost:8080/predictions/model/v1/models
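Streaming responses arrive as SSE-style "data:" chunks, terminated by a "data: [DONE]" marker as shown in the inference() code above. A small client-side parser sketch for pulling the delta text out of each chunk (the chunk payload format is assumed to follow the OpenAI streaming convention):

```python
import json

def parse_stream_chunk(chunk: str):
    """Return the delta text from one SSE chunk, or None for
    non-content lines such as the terminal 'data: [DONE]' marker."""
    line = chunk.strip()
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):]
    if payload == "[DONE]":
        return None
    event = json.loads(payload)
    delta = event["choices"][0].get("delta", {})
    return delta.get("content")

chunk = 'data: {"choices": [{"index": 0, "delta": {"content": "Hello"}}]}\n\n'
parse_stream_chunk(chunk)               # "Hello"
parse_stream_chunk("data: [DONE]\n\n")  # None
```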
