
Principle:Pytorch Serve vLLM Inference

From Leeroopedia
Page Type: Principle
Domains: LLM_Serving, Inference
Knowledge Sources: TorchServe
Workflow: LLM_Deployment_vLLM
Last Updated: 2026-02-13 00:00 GMT

Overview

High-throughput LLM inference with continuous batching uses vLLM's asynchronous engine for parallel request processing, OpenAI-compatible API endpoints, and streaming responses. The integration between TorchServe and vLLM bridges TorchServe's model management capabilities (lifecycle, scaling, monitoring) with vLLM's optimized inference engine (PagedAttention, continuous batching, tensor parallelism), exposing a unified serving interface that supports both standard TorchServe prediction endpoints and OpenAI-compatible chat/completion APIs.

Description

Asynchronous Inference Pipeline

The vLLM inference pipeline within TorchServe is fully asynchronous, using Python's asyncio to handle concurrent requests without blocking. The pipeline follows the standard TorchServe handler pattern (preprocess, inference, postprocess) but with async implementations:

1. Preprocess -- extracts the request body from TorchServe's request envelope. The handler expects a batch size of 1 at the TorchServe level because vLLM manages its own internal batching via continuous batching. The raw request data (JSON) is decoded and passed through.

2. Inference -- routes the request to the appropriate vLLM service based on the URL path:

  • v1/chat/completions routes to OpenAIServingChat.create_chat_completion()
  • v1/completions routes to OpenAIServingCompletion.create_completion()
  • v1/models returns the list of available models

3. Postprocess -- passes inference outputs through unchanged (identity function), as vLLM's OpenAI-compatible services already produce correctly formatted responses.
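The three async phases above can be sketched as follows. This is a minimal toy, not TorchServe's actual `VLLMHandler`: the class name, the `fake_engine` stand-in, and the request shape are all illustrative assumptions.

```python
import asyncio
import json

async def fake_engine(prompt: str) -> dict:
    """Stand-in for a vLLM OpenAI-serving call; returns an OpenAI-style body."""
    await asyncio.sleep(0)  # yield control, as a real async engine call would
    return {"choices": [{"message": {"role": "assistant", "content": f"echo: {prompt}"}}]}

class MiniVLLMHandler:
    async def preprocess(self, requests: list) -> dict:
        # TorchServe-level batch size is 1; vLLM batches internally.
        assert len(requests) == 1
        return json.loads(requests[0]["body"])

    async def inference(self, data: dict) -> dict:
        return await fake_engine(data["prompt"])

    async def postprocess(self, output: dict) -> list:
        # Identity pass-through: output is already OpenAI-formatted.
        return [output]

    async def handle(self, requests: list) -> list:
        return await self.postprocess(await self.inference(await self.preprocess(requests)))

result = asyncio.run(MiniVLLMHandler().handle([{"body": json.dumps({"prompt": "hi"})}]))
```

The key property is that every phase is awaitable, so a single worker can interleave many in-flight requests instead of blocking on each one.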

Continuous Batching

Unlike static batching, where requests are grouped into fixed-size batches before inference begins, continuous batching (also called iteration-level batching) allows new requests to enter the processing pipeline at each token generation step. This is critical for LLM serving because:

  • Autoregressive generation produces tokens one at a time, and different requests finish at different times
  • Static batching would force all requests in a batch to wait for the longest sequence, wasting GPU cycles
  • Continuous batching fills the freed slots immediately with waiting requests, maintaining high GPU utilization

The max_num_seqs parameter controls the maximum number of sequences that can be in-flight simultaneously. vLLM's scheduler manages the admission and preemption of sequences based on available KV-cache memory.
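A toy scheduler makes the slot-refilling behavior concrete. This sketch only models the `max_num_seqs` admission limit; vLLM's real scheduler additionally accounts for KV-cache memory and preemption.

```python
from collections import deque

def continuous_batching(requests: dict, max_num_seqs: int) -> list:
    """requests maps request-id -> tokens left to generate.
    Returns the sorted set of in-flight request ids at each generation step."""
    waiting = deque(requests)
    running = {}
    trace = []
    while waiting or running:
        # Admit new sequences into freed slots (iteration-level batching).
        while waiting and len(running) < max_num_seqs:
            rid = waiting.popleft()
            running[rid] = requests[rid]
        trace.append(sorted(running))
        # One forward pass generates one token for every running sequence.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]  # finished; its slot is free next step
    return trace

steps = continuous_batching({"a": 1, "b": 3, "c": 2}, max_num_seqs=2)
```

Here `"a"` finishes after one step and `"c"` is admitted into its slot on the very next iteration, while `"b"` keeps generating; under static batching, `"c"` would have waited for the whole first batch to drain.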

OpenAI-Compatible API

The vLLM handler exposes an OpenAI-compatible API through TorchServe's prediction endpoint. Requests are routed based on the URL path suffix:

  • /predictions/{model_name}/v1/chat/completions -- Chat Completion API (messages-based)
  • /predictions/{model_name}/v1/completions -- Completion API (prompt-based)
  • /predictions/{model_name}/v1/models -- List available models

This compatibility allows existing OpenAI client libraries and applications to target TorchServe with minimal code changes, requiring only a base URL update.
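The "only the base URL changes" point can be sketched as below. The host, port, and model name are placeholders for a real deployment, and no request is actually sent here.

```python
import json

host, model_name = "http://localhost:8080", "llama"
base_url = f"{host}/predictions/{model_name}/v1"

# With the official openai client this would look like, e.g.:
#   client = OpenAI(base_url=base_url, api_key="unused")
#   client.chat.completions.create(model=model_name, messages=[...])
# which POSTs to {base_url}/chat/completions:

url = f"{base_url}/chat/completions"
body = json.dumps({
    "model": model_name,
    "messages": [{"role": "user", "content": "Hello"}],
})
```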

Streaming Responses

When "stream": true is set in the request, the handler uses TorchServe's send_intermediate_predict_response() to emit Server-Sent Events (SSE) as tokens are generated. Each chunk is sent to the client as it becomes available, rather than waiting for the full response to complete. This provides:

  • Lower time-to-first-token (TTFT) -- the client receives the first token as soon as it is generated
  • Progressive rendering -- chat interfaces can display tokens as they arrive
  • Connection efficiency -- long-running generations maintain a single HTTP connection
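A server-side view of the streaming path can be sketched with a toy generator: each token is wrapped in an OpenAI-style chunk and emitted as a Server-Sent Event, closed by the [DONE] sentinel. In the real handler, each yielded string would go through send_intermediate_predict_response() rather than being collected in a list.

```python
import json

def sse_stream(tokens):
    """Wrap each generated token in an OpenAI-style chunk as an SSE event."""
    for tok in tokens:
        chunk = {"choices": [{"index": 0, "delta": {"content": tok}}]}
        yield f"data: {json.dumps(chunk)}\n\n"
    yield "data: [DONE]\n\n"

events = list(sse_stream(["Hel", "lo"]))
```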

LoRA Adapter Routing

When LoRA is enabled, the vLLM engine can dynamically apply different fine-tuned adapters per request. The adapters are loaded during initialization based on the handler.adapters configuration. At inference time, the request can specify which adapter to use via the model name field, enabling multi-tenant serving from a single base model instance.
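A minimal sketch of the per-request adapter lookup, assuming a registry keyed by the request's model name; the adapter names and paths are invented for illustration.

```python
from typing import Optional

BASE_MODEL = "base"
ADAPTERS = {"support-bot": "adapters/support", "legal-bot": "adapters/legal"}

def resolve_adapter(requested_model: str) -> Optional[str]:
    """Map the request's model field to a LoRA adapter, or None for the base model."""
    if requested_model == BASE_MODEL:
        return None
    if requested_model not in ADAPTERS:
        raise ValueError(f"unknown model: {requested_model}")
    return ADAPTERS[requested_model]

adapter = resolve_adapter("support-bot")
```

Because the adapter is chosen per request, one GPU-resident base model can serve several fine-tuned "models" concurrently.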

Usage

The vLLM inference pipeline is used whenever a request arrives at a TorchServe endpoint that has been configured with the vLLM handler. The typical request flow is:

  1. Client sends an HTTP POST to /predictions/{model_name}/v1/chat/completions
  2. TorchServe's Java frontend routes the request to a Python worker running VLLMHandler
  3. The handler's async handle() method processes the request through preprocess, inference, and postprocess
  4. For non-streaming requests, the complete response is returned as JSON
  5. For streaming requests, intermediate chunks are sent via SSE, followed by a final [DONE] sentinel
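The client side of step 5 can be sketched as follows: read SSE lines, accumulate token deltas, and stop at the [DONE] sentinel. The hard-coded lines stand in for an HTTP response body streamed from TorchServe.

```python
import json

raw_lines = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]

def collect_stream(lines) -> str:
    """Accumulate streamed token deltas until the [DONE] sentinel."""
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip SSE comments / keep-alives
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        text.append(chunk["choices"][0]["delta"].get("content", ""))
    return "".join(text)

message = collect_stream(raw_lines)
```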

Theoretical Basis

PagedAttention

vLLM's core innovation is PagedAttention, which manages the key-value (KV) cache using a paging mechanism inspired by operating system virtual memory. Instead of pre-allocating a contiguous block of GPU memory for each sequence's KV cache, PagedAttention:

  • Divides the KV cache into fixed-size pages (blocks)
  • Allocates pages on demand as the sequence grows
  • Allows non-contiguous storage of a single sequence's KV cache
  • Enables efficient memory sharing between sequences (e.g., for beam search or shared prefixes)

This approach reduces GPU memory waste from internal fragmentation by up to 90%, enabling higher batch sizes and throughput compared to naive KV cache allocation.
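The fragmentation arithmetic can be made concrete with toy numbers (block size, sequence lengths, and max length below are illustrative). Naive allocation reserves `max_len` slots per sequence up front; paging allocates only `ceil(n / block_size)` blocks for a sequence of length `n`.

```python
import math

def blocks_needed(seq_len: int, block_size: int = 16) -> int:
    """Blocks a sequence of seq_len tokens occupies under paged allocation."""
    return math.ceil(seq_len / block_size)

max_len, block_size = 2048, 16
seq_lens = [100, 37, 512]

preallocated = len(seq_lens) * max_len  # KV slots reserved by naive contiguous allocation
paged = sum(blocks_needed(n, block_size) * block_size for n in seq_lens)
waste_saved = 1 - paged / preallocated  # fraction of reserved memory freed by paging
```

For these example lengths, paging uses 672 of the 6144 naively reserved slots, freeing roughly 89% of the reservation for additional sequences.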

Async Engine Architecture

The AsyncLLMEngine decouples request submission from result collection. Requests are submitted to the engine, which schedules them across available GPU resources. The engine runs a continuous loop that:

  1. Selects which sequences to process in the next iteration (scheduling)
  2. Executes a forward pass for all selected sequences simultaneously
  3. Updates the KV cache and token positions
  4. Yields completed tokens for sequences that have finished a generation step

This architecture allows the engine to maximize GPU utilization by always having work available, even as individual sequences complete and new ones arrive.
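The decoupling of submission from collection can be sketched with a toy engine: callers submit requests and await their own results, while a single loop repeatedly batches whatever is pending. This is loosely modeled on the AsyncLLMEngine idea; every name here is illustrative, and the "forward pass" is just `str.upper()`.

```python
import asyncio

class ToyAsyncEngine:
    def __init__(self):
        self.pending = {}  # request_id -> (prompt, future)

    async def generate(self, request_id: str, prompt: str) -> str:
        fut = asyncio.get_running_loop().create_future()
        self.pending[request_id] = (prompt, fut)  # submit; the caller just waits
        return await fut

    async def step_loop(self):
        while True:
            await asyncio.sleep(0)  # yield so callers can submit
            batch = dict(self.pending)  # 1. schedule all pending sequences
            self.pending.clear()
            # 2-3. one "forward pass" for the whole batch, then
            # 4. hand results back to the waiting callers
            for _rid, (prompt, fut) in batch.items():
                fut.set_result(prompt.upper())

async def main():
    engine = ToyAsyncEngine()
    loop_task = asyncio.create_task(engine.step_loop())
    results = await asyncio.gather(
        engine.generate("r1", "hello"), engine.generate("r2", "world")
    )
    loop_task.cancel()
    return results

results = asyncio.run(main())
```

Both callers resume independently once the shared loop has processed their batch, which is the essential property the real engine scales up to thousands of concurrent sequences.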

Request Routing Pattern

The handler uses a dictionary-based dispatch to route requests to the appropriate service based on the URL path. This is a form of the strategy pattern, where the URL path selects the processing strategy (chat completion vs. text completion). The pattern is extensible -- new endpoints can be added by extending the dispatch dictionary.
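A minimal sketch of such a dispatch dictionary, with plain functions standing in for vLLM's OpenAI-serving entry points (the function bodies here are illustrative):

```python
def create_chat_completion(body: dict) -> dict:
    return {"object": "chat.completion"}

def create_completion(body: dict) -> dict:
    return {"object": "text_completion"}

def list_models(body: dict) -> dict:
    return {"object": "list", "data": ["llama"]}

# The URL path suffix selects the processing strategy.
ROUTES = {
    "v1/chat/completions": create_chat_completion,
    "v1/completions": create_completion,
    "v1/models": list_models,
}

def route(path_suffix: str, body: dict) -> dict:
    handler = ROUTES.get(path_suffix)
    if handler is None:
        raise ValueError(f"unsupported path: {path_suffix}")
    return handler(body)

resp = route("v1/chat/completions", {})
```

Adding an endpoint is a one-line change to `ROUTES`, which is what makes the pattern extensible.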
