Workflow: TorchServe LLM Deployment with vLLM
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Model_Serving, Inference |
| Last Updated | 2026-02-13 18:00 GMT |
Overview
End-to-end process for deploying large language models with a single command, using TorchServe's vLLM integration to achieve high-throughput inference via PagedAttention and continuous batching.
Description
This workflow enables rapid LLM deployment by leveraging TorchServe's built-in llm_launcher module and vLLM engine integration. It supports any HuggingFace-hosted model compatible with vLLM, providing continuous batching, PagedAttention for efficient memory management, and optional tensor parallelism across multiple GPUs. The workflow can run both natively and inside a Docker container, with an OpenAI-compatible API endpoint.
Usage
Execute this workflow when you need to quickly deploy a large language model (e.g., Llama-3, Mistral) for text generation with minimal configuration. This is ideal for teams that need production-ready LLM serving without writing custom handlers or manually managing model archives.
Execution Steps
Step 1: Install TorchServe and vLLM
Install TorchServe and its companion tools (torch-model-archiver, torch-workflow-archiver) along with the vLLM package. For GPU support, ensure CUDA drivers and the appropriate PyTorch CUDA build are installed. Authenticate with HuggingFace Hub if deploying gated models.
Key considerations:
- vLLM requires GPU hardware with sufficient VRAM for the target model
- Use huggingface-cli login for gated models like Llama-3
- For Docker deployment, build the vLLM-specific Dockerfile provided in the repository
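The installation steps above might look like the following shell session. Package names follow PyPI; the Docker build command is an assumption based on the TorchServe repository layout, so verify the Dockerfile path before using it.

```shell
# Install TorchServe, its archiver tools, and the vLLM engine
pip install torchserve torch-model-archiver torch-workflow-archiver
pip install vllm

# Authenticate with HuggingFace Hub for gated models such as Llama-3
huggingface-cli login

# Optional: build the vLLM-specific Docker image from the TorchServe repo
# (path is illustrative -- check the repository for the exact Dockerfile location)
# docker build . -f docker/Dockerfile.vllm -t ts/vllm
```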
Step 2: Select and Configure the Model
Choose a HuggingFace model identifier supported by vLLM. Configure optional parameters such as tensor parallel size for multi-GPU deployment, maximum model length, and quantization settings. These can be specified via command-line arguments or a model-config.yaml file.
Key considerations:
- Supported models include Llama, Mistral, GPT-NeoX, Falcon, and others listed in vLLM documentation
- Set CUDA_VISIBLE_DEVICES to control which GPUs are used
- For distributed inference, set tensor_parallel_size to the number of GPUs
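As a sketch, a model-config.yaml for a multi-GPU deployment might look like the following. The key names are modeled on TorchServe's vLLM example; treat them as assumptions and check them against the TorchServe version you run.

```yaml
# TorchServe frontend settings -- one worker per vLLM engine
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100

handler:
    # Passed through to the vLLM engine (key names assumed from the
    # TorchServe vLLM example; verify against your installed version)
    vllm_engine_config:
        max_num_seqs: 16
        max_model_len: 4096
        tensor_parallel_size: 4
```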
Step 3: Launch TorchServe with LLM Launcher
Execute the llm_launcher module which automatically handles model archive creation, TorchServe server startup, and model registration in a single step. The launcher configures the vLLM engine backend, sets up continuous batching, and exposes the inference endpoint.
Pseudocode:
```shell
# Native launch
python -m ts.llm_launcher --model_id <hf_model_id> --disable_token_auth

# Docker launch
docker run --gpus all -p 8080:8080 ts/vllm --model_id <hf_model_id>
```
Step 4: Run Inference
Send text generation requests to the TorchServe inference endpoint. The vLLM engine handles continuous batching, automatically managing multiple concurrent requests. Requests support vLLM SamplingParams for controlling generation behavior (temperature, top_p, max_tokens, etc.).
Key considerations:
- The endpoint supports both TorchServe's native format and OpenAI-compatible completion format
- Streaming responses are available via HTTP chunked encoding and gRPC server-side streaming
- New requests are added to the engine continuously without waiting for current batch completion
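The request flow above can be sketched in Python using only the standard library. The `/predictions/<model_name>` path follows TorchServe's usual inference endpoint convention, and the generation fields mirror vLLM SamplingParams names; both are assumptions to verify against your deployment.

```python
import json
import urllib.request


def build_request(prompt: str, max_tokens: int = 128,
                  temperature: float = 0.7, top_p: float = 0.9) -> dict:
    """Build a generation payload; field names mirror vLLM SamplingParams
    (assumed -- check the payload format your TorchServe version expects)."""
    return {
        "prompt": prompt,
        "max_new_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }


def send_request(payload: dict, model_name: str = "model",
                 host: str = "http://localhost:8080") -> str:
    """POST the payload to TorchServe's inference endpoint and return the body."""
    req = urllib.request.Request(
        f"{host}/predictions/{model_name}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    payload = build_request("Write a haiku about paged attention.")
    print(payload)  # send_request(payload) would POST it to a running server
```

Because the engine batches continuously, many such requests can be sent concurrently and each is scheduled as soon as it arrives.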
Step 5: Monitor and Scale
Monitor inference throughput and latency via TorchServe's metrics API (port 8082) which exports Prometheus-compatible metrics. Scale horizontally by deploying additional TorchServe instances behind a load balancer, or scale vertically by increasing tensor parallelism across more GPUs.
Key considerations:
- Use a single worker per vLLM engine for optimal hardware utilization
- Tensor parallelism distributes the model across GPUs within a single worker
- The job ticket feature enables latency-sensitive routing by rejecting requests when all workers are busy
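To act on the metrics described above, a scraper can fetch the Prometheus-format text from port 8082 and parse each sample line. The metric name in the example is illustrative, not a confirmed TorchServe metric; the parsing logic itself is generic (and simplified: it does not handle label values containing commas).

```python
import re

# One Prometheus text-format sample line looks like:
#   metric_name{label="value",...} 123.4
SAMPLE_RE = re.compile(r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
                       r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)')


def parse_metric_line(line: str):
    """Parse one Prometheus sample line into (name, labels, value).

    Returns None for comment lines (# HELP / # TYPE) and non-matching text.
    """
    m = SAMPLE_RE.match(line.strip())
    if not m:
        return None
    labels = {}
    if m.group("labels"):
        for pair in m.group("labels").split(","):
            key, _, raw = pair.partition("=")
            labels[key.strip()] = raw.strip().strip('"')
    return m.group("name"), labels, float(m.group("value"))


if __name__ == "__main__":
    # In practice the text comes from GET http://localhost:8082/metrics
    line = 'ts_inference_latency_microseconds{model_name="llama"} 1532.7'
    print(parse_metric_line(line))
```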