Workflow: TorchServe LLM Deployment with vLLM
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Model_Serving, Inference |
| Last Updated | 2026-02-13 18:00 GMT |
Overview
End-to-end process for deploying large language models with a single command, using TorchServe's vLLM integration to achieve high-throughput inference via PagedAttention and continuous batching.
Description
This workflow enables rapid LLM deployment by leveraging TorchServe's built-in llm_launcher module and vLLM engine integration. It supports any HuggingFace-hosted model compatible with vLLM, providing continuous batching, PagedAttention for efficient memory management, and optional tensor parallelism across multiple GPUs. The workflow can run both natively and inside a Docker container, with an OpenAI-compatible API endpoint.
Usage
Execute this workflow when you need to quickly deploy a large language model (e.g., Llama-3, Mistral) for text generation with minimal configuration. This is ideal for teams that need production-ready LLM serving without writing custom handlers or manually managing model archives.
Execution Steps
Step 1: Install TorchServe and vLLM
Install TorchServe and its companion tools (torch-model-archiver, torch-workflow-archiver) along with the vLLM package. For GPU support, ensure CUDA drivers and the appropriate PyTorch CUDA build are installed. Authenticate with HuggingFace Hub if deploying gated models.
Key considerations:
- vLLM requires GPU hardware with sufficient VRAM for the target model
- Use huggingface-cli login for gated models like Llama-3
- For Docker deployment, build the vLLM-specific Dockerfile provided in the repository
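The installation steps above might look like the following shell session. Package names follow PyPI; the Docker build command is an assumption based on the TorchServe repository layout, so verify the Dockerfile path before using it.

```shell
# Install TorchServe, its archiver tools, and the vLLM engine
pip install torchserve torch-model-archiver torch-workflow-archiver
pip install vllm

# Authenticate with HuggingFace Hub for gated models such as Llama-3
huggingface-cli login

# Optional: build the vLLM-specific Docker image from the TorchServe repo
# (path is illustrative -- check the repository for the exact Dockerfile location)
# docker build . -f docker/Dockerfile.vllm -t ts/vllm
```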
Step 2: Select and Configure the Model
Choose a HuggingFace model identifier supported by vLLM. Configure optional parameters such as tensor parallel size for multi-GPU deployment, maximum model length, and quantization settings. These can be specified via command-line arguments or a model-config.yaml file.
Key considerations:
- Supported models include Llama, Mistral, GPT-NeoX, Falcon, and others listed in vLLM documentation
- Set CUDA_VISIBLE_DEVICES to control which GPUs are used
- For distributed inference, set tensor_parallel_size to the number of GPUs
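As a sketch, a model-config.yaml for a multi-GPU deployment might look like the following. The key names are modeled on TorchServe's vLLM example; treat them as assumptions and check them against the TorchServe version you run.

```yaml
# TorchServe frontend settings -- one worker per vLLM engine
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100

handler:
    # Passed through to the vLLM engine (key names assumed from the
    # TorchServe vLLM example; verify against your installed version)
    vllm_engine_config:
        max_num_seqs: 16
        max_model_len: 4096
        tensor_parallel_size: 4
```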
Step 3: Launch TorchServe with LLM Launcher
Execute the llm_launcher module which automatically handles model archive creation, TorchServe server startup, and model registration in a single step. The launcher configures the vLLM engine backend, sets up continuous batching, and exposes the inference endpoint.
Pseudocode:
```shell
# Native launch
python -m ts.llm_launcher --model_id <hf_model_id> --disable_token_auth

# Docker launch
docker run --gpus all -p 8080:8080 ts/vllm --model_id <hf_model_id>
```
Step 4: Run Inference
Send text generation requests to the TorchServe inference endpoint. The vLLM engine handles continuous batching, automatically managing multiple concurrent requests. Requests support vLLM SamplingParams for controlling generation behavior (temperature, top_p, max_tokens, etc.).
Key considerations:
- The endpoint supports both TorchServe's native format and OpenAI-compatible completion format
- Streaming responses are available via HTTP chunked encoding and gRPC server-side streaming
- New requests are added to the engine continuously without waiting for current batch completion
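The request flow above can be sketched in Python using only the standard library. The `/predictions/<model_name>` path follows TorchServe's usual inference endpoint convention, and the generation fields mirror vLLM SamplingParams names; both are assumptions to verify against your deployment.

```python
import json
import urllib.request


def build_request(prompt: str, max_tokens: int = 128,
                  temperature: float = 0.7, top_p: float = 0.9) -> dict:
    """Build a generation payload; field names mirror vLLM SamplingParams
    (assumed -- check the payload format your TorchServe version expects)."""
    return {
        "prompt": prompt,
        "max_new_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }


def send_request(payload: dict, model_name: str = "model",
                 host: str = "http://localhost:8080") -> str:
    """POST the payload to TorchServe's inference endpoint and return the body."""
    req = urllib.request.Request(
        f"{host}/predictions/{model_name}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    payload = build_request("Write a haiku about paged attention.")
    print(payload)  # send_request(payload) would POST it to a running server
```

Because the engine batches continuously, many such requests can be sent concurrently and each is scheduled as soon as it arrives.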
Step 5: Monitor and Scale
Monitor inference throughput and latency via TorchServe's metrics API (port 8082) which exports Prometheus-compatible metrics. Scale horizontally by deploying additional TorchServe instances behind a load balancer, or scale vertically by increasing tensor parallelism across more GPUs.
Key considerations:
- Use a single worker per vLLM engine for optimal hardware utilization
- Tensor parallelism distributes the model across GPUs within a single worker
- The job ticket feature enables latency-sensitive routing by rejecting requests when all workers are busy
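To act on the metrics described above, a scraper can fetch the Prometheus-format text from port 8082 and parse each sample line. The metric name in the example is illustrative, not a confirmed TorchServe metric; the parsing logic itself is generic (and simplified: it does not handle label values containing commas).

```python
import re

# One Prometheus text-format sample line looks like:
#   metric_name{label="value",...} 123.4
SAMPLE_RE = re.compile(r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
                       r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)')


def parse_metric_line(line: str):
    """Parse one Prometheus sample line into (name, labels, value).

    Returns None for comment lines (# HELP / # TYPE) and non-matching text.
    """
    m = SAMPLE_RE.match(line.strip())
    if not m:
        return None
    labels = {}
    if m.group("labels"):
        for pair in m.group("labels").split(","):
            key, _, raw = pair.partition("=")
            labels[key.strip()] = raw.strip().strip('"')
    return m.group("name"), labels, float(m.group("value"))


if __name__ == "__main__":
    # In practice the text comes from GET http://localhost:8082/metrics
    line = 'ts_inference_latency_microseconds{model_name="llama"} 1532.7'
    print(parse_metric_line(line))
```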