
Workflow:Pytorch Serve LLM Deployment vLLM

From Leeroopedia
Knowledge Sources
Domains LLMs, Model_Serving, Inference
Last Updated 2026-02-13 18:00 GMT

Overview

End-to-end process for deploying large language models using TorchServe's vLLM integration with a single command, achieving high-throughput inference via PagedAttention and continuous batching.

Description

This workflow enables rapid LLM deployment by leveraging TorchServe's built-in llm_launcher module and vLLM engine integration. It supports any HuggingFace-hosted model compatible with vLLM, providing continuous batching, PagedAttention for efficient memory management, and optional tensor parallelism across multiple GPUs. The workflow can run either natively or inside a Docker container, and exposes an OpenAI-compatible API endpoint.

Usage

Execute this workflow when you need to quickly deploy a large language model (e.g., Llama-3, Mistral) for text generation with minimal configuration. This is ideal for teams that need production-ready LLM serving without writing custom handlers or manually managing model archives.

Execution Steps

Step 1: Install TorchServe and vLLM

Install TorchServe and its companion tools (torch-model-archiver, torch-workflow-archiver) along with the vLLM package. For GPU support, ensure CUDA drivers and the appropriate PyTorch CUDA build are installed. Authenticate with HuggingFace Hub if deploying gated models.

Key considerations:

  • vLLM requires GPU hardware with sufficient VRAM for the target model
  • Use huggingface-cli login for gated models like Llama-3
  • For Docker deployment, build the vLLM-specific Dockerfile provided in the repository
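Before launching, it can help to confirm the prerequisites above are actually visible in the environment. The following is a minimal preflight sketch; it only checks that the CLIs and the vllm package named in this step are importable or on PATH (the nvcc check is a rough stand-in for a working CUDA toolchain, not a full GPU test).

```python
import importlib.util
import shutil

def check_serving_env():
    """Report which TorchServe/vLLM prerequisites are visible in this environment."""
    return {
        "torchserve_cli": shutil.which("torchserve") is not None,
        "model_archiver_cli": shutil.which("torch-model-archiver") is not None,
        "vllm_package": importlib.util.find_spec("vllm") is not None,
        "cuda_toolkit": shutil.which("nvcc") is not None,  # rough GPU-toolchain signal only
    }

status = check_serving_env()
```

A False entry points at the corresponding install step; it does not by itself mean deployment will fail (e.g. Docker images bundle their own toolchain).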

Step 2: Select and Configure the Model

Choose a HuggingFace model identifier supported by vLLM. Configure optional parameters such as tensor parallel size for multi-GPU deployment, maximum model length, and quantization settings. These can be specified via command-line arguments or a model-config.yaml file.

Key considerations:

  • Supported models include Llama, Mistral, GPT-NeoX, Falcon, and others listed in vLLM documentation
  • Set CUDA_VISIBLE_DEVICES to control which GPUs are used
  • For distributed inference, set tensor_parallel_size to the number of GPUs
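As a hedged sketch, a model-config.yaml for this step might look like the following. The key names (handler, vllm_engine_config) and the nested engine options are assumptions modeled on TorchServe's vLLM example; verify them against the repository before use.

```yaml
# Hypothetical model-config.yaml sketch; check key names against the
# TorchServe vLLM example before relying on them.
handler:
  vllm_engine_config:
    max_num_seqs: 16          # cap on concurrently batched sequences
    max_model_len: 4096       # maximum context length in tokens
    tensor_parallel_size: 2   # shard the model across two GPUs
```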

Step 3: Launch TorchServe with LLM Launcher

Execute the llm_launcher module which automatically handles model archive creation, TorchServe server startup, and model registration in a single step. The launcher configures the vLLM engine backend, sets up continuous batching, and exposes the inference endpoint.

Pseudocode:

# Native launch (token auth disabled for local testing)
python -m ts.llm_launcher --model_id <hf_model_id> --disable_token_auth
# Docker launch (image built from the repository's vLLM Dockerfile, tagged ts/vllm here)
docker run --rm --gpus all --shm-size 10g -p 8080:8080 ts/vllm --model_id <hf_model_id> --disable_token_auth

Step 4: Run Inference

Send text generation requests to the TorchServe inference endpoint. The vLLM engine handles continuous batching, automatically managing multiple concurrent requests. Requests support vLLM SamplingParams for controlling generation behavior (temperature, top_p, max_tokens, etc.).

Key considerations:

  • The endpoint supports both TorchServe's native format and OpenAI-compatible completion format
  • Streaming responses are available via HTTP chunked encoding and gRPC server-side streaming
  • New requests are added to the engine continuously without waiting for current batch completion
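The considerations above can be sketched as a request builder for the OpenAI-compatible completion format. The URL path (/v1/completions) and the model id are assumptions for illustration; the sampling fields mirror vLLM SamplingParams named in this step.

```python
import json

# Hedged sketch of an OpenAI-style completion payload; field values are
# illustrative, and the model id is a placeholder assumption.
payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "prompt": "Explain continuous batching in one sentence.",
    "temperature": 0.7,   # vLLM SamplingParams fields
    "top_p": 0.9,
    "max_tokens": 128,
    "stream": True,       # request chunked streaming responses
}

def build_request(url="http://localhost:8080/v1/completions"):
    """Return (url, headers, body) for an OpenAI-compatible completion call."""
    return url, {"Content-Type": "application/json"}, json.dumps(payload)

url, headers, body = build_request()
```

The resulting triple can be sent with any HTTP client; with stream set to True the server returns incremental chunks rather than a single JSON document.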

Step 5: Monitor and Scale

Monitor inference throughput and latency via TorchServe's metrics API (port 8082) which exports Prometheus-compatible metrics. Scale horizontally by deploying additional TorchServe instances behind a load balancer, or scale vertically by increasing tensor parallelism across more GPUs.

Key considerations:

  • Use a single worker per vLLM engine for optimal hardware utilization
  • Tensor parallelism distributes the model across GPUs within a single worker
  • The job ticket feature enables latency-sensitive routing by rejecting requests when all workers are busy
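Since the metrics endpoint on port 8082 emits Prometheus text format, a small parser is enough to pull values into a monitoring script. This is a generic sketch of the exposition format; the metric name in the sample is illustrative, not a guaranteed TorchServe metric.

```python
def parse_prometheus_metrics(text):
    """Parse Prometheus text-format lines into {metric_with_labels: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank, HELP, and TYPE lines
            continue
        name_and_labels, _, value = line.rpartition(" ")
        metrics[name_and_labels] = float(value)
    return metrics

# Hypothetical sample resembling scraped output from port 8082.
sample = """# HELP ts_inference_latency_microseconds Inference latency
ts_inference_latency_microseconds{model_name="llm",model_version="1.0"} 1234.5
"""
latency = parse_prometheus_metrics(sample)
```

In practice the same text is usually scraped by a Prometheus server directly; a hand-rolled parser like this is only useful for quick spot checks or alerts in ad-hoc scripts.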

Execution Diagram

GitHub URL

Workflow Repository