Implementation:Pytorch Serve LLM Launcher Main
| Field | Value |
|---|---|
| Page Type | Implementation |
| Implementation Type | API Doc |
| Domains | LLM_Serving, Automation |
| Knowledge Sources | TorchServe |
| Workflow | LLM_Deployment_vLLM |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
The LLM Launcher (ts/llm_launcher.py) is a CLI tool that automates the full lifecycle of deploying an LLM on TorchServe: generating model configuration, creating a model archive, starting the server, and cleaning up on shutdown. It supports both vLLM and TensorRT-LLM backends; the default engine is vLLM and the default model is meta-llama/Meta-Llama-3.1-8B-Instruct.
Description
The launcher module contains three primary functions that form a pipeline:
- get_model_config(args) -- generates a model configuration dictionary from CLI arguments and hardware introspection
- create_mar_file(args) -- serializes the configuration to YAML, creates a model archive, and manages cleanup via context manager
- main(args) -- orchestrates the full deployment: creates the model store directory, builds the archive, starts TorchServe, and waits for termination
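The three functions compose into a simple pipeline. A minimal runnable sketch of that shape (stubbed bodies and hypothetical details for illustration, not the actual source):

```python
from contextlib import contextmanager

def get_model_config(args):
    # Sketch: the real function also inspects hardware and engine options
    return {"handler": {"model_path": args["model_id"]}}

@contextmanager
def create_mar_file(args):
    # Sketch: the real function writes model-config.yaml and builds the MAR
    config = get_model_config(args)  # serialized to YAML in the real code
    mar_path = f"{args['model_store']}/{args['model_name']}"
    try:
        yield mar_path
    finally:
        pass  # the real launcher removes the archive directory here (vLLM)

def main(args):
    # Sketch: the real function starts TorchServe inside this block
    with create_mar_file(args) as mar_path:
        return mar_path

print(main({"model_id": "m", "model_store": "model_store", "model_name": "model"}))
```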
Usage
# Basic launch with defaults (Llama 3.1 8B, vLLM engine)
python -m ts.llm_launcher --model_id meta-llama/Meta-Llama-3.1-8B-Instruct --engine vllm
# Custom batch size and context length
python -m ts.llm_launcher --model_id meta-llama/Meta-Llama-3.1-8B-Instruct \
--engine vllm \
--vllm_engine.max_num_seqs 512 \
--vllm_engine.max_model_len 4096
# Disable token authentication for local development
python -m ts.llm_launcher --model_id mistralai/Mistral-7B-v0.1 \
--engine vllm \
--disable_token_auth
# Custom model store and model name
python -m ts.llm_launcher --model_id meta-llama/Meta-Llama-3.1-8B-Instruct \
--model_name llama3 \
--model_store /data/model_store
Code Reference
Source Location
| File | Lines | Function |
|---|---|---|
| ts/llm_launcher.py | L164-197 | main(args) -- orchestration entry point |
| ts/llm_launcher.py | L130-161 | create_mar_file(args, model_snapshot_path=None) -- archive creation context manager |
| ts/llm_launcher.py | L63-127 | get_model_config(args, model_snapshot_path=None) -- configuration generation |
| ts/llm_launcher.py | L200-287 | argparse argument definitions |
Signature
def main(args):
    """
    Register the model in torchserve.

    Orchestrates the full LLM deployment lifecycle:
    1. Creates model store directory
    2. Downloads model (for TRT-LLM only; vLLM downloads on engine init)
    3. Creates model archive via create_mar_file context manager
    4. Starts TorchServe with the model pre-registered
    5. Blocks until KeyboardInterrupt (SIGINT)
    6. Stops TorchServe and cleans up the archive

    Parameters:
        args (argparse.Namespace): Parsed CLI arguments including model_id,
            engine, model_store, model_name, and engine-specific parameters.
    """

@contextlib.contextmanager
def create_mar_file(args, model_snapshot_path=None):
    """
    Context manager that creates a model archive and cleans up on exit.

    1. Generates model-config.yaml from get_model_config()
    2. Creates a no-archive format MAR using ModelArchiverConfig
    3. Yields the MAR file path
    4. On exit, removes the MAR directory (for vLLM engine)

    Parameters:
        args (argparse.Namespace): Parsed CLI arguments.
        model_snapshot_path (str|None): Local path to downloaded model snapshot
            (used by TRT-LLM; None for vLLM).

    Yields:
        str: Path to the created model archive directory.
    """

def get_model_config(args, model_snapshot_path=None):
    """
    Generate model configuration dictionary for TorchServe.

    For vLLM engine, auto-detects GPU count via torch.cuda.device_count()
    and sets tensor_parallel_size accordingly. Constructs the handler
    configuration with vllm_engine_config parameters.

    Parameters:
        args (argparse.Namespace): Parsed CLI arguments.
        model_snapshot_path (str|None): Local model path (for TRT-LLM).

    Returns:
        dict: Model configuration suitable for serialization to YAML.
        Keys include: minWorkers, maxWorkers, batchSize, maxBatchDelay,
        responseTimeout, startupTimeout, deviceType, asyncCommunication,
        parallelLevel, handler (with model_path and vllm_engine_config).
    """
Import
# The launcher is invoked as a module:
# python -m ts.llm_launcher [OPTIONS]
# Internal imports used by the module:
from model_archiver import ModelArchiverConfig
from model_archiver.model_packaging import generate_model_archive
from ts.launcher import start, stop
from ts.utils.hf_utils import download_model
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | CLI arguments | Model ID, engine type, engine-specific parameters (see argument table below) |
| Output | Running server | TorchServe process listening on default ports (8080 inference, 8081 management, 8082 metrics) |
| Side Effect | File system | Model archive directory created in --model_store path; cleaned up on vLLM exit |
| Precondition | Environment | PyTorch, TorchServe, vLLM installed; GPU(s) available; model accessible (HuggingFace credentials if gated model) |
| Postcondition | Server state | Model registered and serving; server blocks until SIGINT |
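Once the server is up, its health can be checked against TorchServe's /ping endpoint on the inference port. A minimal polling sketch using only the standard library (the endpoint and port follow TorchServe defaults; the retry parameters are arbitrary):

```python
import time
import urllib.error
import urllib.request

def server_ready(host="http://localhost:8080", retries=10, delay=1.0):
    """Poll TorchServe's /ping health endpoint until it answers.

    Returns True once /ping responds with HTTP 200, False if the
    server never comes up within the retry budget.
    """
    for _ in range(retries):
        try:
            with urllib.request.urlopen(f"{host}/ping", timeout=1) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            time.sleep(delay)
    return False
```

This is useful in scripts that launch the server and then need to wait before sending inference requests, since model startup can take up to the 1200 s startup_timeout.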
CLI Argument Reference
| Argument | Type | Default | Description |
|---|---|---|---|
| --model_name | str | "model" | Name for the registered model |
| --model_store | str | "model_store" | Directory for model archives |
| --model_id | str | "meta-llama/Meta-Llama-3.1-8B-Instruct" | HuggingFace model ID or local path |
| --disable_token_auth | flag | false | Disable TorchServe token authentication |
| --vllm_engine.max_num_seqs | int | 256 | Maximum concurrent sequences in vLLM batch |
| --vllm_engine.max_model_len | int | None (model default) | Maximum context length in tokens |
| --vllm_engine.download_dir | str | None | Custom model download/cache directory |
| --startup_timeout | int | 1200 | Model startup timeout in seconds |
| --engine | str | "vllm" | LLM engine backend (vllm or trt_llm) |
| --dtype | str | "bfloat16" | Data type for model weights |
Usage Examples
Example 1: Default Launch (Llama 3.1 8B with vLLM)
python -m ts.llm_launcher
This uses all defaults:
- Model: meta-llama/Meta-Llama-3.1-8B-Instruct
- Engine: vLLM
- max_num_seqs: 256
- tensor_parallel_size: auto-detected from GPU count
The generated model configuration (internal) will be:
{
    "minWorkers": 1,
    "maxWorkers": 1,
    "batchSize": 1,
    "maxBatchDelay": 100,
    "responseTimeout": 1200,
    "startupTimeout": 1200,
    "deviceType": "gpu",
    "asyncCommunication": True,
    "parallelLevel": torch.cuda.device_count(),  # e.g., 4
    "handler": {
        "model_path": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "vllm_engine_config": {
            "max_num_seqs": 256,
            "max_model_len": None,
            "download_dir": None,
            "tensor_parallel_size": torch.cuda.device_count(),  # e.g., 4
        },
    },
}
Example 2: Server Lifecycle
The main() function manages the complete lifecycle:
from pathlib import Path
from signal import pause

from ts.launcher import start, stop

def main(args):
    # 1. Create model store directory
    model_store_path = Path(args.model_store)
    model_store_path.mkdir(parents=True, exist_ok=True)

    # 2. For vLLM, no pre-download needed (the engine handles it)
    model_snapshot_path = None

    # 3. Create archive, start server, wait for interrupt
    with create_mar_file(args, model_snapshot_path):
        try:
            start(
                model_store=args.model_store,
                no_config_snapshots=True,
                models=args.model_name,
                disable_token=args.disable_token_auth,
            )
            pause()  # Block until SIGINT
        except KeyboardInterrupt:
            pass
        finally:
            stop(wait=False)  # Shut down TorchServe
    # Context manager cleans up the MAR directory for vLLM
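The create_mar_file context manager pairs archive creation with cleanup. A minimal runnable sketch of that pattern (directory handling only; the real function also writes model-config.yaml and calls generate_model_archive):

```python
import contextlib
import shutil
import tempfile
from pathlib import Path

@contextlib.contextmanager
def mar_directory(model_store, model_name):
    """Sketch of the create_mar_file setup/teardown pattern."""
    mar_path = Path(model_store) / model_name
    mar_path.mkdir(parents=True, exist_ok=True)
    try:
        yield mar_path
    finally:
        # The launcher removes the archive directory on exit (vLLM engine)
        shutil.rmtree(mar_path, ignore_errors=True)

with tempfile.TemporaryDirectory() as store:
    with mar_directory(store, "model") as p:
        exists_inside = p.is_dir()
    exists_after = p.is_dir()
print(exists_inside, exists_after)  # True False
```

Because the try/finally wraps the yield, the archive directory is removed even if TorchServe fails to start or the user interrupts with Ctrl+C.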
Example 3: Testing the Endpoint After Launch
Once the launcher is running, test the endpoint with:
# Chat completions (OpenAI-compatible)
curl -X POST http://localhost:8080/predictions/model/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}'
# Text completions
curl -X POST http://localhost:8080/predictions/model/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"prompt": "Once upon a time",
"max_tokens": 100
}'
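The same chat request can be driven from Python with only the standard library. This sketch builds the request object without sending it (pass it to urllib.request.urlopen against a live server); the URL mirrors the curl example above:

```python
import json
import urllib.request

def chat_request(prompt, model_name="model", base="http://localhost:8080"):
    """Build the OpenAI-style chat completions request shown above.

    Returns a prepared urllib Request; send it with
    urllib.request.urlopen(req) once the launcher is running.
    """
    url = f"{base}/predictions/{model_name}/v1/chat/completions"
    body = {
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 100,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = chat_request("Hello!")
print(req.full_url)
```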
Related Pages
- Principle:Pytorch_Serve_LLM_Quick_Start -- the design principles behind single-command LLM deployment
- Environment:Pytorch_Serve_vLLM_Engine_Environment -- vLLM engine environment (when engine=vllm)
- Environment:Pytorch_Serve_CUDA_GPU_Environment -- GPU environment for LLM inference
- Heuristic:Pytorch_Serve_Batch_Size_Tuning -- LLM batch_size=1 with internal batching via max_num_seqs
- Heuristic:Pytorch_Serve_LLM_Timeout_Configuration -- 1200s timeout and async communication defaults