Implementation:Pytorch Serve LLM Launcher Main
| Field | Value |
|---|---|
| Page Type | Implementation |
| Implementation Type | API Doc |
| Domains | LLM_Serving, Automation |
| Knowledge Sources | TorchServe |
| Workflow | LLM_Deployment_vLLM |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
The LLM Launcher (ts/llm_launcher.py) is a CLI tool that automates the full lifecycle of deploying an LLM on TorchServe: generating model configuration, creating a model archive, starting the server, and cleaning up on shutdown. It supports both vLLM and TensorRT-LLM backends; the default engine is vLLM and the default model is meta-llama/Meta-Llama-3.1-8B-Instruct.
Description
The launcher module contains three primary functions that form a pipeline:
- get_model_config(args) -- generates a model configuration dictionary from CLI arguments and hardware introspection
- create_mar_file(args) -- serializes the configuration to YAML, creates a model archive, and manages cleanup via context manager
- main(args) -- orchestrates the full deployment: creates the model store directory, builds the archive, starts TorchServe, and waits for termination
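The three functions compose into a simple pipeline. A minimal runnable sketch of that shape (stubbed bodies and hypothetical details for illustration, not the actual source):

```python
from contextlib import contextmanager

def get_model_config(args):
    # Sketch: the real function also inspects hardware and engine options
    return {"handler": {"model_path": args["model_id"]}}

@contextmanager
def create_mar_file(args):
    # Sketch: the real function writes model-config.yaml and builds the MAR
    config = get_model_config(args)  # serialized to YAML in the real code
    mar_path = f"{args['model_store']}/{args['model_name']}"
    try:
        yield mar_path
    finally:
        pass  # the real launcher removes the archive directory here (vLLM)

def main(args):
    # Sketch: the real function starts TorchServe inside this block
    with create_mar_file(args) as mar_path:
        return mar_path

print(main({"model_id": "m", "model_store": "model_store", "model_name": "model"}))
```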
Usage
# Basic launch with defaults (Llama 3.1 8B, vLLM engine)
python -m ts.llm_launcher --model_id meta-llama/Meta-Llama-3.1-8B-Instruct --engine vllm
# Custom batch size and context length
python -m ts.llm_launcher --model_id meta-llama/Meta-Llama-3.1-8B-Instruct \
--engine vllm \
--vllm_engine.max_num_seqs 512 \
--vllm_engine.max_model_len 4096
# Disable token authentication for local development
python -m ts.llm_launcher --model_id mistralai/Mistral-7B-v0.1 \
--engine vllm \
--disable_token_auth
# Custom model store and model name
python -m ts.llm_launcher --model_id meta-llama/Meta-Llama-3.1-8B-Instruct \
--model_name llama3 \
--model_store /data/model_store
Code Reference
Source Location
| File | Lines | Function |
|---|---|---|
| ts/llm_launcher.py | L164-197 | main(args) -- orchestration entry point |
| ts/llm_launcher.py | L130-161 | create_mar_file(args, model_snapshot_path=None) -- archive creation context manager |
| ts/llm_launcher.py | L63-127 | get_model_config(args, model_snapshot_path=None) -- configuration generation |
| ts/llm_launcher.py | L200-287 | argparse argument definitions |
Signature
def main(args):
    """
    Register the model in torchserve.

    Orchestrates the full LLM deployment lifecycle:
    1. Creates model store directory
    2. Downloads model (for TRT-LLM only; vLLM downloads on engine init)
    3. Creates model archive via create_mar_file context manager
    4. Starts TorchServe with the model pre-registered
    5. Blocks until KeyboardInterrupt (SIGINT)
    6. Stops TorchServe and cleans up the archive

    Parameters:
        args (argparse.Namespace): Parsed CLI arguments including model_id,
            engine, model_store, model_name, and engine-specific parameters.
    """

@contextlib.contextmanager
def create_mar_file(args, model_snapshot_path=None):
    """
    Context manager that creates a model archive and cleans up on exit.

    1. Generates model-config.yaml from get_model_config()
    2. Creates a no-archive format MAR using ModelArchiverConfig
    3. Yields the MAR file path
    4. On exit, removes the MAR directory (for vLLM engine)

    Parameters:
        args (argparse.Namespace): Parsed CLI arguments.
        model_snapshot_path (str|None): Local path to downloaded model snapshot
            (used by TRT-LLM; None for vLLM).

    Yields:
        str: Path to the created model archive directory.
    """

def get_model_config(args, model_snapshot_path=None):
    """
    Generate model configuration dictionary for TorchServe.

    For vLLM engine, auto-detects GPU count via torch.cuda.device_count()
    and sets tensor_parallel_size accordingly. Constructs the handler
    configuration with vllm_engine_config parameters.

    Parameters:
        args (argparse.Namespace): Parsed CLI arguments.
        model_snapshot_path (str|None): Local model path (for TRT-LLM).

    Returns:
        dict: Model configuration suitable for serialization to YAML.
        Keys include: minWorkers, maxWorkers, batchSize, maxBatchDelay,
        responseTimeout, startupTimeout, deviceType, asyncCommunication,
        parallelLevel, handler (with model_path and vllm_engine_config).
    """
Import
# The launcher is invoked as a module:
# python -m ts.llm_launcher [OPTIONS]
# Internal imports used by the module:
from model_archiver import ModelArchiverConfig
from model_archiver.model_packaging import generate_model_archive
from ts.launcher import start, stop
from ts.utils.hf_utils import download_model
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | CLI arguments | Model ID, engine type, engine-specific parameters (see argument table below) |
| Output | Running server | TorchServe process listening on default ports (8080 inference, 8081 management, 8082 metrics) |
| Side Effect | File system | Model archive directory created in --model_store path; cleaned up on vLLM exit |
| Precondition | Environment | PyTorch, TorchServe, vLLM installed; GPU(s) available; model accessible (HuggingFace credentials if gated model) |
| Postcondition | Server state | Model registered and serving; server blocks until SIGINT |
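Once the server is up, its health can be checked against TorchServe's /ping endpoint on the inference port. A minimal polling sketch using only the standard library (the endpoint and port follow TorchServe defaults; the retry parameters are arbitrary):

```python
import time
import urllib.error
import urllib.request

def server_ready(host="http://localhost:8080", retries=10, delay=1.0):
    """Poll TorchServe's /ping health endpoint until it answers.

    Returns True once /ping responds with HTTP 200, False if the
    server never comes up within the retry budget.
    """
    for _ in range(retries):
        try:
            with urllib.request.urlopen(f"{host}/ping", timeout=1) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            time.sleep(delay)
    return False
```

This is useful in scripts that launch the server and then need to wait before sending inference requests, since model startup can take up to the 1200 s startup_timeout.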
CLI Argument Reference
| Argument | Type | Default | Description |
|---|---|---|---|
| --model_name | str | "model" | Name for the registered model |
| --model_store | str | "model_store" | Directory for model archives |
| --model_id | str | "meta-llama/Meta-Llama-3.1-8B-Instruct" | HuggingFace model ID or local path |
| --disable_token_auth | flag | false | Disable TorchServe token authentication |
| --vllm_engine.max_num_seqs | int | 256 | Maximum concurrent sequences in vLLM batch |
| --vllm_engine.max_model_len | int | None (model default) | Maximum context length in tokens |
| --vllm_engine.download_dir | str | None | Custom model download/cache directory |
| --startup_timeout | int | 1200 | Model startup timeout in seconds |
| --engine | str | "vllm" | LLM engine backend (vllm or trt_llm) |
| --dtype | str | "bfloat16" | Data type for model weights |
Usage Examples
Example 1: Default Launch (Llama 3.1 8B with vLLM)
python -m ts.llm_launcher
This uses all defaults:
- Model: meta-llama/Meta-Llama-3.1-8B-Instruct
- Engine: vLLM
- max_num_seqs: 256
- tensor_parallel_size: auto-detected from GPU count
The generated model configuration (internal) will be:
{
    "minWorkers": 1,
    "maxWorkers": 1,
    "batchSize": 1,
    "maxBatchDelay": 100,
    "responseTimeout": 1200,
    "startupTimeout": 1200,
    "deviceType": "gpu",
    "asyncCommunication": True,
    "parallelLevel": torch.cuda.device_count(),  # e.g., 4
    "handler": {
        "model_path": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "vllm_engine_config": {
            "max_num_seqs": 256,
            "max_model_len": None,
            "download_dir": None,
            "tensor_parallel_size": torch.cuda.device_count(),  # e.g., 4
        },
    },
}
Example 2: Server Lifecycle
The main() function manages the complete lifecycle:
from pathlib import Path
from signal import pause

from ts.launcher import start, stop

def main(args):
    # 1. Create model store directory
    model_store_path = Path(args.model_store)
    model_store_path.mkdir(parents=True, exist_ok=True)

    # 2. For vLLM, no pre-download needed (the engine handles it)
    model_snapshot_path = None

    # 3. Create archive, start server, wait for interrupt
    with create_mar_file(args, model_snapshot_path):
        try:
            start(
                model_store=args.model_store,
                no_config_snapshots=True,
                models=args.model_name,
                disable_token=args.disable_token_auth,
            )
            pause()  # Block until SIGINT
        except KeyboardInterrupt:
            pass
        finally:
            stop(wait=False)  # Shut down TorchServe
    # Context manager cleans up the MAR directory for vLLM
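The create_mar_file context manager pairs archive creation with cleanup. A minimal runnable sketch of that pattern (directory handling only; the real function also writes model-config.yaml and calls generate_model_archive):

```python
import contextlib
import shutil
import tempfile
from pathlib import Path

@contextlib.contextmanager
def mar_directory(model_store, model_name):
    """Sketch of the create_mar_file setup/teardown pattern."""
    mar_path = Path(model_store) / model_name
    mar_path.mkdir(parents=True, exist_ok=True)
    try:
        yield mar_path
    finally:
        # The launcher removes the archive directory on exit (vLLM engine)
        shutil.rmtree(mar_path, ignore_errors=True)

with tempfile.TemporaryDirectory() as store:
    with mar_directory(store, "model") as p:
        exists_inside = p.is_dir()
    exists_after = p.is_dir()
print(exists_inside, exists_after)  # True False
```

Because the try/finally wraps the yield, the archive directory is removed even if TorchServe fails to start or the user interrupts with Ctrl+C.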
Example 3: Testing the Endpoint After Launch
Once the launcher is running, test the endpoint with:
# Chat completions (OpenAI-compatible)
curl -X POST http://localhost:8080/predictions/model/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}'
# Text completions
curl -X POST http://localhost:8080/predictions/model/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"prompt": "Once upon a time",
"max_tokens": 100
}'
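The same chat request can be driven from Python with only the standard library. This sketch builds the request object without sending it (pass it to urllib.request.urlopen against a live server); the URL mirrors the curl example above:

```python
import json
import urllib.request

def chat_request(prompt, model_name="model", base="http://localhost:8080"):
    """Build the OpenAI-style chat completions request shown above.

    Returns a prepared urllib Request; send it with
    urllib.request.urlopen(req) once the launcher is running.
    """
    url = f"{base}/predictions/{model_name}/v1/chat/completions"
    body = {
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 100,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = chat_request("Hello!")
print(req.full_url)
```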
Related Pages
- Principle:Pytorch_Serve_LLM_Quick_Start -- the design principles behind single-command LLM deployment
- Environment:Pytorch_Serve_vLLM_Engine_Environment -- vLLM engine environment (when engine=vllm)
- Environment:Pytorch_Serve_CUDA_GPU_Environment -- GPU environment for LLM inference
- Heuristic:Pytorch_Serve_Batch_Size_Tuning -- LLM batch_size=1 with internal batching via max_num_seqs
- Heuristic:Pytorch_Serve_LLM_Timeout_Configuration -- 1200s timeout and async communication defaults