
Implementation:Pytorch Serve LLM Launcher Main

From Leeroopedia
Field Value
Page Type Implementation
Implementation Type API Doc
Domains LLM_Serving, Automation
Knowledge Sources TorchServe
Workflow LLM_Deployment_vLLM
Last Updated 2026-02-13 00:00 GMT

Overview

The LLM Launcher (ts/llm_launcher.py) is a CLI tool that automates the full lifecycle of deploying an LLM on TorchServe: generating model configuration, creating a model archive, starting the server, and cleaning up on shutdown. It supports both vLLM and TensorRT-LLM backends, with the default engine being vLLM and the default model being meta-llama/Meta-Llama-3.1-8B-Instruct.

Description

The launcher module contains three primary functions that form a pipeline:

  1. get_model_config(args) -- generates a model configuration dictionary from CLI arguments and hardware introspection
  2. create_mar_file(args) -- serializes the configuration to YAML, creates a model archive, and manages cleanup via context manager
  3. main(args) -- orchestrates the full deployment: creates the model store directory, builds the archive, starts TorchServe, and waits for termination
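The three-stage pipeline above can be sketched end to end. This is a simplified, self-contained illustration, not the real implementation: the function names mirror `ts/llm_launcher.py`, but the bodies here are stubs (the real `main` also starts TorchServe and blocks until interrupt).

```python
import contextlib
import shutil
import tempfile
from pathlib import Path

def get_model_config(args):
    # Stage 1 (stub): build a config dict from CLI-style arguments.
    return {"handler": {"model_path": args["model_id"]}}

@contextlib.contextmanager
def create_mar_file(args):
    # Stage 2 (stub): write the config (the real launcher serializes it to
    # YAML), yield the archive path, and clean up on exit.
    config = get_model_config(args)
    tmp = Path(tempfile.mkdtemp())
    mar_dir = tmp / args["model_name"]
    mar_dir.mkdir()
    (mar_dir / "model-config.yaml").write_text(str(config))
    try:
        yield mar_dir
    finally:
        # The real vLLM path likewise removes the archive on exit.
        shutil.rmtree(tmp)

def main(args):
    # Stage 3 (stub): build the archive; the real main() would start
    # TorchServe here and wait for SIGINT before cleanup runs.
    with create_mar_file(args) as mar_path:
        created = (mar_path / "model-config.yaml").exists()
    return created

print(main({"model_id": "meta-llama/Meta-Llama-3.1-8B-Instruct",
            "model_name": "model"}))  # True: archive existed inside the context
```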

Usage

# Basic launch with defaults (Llama 3.1 8B, vLLM engine)
python -m ts.llm_launcher --model_id meta-llama/Meta-Llama-3.1-8B-Instruct --engine vllm

# Custom batch size and context length
python -m ts.llm_launcher --model_id meta-llama/Meta-Llama-3.1-8B-Instruct \
    --engine vllm \
    --vllm_engine.max_num_seqs 512 \
    --vllm_engine.max_model_len 4096

# Disable token authentication for local development
python -m ts.llm_launcher --model_id mistralai/Mistral-7B-v0.1 \
    --engine vllm \
    --disable_token_auth

# Custom model store and model name
python -m ts.llm_launcher --model_id meta-llama/Meta-Llama-3.1-8B-Instruct \
    --model_name llama3 \
    --model_store /data/model_store

Code Reference

Source Location

File Lines Function
ts/llm_launcher.py L164-197 main(args) -- orchestration entry point
ts/llm_launcher.py L130-161 create_mar_file(args, model_snapshot_path=None) -- archive creation context manager
ts/llm_launcher.py L63-127 get_model_config(args, model_snapshot_path=None) -- configuration generation
ts/llm_launcher.py L200-287 argparse argument definitions

Signature

def main(args):
    """
    Register the model in torchserve.

    Orchestrates the full LLM deployment lifecycle:
    1. Creates model store directory
    2. Downloads model (for TRT-LLM only; vLLM downloads on engine init)
    3. Creates model archive via create_mar_file context manager
    4. Starts TorchServe with the model pre-registered
    5. Blocks until KeyboardInterrupt (SIGINT)
    6. Stops TorchServe and cleans up the archive

    Parameters:
        args (argparse.Namespace): Parsed CLI arguments including model_id,
            engine, model_store, model_name, and engine-specific parameters.
    """
@contextlib.contextmanager
def create_mar_file(args, model_snapshot_path=None):
    """
    Context manager that creates a model archive and cleans up on exit.

    1. Generates model-config.yaml from get_model_config()
    2. Creates a no-archive format MAR using ModelArchiverConfig
    3. Yields the MAR file path
    4. On exit, removes the MAR directory (for vLLM engine)

    Parameters:
        args (argparse.Namespace): Parsed CLI arguments.
        model_snapshot_path (str|None): Local path to downloaded model snapshot
            (used by TRT-LLM; None for vLLM).

    Yields:
        str: Path to the created model archive directory.
    """
def get_model_config(args, model_snapshot_path=None):
    """
    Generate model configuration dictionary for TorchServe.

    For vLLM engine, auto-detects GPU count via torch.cuda.device_count()
    and sets tensor_parallel_size accordingly. Constructs the handler
    configuration with vllm_engine_config parameters.

    Parameters:
        args (argparse.Namespace): Parsed CLI arguments.
        model_snapshot_path (str|None): Local model path (for TRT-LLM).

    Returns:
        dict: Model configuration suitable for serialization to YAML.
            Keys include: minWorkers, maxWorkers, batchSize, maxBatchDelay,
            responseTimeout, startupTimeout, deviceType, asyncCommunication,
            parallelLevel, handler (with model_path and vllm_engine_config).
    """

Import

# The launcher is invoked as a module:
# python -m ts.llm_launcher [OPTIONS]

# Internal imports used by the module:
from model_archiver import ModelArchiverConfig
from model_archiver.model_packaging import generate_model_archive
from ts.launcher import start, stop
from ts.utils.hf_utils import download_model

I/O Contract

Direction Type Description
Input CLI arguments Model ID, engine type, engine-specific parameters (see argument table below)
Output Running server TorchServe process listening on default ports (8080 inference, 8081 management, 8082 metrics)
Side Effect File system Model archive directory created in --model_store path; cleaned up on vLLM exit
Precondition Environment PyTorch, TorchServe, vLLM installed; GPU(s) available; model accessible (HuggingFace credentials if gated model)
Postcondition Server state Model registered and serving; server blocks until SIGINT

CLI Argument Reference

Argument Type Default Description
--model_name str "model" Name for the registered model
--model_store str "model_store" Directory for model archives
--model_id str "meta-llama/Meta-Llama-3.1-8B-Instruct" HuggingFace model ID or local path
--disable_token_auth flag false Disable TorchServe token authentication
--vllm_engine.max_num_seqs int 256 Maximum concurrent sequences in vLLM batch
--vllm_engine.max_model_len int None (model default) Maximum context length in tokens
--vllm_engine.download_dir str None Custom model download/cache directory
--startup_timeout int 1200 Model startup timeout in seconds
--engine str "vllm" LLM engine backend (vllm or trt_llm)
--dtype str "bfloat16" Data type for model weights
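A parser matching the table above can be sketched with stdlib `argparse`. This is an approximation of the real definitions in `ts/llm_launcher.py`, shown here to illustrate one detail: the dotted option names such as `--vllm_engine.max_num_seqs` are plain argparse options whose destination attribute contains a dot, so they must be read back with `getattr`.

```python
import argparse

# Sketch of an argument parser mirroring the CLI table (not the exact
# definitions from ts/llm_launcher.py).
parser = argparse.ArgumentParser("llm_launcher sketch")
parser.add_argument("--model_name", type=str, default="model")
parser.add_argument("--model_store", type=str, default="model_store")
parser.add_argument("--model_id", type=str,
                    default="meta-llama/Meta-Llama-3.1-8B-Instruct")
parser.add_argument("--disable_token_auth", action="store_true")
parser.add_argument("--vllm_engine.max_num_seqs", type=int, default=256)
parser.add_argument("--vllm_engine.max_model_len", type=int, default=None)
parser.add_argument("--vllm_engine.download_dir", type=str, default=None)
parser.add_argument("--startup_timeout", type=int, default=1200)
parser.add_argument("--engine", type=str, default="vllm",
                    choices=["vllm", "trt_llm"])
parser.add_argument("--dtype", type=str, default="bfloat16")

args = parser.parse_args(["--engine", "vllm", "--disable_token_auth"])
print(args.model_id)                              # meta-llama/Meta-Llama-3.1-8B-Instruct
print(getattr(args, "vllm_engine.max_num_seqs"))  # 256
```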

Usage Examples

Example 1: Default Launch (Llama 3.1 8B with vLLM)

python -m ts.llm_launcher

This uses all defaults:

  • Model: meta-llama/Meta-Llama-3.1-8B-Instruct
  • Engine: vLLM
  • max_num_seqs: 256
  • tensor_parallel_size: auto-detected from GPU count

The generated model configuration (internal) will be:

{
    "minWorkers": 1,
    "maxWorkers": 1,
    "batchSize": 1,
    "maxBatchDelay": 100,
    "responseTimeout": 1200,
    "startupTimeout": 1200,
    "deviceType": "gpu",
    "asyncCommunication": True,
    "parallelLevel": torch.cuda.device_count(),  # e.g., 4
    "handler": {
        "model_path": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "vllm_engine_config": {
            "max_num_seqs": 256,
            "max_model_len": None,
            "download_dir": None,
            "tensor_parallel_size": torch.cuda.device_count(),  # e.g., 4
        },
    },
}

Example 2: Server Lifecycle

The main() function manages the complete lifecycle:

from pathlib import Path
from signal import pause  # blocks the process until a signal arrives

from ts.launcher import start, stop

def main(args):
    # 1. Create model store directory
    model_store_path = Path(args.model_store)
    model_store_path.mkdir(parents=True, exist_ok=True)

    # 2. For vLLM, no pre-download needed (engine handles it)
    model_snapshot_path = None

    # 3. Create archive, start server, wait for interrupt
    with create_mar_file(args, model_snapshot_path):
        try:
            start(
                model_store=args.model_store,
                no_config_snapshots=True,
                models=args.model_name,
                disable_token=args.disable_token_auth,
            )
            pause()  # Block until SIGINT
        except KeyboardInterrupt:
            pass
        finally:
            stop(wait=False)  # Shut down TorchServe
    # Context manager cleans up MAR directory for vLLM
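The shutdown guarantee in `main()` is worth making explicit: even when the blocking wait is interrupted by Ctrl-C, the `finally` clause still runs, so the server stop and archive cleanup are never skipped. A minimal sketch, with a stubbed `stop()` and the interrupt simulated instead of delivered as a real SIGINT:

```python
events = []

def stop(wait=False):
    # Stub standing in for ts.launcher.stop().
    events.append("stopped")

def serve_until_interrupt():
    try:
        # Stand-in for signal.pause(); a real Ctrl-C raises KeyboardInterrupt.
        raise KeyboardInterrupt
    except KeyboardInterrupt:
        events.append("interrupted")
    finally:
        # Runs regardless of how the try block exits.
        stop(wait=False)

serve_until_interrupt()
print(events)  # ['interrupted', 'stopped']
```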

Example 3: Testing the Endpoint After Launch

Once the launcher is running, test the endpoint with:

# Chat completions (OpenAI-compatible)
curl -X POST http://localhost:8080/predictions/model/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100
    }'

# Text completions
curl -X POST http://localhost:8080/predictions/model/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "prompt": "Once upon a time",
        "max_tokens": 100
    }'
