Workflow:Pytorch Serve Model Deployment

From Leeroopedia
Knowledge Sources
Domains MLOps, Model_Serving, PyTorch
Last Updated 2026-02-13 18:00 GMT

Overview

End-to-end process for packaging a PyTorch model as a Model Archive (MAR) and serving it for inference via TorchServe's REST and gRPC APIs.

Description

This workflow covers the standard procedure for deploying any PyTorch model to production using TorchServe. It begins with developing a handler that defines preprocess, inference, and postprocess logic, proceeds through model archiving into a redistributable .mar file, and concludes with server startup, model registration, and inference execution. The workflow supports both eager mode and TorchScript models, with optional torch.compile acceleration.

Usage

Execute this workflow when you have a trained PyTorch model (either a TorchScript archive or an eager-mode state_dict) and need to deploy it behind a REST/gRPC API for production inference. This is the primary "getting started" path for any TorchServe deployment.

Execution Steps

Step 1: Develop the Inference Handler

Create or select a handler that defines how TorchServe loads the model and processes requests. TorchServe provides built-in handlers (image_classifier, text_classifier, object_detector, image_segmenter) for common tasks. For custom use cases, extend BaseHandler and override preprocess, inference, or postprocess methods as needed.

Key considerations:

  • Prefer extending BaseHandler rather than writing from scratch
  • A handler written from scratch must implement an initialize method for model loading and a handle method for inference; BaseHandler already provides both
  • Use the Context object to access model_dir, GPU assignment, and batch metadata
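The handler contract described above can be sketched in plain Python. This is a minimal from-scratch illustration, not TorchServe's own code: it avoids importing TorchServe or PyTorch so it runs anywhere, and the class name LengthHandler, the file name model.pt, and the stand-in "model" (a function returning string lengths) are all hypothetical. In practice you would subclass BaseHandler and override only the stages you need.

```python
import os


class LengthHandler:
    """From-scratch handler sketch exposing initialize() and handle()."""

    def __init__(self):
        self.initialized = False
        self.model = None
        self.device = "cpu"
        self.model_path = None

    def initialize(self, context):
        # The Context object carries the model directory and GPU assignment.
        props = context.system_properties
        gpu_id = props.get("gpu_id")
        self.device = f"cuda:{gpu_id}" if gpu_id is not None else "cpu"
        # Hypothetical weights file; a real handler would load it here,
        # for example with torch.jit.load or load_state_dict.
        self.model_path = os.path.join(props.get("model_dir"), "model.pt")
        # Stand-in for a model: maps each input string to its length.
        self.model = lambda batch: [len(item) for item in batch]
        self.initialized = True

    def preprocess(self, requests):
        # Each request in the batch is a dict holding "data" or "body".
        rows = []
        for req in requests:
            payload = req.get("data") or req.get("body")
            if isinstance(payload, (bytes, bytearray)):
                payload = payload.decode("utf-8")
            rows.append(payload)
        return rows

    def inference(self, batch):
        return self.model(batch)

    def postprocess(self, outputs):
        # Return one result per request, in request order.
        return [{"length": n} for n in outputs]

    def handle(self, data, context):
        return self.postprocess(self.inference(self.preprocess(data)))
```

The three stage methods are kept separate so that a subclass can replace one stage (for example, only preprocess) without touching the others, which mirrors how BaseHandler is meant to be extended.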

Step 2: Prepare Model Artifacts

Gather the required model files: the serialized model weights (.pt or .pth), the model architecture file (for eager mode), and any extra files (index_to_name.json, tokenizer configs, etc.). Optionally create a model-config.yaml to configure worker count, batch size, response timeout, and device settings.

Key considerations:

  • For TorchScript models, only the serialized file is needed
  • For eager mode, both model architecture file and state_dict are required
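  • Include a requirements.txt (passed to the archiver with --requirements-file) if the handler depends on packages not bundled with TorchServe; per-model installation also requires install_py_dep_per_model=true in config.properties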
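As an illustration of the optional model-config.yaml, the sketch below renders a small configuration covering worker count, batch size, response timeout, and device. The helper name render_model_config and every default value are assumptions chosen for the example; the YAML is written by hand so that no YAML library is required.

```python
def render_model_config(min_workers=1, max_workers=2, batch_size=4,
                        max_batch_delay_ms=100, response_timeout_s=120,
                        device_type="cpu"):
    """Return the text of a minimal model-config.yaml (illustrative values)."""
    lines = [
        f"minWorkers: {min_workers}",
        f"maxWorkers: {max_workers}",
        f"batchSize: {batch_size}",
        f"maxBatchDelay: {max_batch_delay_ms}",
        f"responseTimeout: {response_timeout_s}",
        f"deviceType: {device_type}",
    ]
    return "\n".join(lines) + "\n"


config_text = render_model_config(batch_size=8, device_type="gpu")
print(config_text)
```

Writing config_text to a file named model-config.yaml makes it available to the archiving step.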

Step 3: Create Model Archive (MAR)

Use the torch-model-archiver CLI to package handler, model weights, extra files, and configuration into a single .mar file. This archive is self-contained and can be redistributed to any TorchServe instance.

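Example (set the uppercase shell variables to your own values first):

# HANDLER is a built-in handler name or the path to a handler .py file.
# EXTRA_FILES is a comma-separated list of paths.
# For eager mode, also pass --model-file with the architecture file.
mkdir -p model_store   # make sure the export directory exists
torch-model-archiver \
  --model-name "$MODEL_NAME" \
  --version "$MODEL_VERSION" \
  --handler "$HANDLER" \
  --serialized-file "$WEIGHTS_FILE" \
  --extra-files "$EXTRA_FILES" \
  --config-file model-config.yaml \
  --export-path model_store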
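When archiving is driven from a build script, the same command can be assembled as an argument list. The following is a sketch under assumptions: the helper name archiver_command and the example file names (weights.pth, model.py, index_to_name.json) are hypothetical, and only flags already mentioned in this workflow, plus --model-file for eager mode, are used.

```python
import shlex


def archiver_command(model_name, version, handler, serialized_file,
                     export_path="model_store", model_file=None,
                     extra_files=(), config_file=None):
    """Build the torch-model-archiver argument list without running it."""
    cmd = [
        "torch-model-archiver",
        "--model-name", model_name,
        "--version", version,
        "--handler", handler,
        "--serialized-file", serialized_file,
        "--export-path", export_path,
    ]
    if model_file:  # eager mode needs the architecture file as well
        cmd += ["--model-file", model_file]
    if extra_files:  # the CLI expects one comma-separated value
        cmd += ["--extra-files", ",".join(extra_files)]
    if config_file:
        cmd += ["--config-file", config_file]
    return cmd


cmd = archiver_command(
    "my_model", "1.0", "image_classifier", "weights.pth",
    model_file="model.py",
    extra_files=["index_to_name.json"],
    config_file="model-config.yaml",
)
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would execute it; building a list rather than a single string avoids shell quoting problems with paths that contain spaces.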

Step 4: Start TorchServe

Launch the TorchServe server, pointing it to the model store directory containing the .mar file. TorchServe starts a Java frontend for API management and spawns Python backend workers for model inference. Optionally register models at startup using the --models flag.

Key considerations:

  • TorchServe exposes three ports: 8080 (inference), 8081 (management), 8082 (metrics)
  • Token authorization is enabled by default; disable with --disable-token-auth for development
  • By default, the number of workers per model is derived from the available GPUs, or from the available vCPUs when no GPU is present
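The startup options above can be combined into a single launch command. This sketch only builds the argument list; the helper name torchserve_start_command and the model name my_model are hypothetical, and the port table simply restates the defaults listed above.

```python
# Default ports as listed above (assumes no overrides in config.properties).
DEFAULT_PORTS = {"inference": 8080, "management": 8081, "metrics": 8082}


def torchserve_start_command(model_store, models=None, disable_token_auth=False):
    """Build the torchserve launch argument list without running it."""
    cmd = ["torchserve", "--start", "--model-store", model_store]
    if models:
        # Each entry registers one archive at startup as name=file.mar.
        cmd.append("--models")
        cmd += [f"{name}={mar}" for name, mar in models.items()]
    if disable_token_auth:
        # Intended for local development only.
        cmd.append("--disable-token-auth")
    return cmd


cmd = torchserve_start_command(
    "model_store",
    models={"my_model": "my_model.mar"},
    disable_token_auth=True,
)
print(" ".join(cmd))
```

Leaving models empty starts the server with no models loaded, in which case registration happens afterwards through the Management API.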

Step 5: Register and Scale the Model

If not registered at startup, use the Management API to register the model from the model store, set initial workers, and configure batch size and delay. This step allows fine-grained control over resource allocation per model.

What happens:

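  • POST to /models registers a new model; the url parameter names a .mar file in the model store or an HTTP(S) URL to download it from
  • The initial_workers, batch_size, and max_batch_delay parameters set the starting worker count and batching behavior
  • PUT to /models/{name} adjusts the worker count (min_worker, max_worker) for scaling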
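The two Management API calls can be sketched with the standard library alone. The functions below build the requests without sending them; the helper names, the localhost address, the model name my_model, and the Bearer-token header (relevant only while token authorization is enabled) are assumptions for illustration.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

MANAGEMENT_URL = "http://localhost:8081"  # assumed default management address


def register_request(mar, initial_workers=1, batch_size=1,
                     max_batch_delay=100, token=None):
    """Build a POST /models request that registers a model archive."""
    query = urlencode({
        "url": mar,
        "initial_workers": initial_workers,
        "batch_size": batch_size,
        "max_batch_delay": max_batch_delay,  # milliseconds
    })
    req = Request(f"{MANAGEMENT_URL}/models?{query}", method="POST")
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    return req


def scale_request(model_name, min_worker, max_worker, token=None):
    """Build a PUT /models/{name} request that changes the worker count."""
    query = urlencode({"min_worker": min_worker, "max_worker": max_worker})
    req = Request(f"{MANAGEMENT_URL}/models/{quote(model_name)}?{query}",
                  method="PUT")
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    return req


reg = register_request("my_model.mar", batch_size=4, max_batch_delay=50)
scale = scale_request("my_model", min_worker=2, max_worker=4)
print(reg.get_method(), reg.full_url)
print(scale.get_method(), scale.full_url)
```

With a server running, urllib.request.urlopen(reg) sends the registration; keeping construction separate from sending makes the requests easy to inspect or log first.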

Step 6: Run Inference

Send prediction requests to the Inference API endpoint. TorchServe accepts data via REST (POST /predictions/{model_name}) or gRPC (Predictions for unary calls, StreamPredictions for server-side streaming). The frontend routes requests to backend workers, which execute the handler's preprocess-inference-postprocess pipeline and return results.

Key considerations:

  • REST API supports both form-data and raw body payloads
  • gRPC provides both unary and streaming inference
  • Batch inference groups multiple concurrent requests automatically based on batch_size and max_batch_delay settings
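A REST prediction call with a raw body payload might look like the sketch below. It builds the request without sending it; the helper name prediction_request, the localhost address, the model name my_model, the JSON payload, and the Bearer-token header (relevant only while token authorization is enabled) are assumptions for illustration.

```python
from urllib.parse import quote
from urllib.request import Request


def prediction_request(model_name, payload, content_type="application/octet-stream",
                       token=None, base_url="http://localhost:8080"):
    """Build a POST /predictions/{model_name} request with a raw body."""
    req = Request(f"{base_url}/predictions/{quote(model_name)}",
                  data=payload, method="POST")
    req.add_header("Content-Type", content_type)
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    return req


req = prediction_request("my_model", b'{"text": "hello"}',
                         content_type="application/json")
print(req.get_method(), req.full_url)
```

With a server running, urllib.request.urlopen(req).read() returns the handler's postprocessed output for this request. Batching is transparent to the client: each caller still sends and receives a single request and response.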
