Workflow:Pytorch Serve Large Model Inference

Knowledge Sources	TorchServe Large Model Inference Guide, DeepSpeed Inference
Domains	LLMs, Distributed_Computing, Model_Serving
Last Updated	2026-02-13 18:00 GMT

Overview

End-to-end process for serving large PyTorch models that exceed single-GPU memory by partitioning them across multiple GPUs using pipeline parallelism, tensor parallelism, or hybrid strategies.

Description

This workflow covers deploying models too large for a single GPU using TorchServe's distributed inference capabilities. It supports multiple parallelism frameworks: PiPPy (PyTorch-native pipeline parallelism), DeepSpeed (tensor parallelism), HuggingFace Accelerate (automatic device mapping), and PyTorch native tensor parallelism. TorchServe uses torchrun to spawn multiple processes across GPUs and manages CUDA_VISIBLE_DEVICES assignment automatically via round-robin allocation.

Usage

Execute this workflow when you have a transformer-based model that requires more GPU memory than a single device can provide (e.g., 7B+ parameter models), or when you need to achieve higher inference throughput by distributing compute across multiple GPUs.
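
As a rough way to check whether a model falls into this category, the weight footprint can be estimated from the parameter count. The sketch below is a back-of-the-envelope calculation, not part of TorchServe: the function names and the `usable_fraction` headroom factor are assumptions made for illustration, and the estimate covers weights only (activations and KV cache need additional memory).

```python
import math


def weight_memory_gib(num_params, bytes_per_param=2):
    """Memory needed for the weights alone, in GiB (2 bytes per parameter = fp16/bf16)."""
    return num_params * bytes_per_param / 2**30


def min_gpus_for_weights(num_params, gpu_memory_gib, bytes_per_param=2, usable_fraction=0.8):
    """Smallest GPU count whose combined usable memory holds the weights.

    usable_fraction is an assumed headroom factor, not a measured value.
    """
    usable_per_gpu = gpu_memory_gib * usable_fraction
    return math.ceil(weight_memory_gib(num_params, bytes_per_param) / usable_per_gpu)


# A 7B-parameter model in fp16 on 16 GiB GPUs, and a 70B model on 80 GiB GPUs.
print(round(weight_memory_gib(7e9), 1))
print(min_gpus_for_weights(7e9, gpu_memory_gib=16))
print(min_gpus_for_weights(70e9, gpu_memory_gib=80))
```

If the result is greater than 1, the model is a candidate for this workflow.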

Execution Steps

Step 1: Select Parallelism Strategy

Choose the distributed inference framework based on your model architecture and hardware. Each strategy has different trade-offs in terms of supported models, memory efficiency, and throughput.

Key considerations:

PiPPy (pipeline parallel): Splits model into sequential stages across GPUs. Best for encoder-decoder architectures. Uses microbatching for throughput.
DeepSpeed (tensor parallel): Splits individual layers across GPUs. Best for large transformer models. Supports kernel injection for optimized inference.
Accelerate (auto device map): Automatically distributes model layers across available devices. Simplest setup for HuggingFace models.
PyTorch TP (native tensor parallel): Uses PyTorch's built-in tensor parallelism. Best for Llama-family models with checkpoint conversion.
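
The difference between splitting a model into sequential stages and splitting individual layers can be pictured with plain Python. This is a conceptual sketch only: `pipeline_stages` and `tensor_shards` are hypothetical helpers, and none of the frameworks above necessarily partition a model exactly this way.

```python
def pipeline_stages(num_layers, num_gpus):
    """Pipeline parallelism: give each GPU a contiguous block of whole layers."""
    base, extra = divmod(num_layers, num_gpus)
    stages, start = [], 0
    for gpu in range(num_gpus):
        size = base + (1 if gpu < extra else 0)  # spread any remainder over the first GPUs
        stages.append(list(range(start, start + size)))
        start += size
    return stages


def tensor_shards(hidden_size, num_gpus):
    """Tensor parallelism: give each GPU a column slice of every layer's weight matrix."""
    if hidden_size % num_gpus:
        raise ValueError("hidden_size must be divisible by num_gpus in this sketch")
    width = hidden_size // num_gpus
    return [(gpu * width, (gpu + 1) * width) for gpu in range(num_gpus)]


print(pipeline_stages(8, 4))     # layer indices held by each GPU
print(tensor_shards(4096, 4))    # (start, end) column range held by each GPU
```

With pipeline parallelism a request passes through the GPUs one stage after another; with tensor parallelism every GPU works on every layer and the partial results are combined.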

Step 2: Develop the Distributed Handler

Create a custom handler by extending the appropriate base handler class for the chosen framework. The handler initializes the distributed model using the framework's API and processes inference requests. Each framework has a base handler class that manages the distributed environment setup.

What happens:

For PiPPy: extend BasePippyHandler and call get_pipeline_driver to create the pipeline
For DeepSpeed: extend BaseDeepSpeedHandler and call get_ds_engine to initialize the inference engine
For Accelerate: use a standard handler and load the model with device_map="auto" and low_cpu_mem_usage=True
For PyTorch TP: implement a custom handler using PyTorch's distributed tensor parallelism primitives
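
A minimal sketch of the Accelerate variant is shown below. It assumes a HuggingFace causal language model whose files sit in the model directory, and it uses a plain class exposing `initialize` and `handle` rather than a TorchServe base class so that the heavy imports can be deferred. The class name, the helper `accelerate_load_kwargs`, and the `max_new_tokens` value are illustrative assumptions, not names taken from TorchServe.

```python
def accelerate_load_kwargs():
    """Loading options named in this step for the Accelerate strategy."""
    return {"device_map": "auto", "low_cpu_mem_usage": True}


class AccelerateTextHandler:
    """Hypothetical handler sketch: loads a causal LM across the visible GPUs."""

    def __init__(self):
        self.initialized = False
        self.tokenizer = None
        self.model = None

    def initialize(self, context):
        # Imported lazily so this module can be imported without transformers installed.
        from transformers import AutoModelForCausalLM, AutoTokenizer

        model_dir = context.system_properties.get("model_dir")
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
        self.model = AutoModelForCausalLM.from_pretrained(model_dir, **accelerate_load_kwargs())
        self.model.eval()
        self.initialized = True

    def handle(self, data, context):
        import torch

        outputs = []
        for row in data:  # one entry per request in the batch
            payload = row.get("data") or row.get("body")
            if isinstance(payload, (bytes, bytearray)):
                payload = payload.decode("utf-8")
            inputs = self.tokenizer(payload, return_tensors="pt").to(self.model.device)
            with torch.no_grad():
                generated = self.model.generate(**inputs, max_new_tokens=50)  # assumed limit
            outputs.append(self.tokenizer.decode(generated[0], skip_special_tokens=True))
        return outputs  # one response per request
```

The PiPPy and DeepSpeed variants follow the same shape but build the model through their respective helper functions instead of `from_pretrained` alone.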

Step 3: Configure Model Parallelism

Create a model-config.yaml that specifies the parallelism type, number of GPUs per worker, torchrun settings, and framework-specific parameters. This configuration tells TorchServe how to spawn and manage distributed workers.

Key considerations:

Set parallelType to pp (pipeline parallel), tp (tensor parallel), pptp (both combined), or custom, depending on the strategy
Set nproc-per-node for torchrun-based frameworks (PiPPy, DeepSpeed) OR parallelLevel for custom parallelism (vLLM, Accelerate)
Configure minWorkers and maxWorkers to 1 since each worker uses multiple GPUs
Adjust responseTimeout and startupTimeout for large model loading times
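
The considerations above might translate into a model-config.yaml like the one generated below for a DeepSpeed tensor-parallel deployment on four GPUs. This is a sketch: the timeout values, the GPU count, and the ds-config.json file name are assumptions, and `render_yaml` is a tiny hypothetical helper that handles only nested dicts of strings and integers.

```python
def render_yaml(cfg, indent=0):
    """Render a nested dict of str/int values as YAML lines (tiny subset, no lists)."""
    lines = []
    pad = " " * indent
    for key, value in cfg.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(render_yaml(value, indent + 4))
        elif isinstance(value, str):
            lines.append(f'{pad}{key}: "{value}"')
        else:
            lines.append(f"{pad}{key}: {value}")
    return lines


model_config = {
    "minWorkers": 1,
    "maxWorkers": 1,
    "responseTimeout": 1200,   # seconds; assumed value, tune for your model
    "startupTimeout": 1200,    # seconds; assumed value, tune for checkpoint load time
    "parallelType": "tp",
    "deviceType": "gpu",
    "torchrun": {"nproc-per-node": 4},           # one process per GPU
    "deepspeed": {"config": "ds-config.json"},   # assumed file name
}

CONFIG_TEXT = "\n".join(render_yaml(model_config)) + "\n"
print(CONFIG_TEXT)
```

Writing `CONFIG_TEXT` to model-config.yaml gives a file that can be passed to the archiver in the next step.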

Step 4: Create Model Archive

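Package the handler, model checkpoints, configuration files, and dependencies into a model archive. For large models, pass --archive-format tgz to produce a .tar.gz archive instead of the default .mar; if compression time is a concern, the archiver's no-archive format skips archiving altogether. Include requirements.txt, model-config.yaml, and, when using DeepSpeed, the DeepSpeed config JSON.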

Pseudocode:

torch-model-archiver \
  --model-name <name> \
  --version 1.0 \
  --handler <distributed_handler.py> \
  --extra-files <checkpoints>,<config_files> \
  -r requirements.txt \
  --config-file model-config.yaml \
  --archive-format tgz
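
When the archive is built from a script, the same command can be assembled programmatically. The sketch below only uses the flags shown in the pseudocode; the model name, handler file, and checkpoint paths are placeholders, and `archiver_command` and `build_archive` are hypothetical helper names.

```python
import shlex
import subprocess


def archiver_command(model_name, handler, extra_files,
                     config_file="model-config.yaml",
                     requirements="requirements.txt",
                     version="1.0",
                     archive_format="tgz"):
    """Build the torch-model-archiver argument list from the step's inputs."""
    return [
        "torch-model-archiver",
        "--model-name", model_name,
        "--version", version,
        "--handler", handler,
        "--extra-files", ",".join(extra_files),  # comma-separated, as in the pseudocode
        "-r", requirements,
        "--config-file", config_file,
        "--archive-format", archive_format,
    ]


def build_archive(command):
    """Run the archiver; raises CalledProcessError if it exits non-zero."""
    subprocess.run(command, check=True)


# Placeholder names for illustration; printing lets you inspect the command first.
cmd = archiver_command("my_llm", "distributed_handler.py", ["checkpoints", "ds-config.json"])
print(shlex.join(cmd))
```

Calling `build_archive(cmd)` requires torch-model-archiver to be installed and the referenced files to exist.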

Step 5: Launch TorchServe and Serve

Start TorchServe with the model archive. TorchServe automatically manages GPU assignment using round-robin allocation based on the configured parallelism level. For torchrun-based strategies, TorchServe spawns multiple processes per worker, each assigned to a different GPU.

Key considerations:

Pre-install model parallel libraries (DeepSpeed, etc.) on the host to reduce startup latency
Pre-download model checkpoints to local storage or HuggingFace cache
Use streaming responses for auto-regressive generation to reduce perceived latency
Enable the job ticket feature for latency-sensitive applications so that a request is accepted only when a worker is free, instead of waiting in a queue
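
The round-robin GPU assignment described in this step can be pictured as follows. This is an illustrative model of the behaviour, not TorchServe's actual implementation; `assign_gpus` is a hypothetical function.

```python
def assign_gpus(num_gpus, gpus_per_worker, num_workers):
    """Return the CUDA_VISIBLE_DEVICES string each worker would receive."""
    if gpus_per_worker > num_gpus:
        raise ValueError("a worker cannot use more GPUs than the host has")
    assignments = []
    next_gpu = 0
    for _ in range(num_workers):
        # Take the next block of GPU ids, wrapping around the end of the list.
        ids = [(next_gpu + i) % num_gpus for i in range(gpus_per_worker)]
        assignments.append(",".join(str(i) for i in ids))
        next_gpu = (next_gpu + gpus_per_worker) % num_gpus
    return assignments


# Eight GPUs, four processes per worker, two workers.
for worker, devices in enumerate(assign_gpus(8, 4, 2)):
    print(f"worker {worker}: CUDA_VISIBLE_DEVICES={devices}")
```

Each process spawned for a worker then sees only that worker's devices.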

Step 6: Run Distributed Inference

Send inference requests to the standard TorchServe endpoint. The frontend routes requests to the rank-0 process (for pipeline parallel) or broadcasts to all ranks (for tensor parallel). Results are collected and returned through the standard inference API.

Key considerations:

For pipeline parallel, only rank 0 receives and returns data
For tensor parallel, all ranks participate in every inference step
Streaming response is supported via HTTP chunked encoding and gRPC server-side streaming
Batch inference works but optimal batch sizes differ from single-GPU serving
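
A client for the HTTP endpoint might look like the sketch below, which uses only the standard library and reads the response incrementally so that streamed chunks can be shown as they arrive. It assumes the default inference port 8080 and a model registered under the placeholder name my_llm; the function names are hypothetical, and the request body format depends on what your handler expects.

```python
import codecs
import urllib.request


def prediction_url(host, port, model_name):
    """Standard TorchServe inference endpoint for a registered model."""
    return f"http://{host}:{port}/predictions/{model_name}"


def stream_prediction(url, prompt, timeout=600):
    """POST the prompt and yield decoded text as response chunks arrive."""
    request = urllib.request.Request(url, data=prompt.encode("utf-8"), method="POST")
    decoder = codecs.getincrementaldecoder("utf-8")()  # tolerates characters split across chunks
    with urllib.request.urlopen(request, timeout=timeout) as response:
        while True:
            chunk = response.read1(4096)
            if not chunk:
                break
            yield decoder.decode(chunk)


url = prediction_url("localhost", 8080, "my_llm")  # placeholder model name
print(url)
# With a running server:
# for text in stream_prediction(url, "Hello, my name is"):
#     print(text, end="", flush=True)
```

The generous timeout reflects that large models can take a long time to produce the first token.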
