Workflow:Pytorch Serve Large Model Inference
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Distributed_Computing, Model_Serving |
| Last Updated | 2026-02-13 18:00 GMT |
Overview
End-to-end process for serving large PyTorch models that exceed single-GPU memory by partitioning them across multiple GPUs using pipeline parallelism, tensor parallelism, or hybrid strategies.
Description
This workflow covers deploying models too large for a single GPU using TorchServe's distributed inference capabilities. It supports multiple parallelism frameworks: PiPPy (PyTorch-native pipeline parallelism), DeepSpeed (tensor parallelism), HuggingFace Accelerate (automatic device mapping), and PyTorch native tensor parallelism. TorchServe uses torchrun to spawn multiple processes across GPUs and manages CUDA_VISIBLE_DEVICES assignment automatically via round-robin allocation.
Usage
Execute this workflow when you have a transformer-based model that requires more GPU memory than a single device can provide (e.g., 7B+ parameter models), or when you need to achieve higher inference throughput by distributing compute across multiple GPUs.
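As a rough check on whether a model needs this workflow, the weight footprint can be estimated from parameter count and precision. The sketch below is illustrative only: the helper names and the 0.8 headroom factor (memory left over for activations and KV cache) are assumptions, not TorchServe settings.

```python
# Bytes needed to store one parameter at common inference precisions.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str = "float16") -> float:
    """Return the memory needed for the model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


def fits_on_one_gpu(num_params: float, gpu_memory_gib: float,
                    dtype: str = "float16", headroom: float = 0.8) -> bool:
    """Check whether the weights fit in a fraction (headroom) of GPU memory.

    The headroom factor is an assumed safety margin, not a measured value.
    """
    return weight_memory_gib(num_params, dtype) <= gpu_memory_gib * headroom


# A 7B-parameter model in float16 needs roughly 13 GiB for weights alone.
seven_b_fp16 = weight_memory_gib(7e9, "float16")
```

Under these assumptions a 7B model in float16 does not fit a 16 GiB GPU once headroom is reserved, which is the situation this workflow targets.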
Execution Steps
Step 1: Select Parallelism Strategy
Choose the distributed inference framework based on your model architecture and hardware. The strategies differ in which models they support, how efficiently they use memory, and how much throughput they deliver.
Key considerations:
- PiPPy (pipeline parallel): Splits model into sequential stages across GPUs. Best for encoder-decoder architectures. Uses microbatching for throughput.
- DeepSpeed (tensor parallel): Splits individual layers across GPUs. Best for large transformer models. Supports kernel injection for optimized inference.
- Accelerate (auto device map): Automatically distributes model layers across available devices. Simplest setup for HuggingFace models.
- PyTorch TP (native tensor parallel): Uses PyTorch's built-in tensor parallelism. Best for Llama-family models with checkpoint conversion.
Step 2: Develop the Distributed Handler
Create a custom handler for the chosen framework. The handler initializes the distributed model using the framework's API and processes inference requests. PiPPy and DeepSpeed each provide a base handler class that manages the distributed environment setup; Accelerate and PyTorch TP rely on a standard or fully custom handler.
What happens:
- For PiPPy: extend BasePippyHandler, call get_pipeline_driver to create the pipeline
- For DeepSpeed: extend BaseDeepSpeedHandler, call get_ds_engine to initialize the inference engine
- For Accelerate: use standard handler with device_map="auto" and low_cpu_mem_usage=True
- For PyTorch TP: implement custom handler using PyTorch's distributed tensor parallelism primitives
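The handler lifecycle shared by all four options can be sketched without any framework dependency. This is a stand-in, not a working TorchServe handler: the class name and the _load_model stub are hypothetical, and in a real handler the class would extend BasePippyHandler or BaseDeepSpeedHandler and the stub would be replaced by the framework call listed above.

```python
from types import SimpleNamespace


class SketchDistributedHandler:
    """Dependency-free stand-in showing the handler lifecycle methods."""

    def __init__(self):
        self.model = None
        self.initialized = False

    def initialize(self, ctx):
        # TorchServe passes a context object; model_dir is the unpacked archive.
        model_dir = ctx.system_properties.get("model_dir")
        self.model = self._load_model(model_dir)
        self.initialized = True

    def _load_model(self, model_dir):
        # Hypothetical stub: a real handler would build the distributed model
        # here (pipeline driver, DeepSpeed engine, or device-mapped model).
        return lambda texts: [text.upper() for text in texts]

    def preprocess(self, requests):
        # Each request is a dict carrying its payload under "data" or "body".
        texts = []
        for request in requests:
            payload = request.get("data") or request.get("body")
            if isinstance(payload, (bytes, bytearray)):
                payload = payload.decode("utf-8")
            texts.append(payload)
        return texts

    def inference(self, inputs):
        return self.model(inputs)

    def postprocess(self, outputs):
        # Return exactly one response entry per request in the batch.
        return list(outputs)

    def handle(self, data, ctx):
        return self.postprocess(self.inference(self.preprocess(data)))


# Simulate what the server does: initialize once, then handle a batch.
ctx = SimpleNamespace(system_properties={"model_dir": "/tmp/model"})
handler = SketchDistributedHandler()
handler.initialize(ctx)
responses = handler.handle([{"data": b"hello"}, {"body": "world"}], ctx)
```

The point of the sketch is the contract: initialize runs once per worker process, and handle must return a list with one entry per incoming request.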
Step 3: Configure Model Parallelism
Create a model-config.yaml that specifies the parallelism type, number of GPUs per worker, torchrun settings, and framework-specific parameters. This configuration tells TorchServe how to spawn and manage distributed workers.
Key considerations:
- Set parallelType to pp, tp, pptp, or custom depending on the strategy
- Set nproc-per-node for torchrun-based frameworks (PiPPy, DeepSpeed) OR parallelLevel for custom parallelism (vLLM, Accelerate)
- Set minWorkers and maxWorkers to 1, since each worker already spans multiple GPUs
- Adjust responseTimeout and startupTimeout for large model loading times
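The considerations above can be turned into a small generator for model-config.yaml. The sketch below is a hedged illustration: the function name, the use_torchrun flag, and the timeout values are placeholders chosen for the example, and framework-specific sections (for example PiPPy or DeepSpeed parameters) are left out.

```python
VALID_PARALLEL_TYPES = {"pp", "tp", "pptp", "custom"}


def build_model_config(parallel_type: str, gpus_per_worker: int,
                       use_torchrun: bool,
                       response_timeout: int = 1200,
                       startup_timeout: int = 1200) -> str:
    """Render a minimal model-config.yaml as text.

    Timeout defaults are placeholders; tune them to the model's load time.
    """
    if parallel_type not in VALID_PARALLEL_TYPES:
        raise ValueError(f"unknown parallelType: {parallel_type!r}")
    if gpus_per_worker < 1:
        raise ValueError("gpus_per_worker must be at least 1")

    lines = [
        "minWorkers: 1",  # one worker, because it already spans several GPUs
        "maxWorkers: 1",
        f"responseTimeout: {response_timeout}",
        f"startupTimeout: {startup_timeout}",
        f'parallelType: "{parallel_type}"',
    ]
    if use_torchrun:
        # torchrun-based frameworks: one process per GPU.
        lines += ["torchrun:", f"    nproc-per-node: {gpus_per_worker}"]
    else:
        # Custom parallelism: the worker process manages its GPUs itself.
        lines.append(f"parallelLevel: {gpus_per_worker}")
    return "\n".join(lines) + "\n"


pippy_config = build_model_config("pp", 4, use_torchrun=True)
custom_config = build_model_config("custom", 2, use_torchrun=False)
```

Keeping nproc-per-node and parallelLevel mutually exclusive in the generator mirrors the either/or choice described in the list.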
Step 4: Create Model Archive
Package the handler, model checkpoints, configuration files, and dependencies into a model archive. For large models, the command below uses the tgz archive format, which produces a .tar.gz file; if compression time is a concern, the no-archive format skips archiving altogether. Include model-config.yaml, requirements.txt, and, when using DeepSpeed, the DeepSpeed config JSON.
Pseudocode:
torch-model-archiver \
    --model-name <name> \
    --version 1.0 \
    --handler <distributed_handler.py> \
    --extra-files <checkpoints>,<config_files> \
    -r requirements.txt \
    --config-file model-config.yaml \
    --archive-format tgz
Step 5: Launch TorchServe and Serve
Start TorchServe with the model archive. TorchServe automatically manages GPU assignment using round-robin allocation based on the configured parallelism level. For torchrun-based strategies, TorchServe spawns multiple processes per worker, each assigned to a different GPU.
Key considerations:
- Pre-install model parallel libraries (DeepSpeed, etc.) on the host to reduce startup latency
- Pre-download model checkpoints to local storage or HuggingFace cache
- Use streaming response for auto-regressive generation to reduce perceived latency
- Enable the job ticket feature for latency-sensitive applications to avoid queuing
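To make the round-robin GPU assignment concrete, here is a minimal sketch of how a CUDA_VISIBLE_DEVICES value could be derived for each worker. It illustrates the idea only and is not TorchServe's implementation; the function name and the choice of contiguous, wrapping GPU blocks are assumptions.

```python
def assign_gpus(worker_index: int, gpus_per_worker: int, total_gpus: int) -> str:
    """Return a CUDA_VISIBLE_DEVICES string for one worker.

    Workers take consecutive blocks of GPU ids and wrap around once the
    host's GPUs are exhausted (assumed allocation scheme).
    """
    if gpus_per_worker > total_gpus:
        raise ValueError("a worker cannot use more GPUs than the host has")
    start = (worker_index * gpus_per_worker) % total_gpus
    gpu_ids = [(start + offset) % total_gpus for offset in range(gpus_per_worker)]
    return ",".join(str(gpu_id) for gpu_id in gpu_ids)


# Two 4-GPU workers on an 8-GPU host get disjoint device sets.
first_worker = assign_gpus(0, 4, 8)
second_worker = assign_gpus(1, 4, 8)
```

With 8 GPUs and 4 GPUs per worker, a third worker would wrap around and share devices with the first, which is one reason to keep the worker count at 1 for large models.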
Step 6: Run Distributed Inference
Send inference requests to the standard TorchServe endpoint. The frontend routes requests to the rank-0 process (for pipeline parallel) or broadcasts to all ranks (for tensor parallel). Results are collected and returned through the standard inference API.
Key considerations:
- For pipeline parallel, only rank 0 receives and returns data
- For tensor parallel, all ranks participate in every inference step
- Streaming response is supported via HTTP chunked encoding and gRPC server-side streaming
- Batch inference works but optimal batch sizes differ from single-GPU serving
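A client for the streaming case can be sketched with the standard library alone. The example assumes TorchServe's default inference address (localhost:8080) and the /predictions/<model_name> route; the function names and the plain-text prompt payload are illustrative, since the actual request format depends on the handler.

```python
import urllib.request
from typing import Iterator


def prediction_url(model_name: str, host: str = "localhost", port: int = 8080) -> str:
    """Build the inference endpoint URL for a registered model."""
    return f"http://{host}:{port}/predictions/{model_name}"


def stream_prediction(model_name: str, prompt: str,
                      timeout: float = 120.0) -> Iterator[bytes]:
    """POST a prompt and yield response bytes as they arrive.

    With HTTP chunked encoding, each read returns whatever the server has
    sent so far instead of waiting for the full response body.
    """
    request = urllib.request.Request(
        prediction_url(model_name),
        data=prompt.encode("utf-8"),
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        while True:
            chunk = response.read1(4096)
            if not chunk:  # empty bytes signals the end of the stream
                break
            yield chunk
```

A caller would iterate over stream_prediction("my_model", "Hello") and write each chunk to the output as it arrives; "my_model" is a placeholder for whatever name was used when creating the archive.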