Implementation:Pytorch Serve Parallelism Model Config
| Field | Value |
|---|---|
| Page Type | Implementation (Pattern Doc) |
| Title | Parallelism Model Config |
| Implements | Principle:Pytorch_Serve_Distributed_Configuration |
| Source | examples/large_models/Huggingface_pippy/model-config.yaml, examples/large_models/tp_llama/model-config.yaml |
| Repository | TorchServe |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
The model-config.yaml files for large model inference define the complete deployment specification for distributed serving in TorchServe. This document presents the actual YAML configuration patterns used for pipeline parallelism (PiPPy) and tensor parallelism (PyTorch TP / DeepSpeed), including all available parameters, their types, and their effects on the serving infrastructure.
Description
Two representative configuration patterns are documented here:
1. PiPPy Pipeline Parallelism Config (examples/large_models/Huggingface_pippy/model-config.yaml): Configures pipeline parallelism with parallelType: "pp", specifying RPC settings, microbatching, and FX tracing parameters under the pippy: section.
2. PyTorch TP / Tensor Parallelism Config (examples/large_models/tp_llama/model-config.yaml): Configures tensor parallelism with parallelType: "tp", using a simpler configuration that focuses on handler-level parameters for model loading and generation.
A DeepSpeed tensor-parallel variant is also shown below for comparison. All three configs share common frontend settings (workers, batching, timeouts, device type, torchrun) but differ in their strategy-specific sections.
Usage
Code Reference
Source Location:
- examples/large_models/Huggingface_pippy/model-config.yaml (lines 1-26)
- examples/large_models/tp_llama/model-config.yaml (lines 1-20)
PiPPy Pipeline Parallelism Config
#frontend settings
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 200
responseTimeout: 300
parallelType: "pp"
deviceType: "gpu"
torchrun:
  nproc-per-node: 4

#backend settings
pippy:
  rpc_timeout: 1800
  model_type: "HF"
  chunks: 1
  input_names: ["input_ids"]
  num_worker_threads: 128

handler:
  model_path: "/path/to/model/checkpoints"
  index_filename: 'pytorch_model.bin.index.json'
  max_length: 50
  max_new_tokens: 60
  manual_seed: 40
  dtype: fp16
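Before packaging, it can be worth sanity-checking a config like the one above. The following is a minimal sketch, assuming PyYAML is installed; the required-key list and the pp-specific checks are illustrative assumptions, not TorchServe's own validation rules:

import yaml  # pip install pyyaml

# The required keys below are an illustrative assumption; the TorchServe
# frontend applies defaults for most omitted fields.
REQUIRED_KEYS = {"minWorkers", "maxWorkers", "parallelType", "deviceType"}

with open("model-config.yaml") as f:
    cfg = yaml.safe_load(f)

missing = REQUIRED_KEYS - cfg.keys()
if missing:
    raise ValueError(f"model-config.yaml missing keys: {sorted(missing)}")

# Pipeline parallelism needs a pippy: section and multiple torchrun ranks.
if cfg["parallelType"] == "pp":
    assert "pippy" in cfg, "parallelType 'pp' expects a pippy: section"
    assert cfg["torchrun"]["nproc-per-node"] >= 2, "pp needs at least 2 ranks"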
Tensor Parallelism Config (Llama)
#frontend settings
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 200
responseTimeout: 300
parallelType: "tp"
deviceType: "gpu"
torchrun:
  nproc-per-node: 1

handler:
  converted_ckpt_dir: "converted_checkpoints"
  tokenizer_path: "tokenizer.model"
  model_args_path: "model_args.json"
  max_new_tokens: 50
  temperature: 0.6
  top_p: 0.9
  manual_seed: 40
  mode: "chat"  # choices are text_completion, chat
DeepSpeed Tensor Parallelism Config
#frontend settings
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
responseTimeout: 120
parallelType: "tp"
deviceType: "gpu"
torchrun:
  nproc-per-node: 4

#backend settings
deepspeed:
  config: ds-config.json

handler:
  model_name: "facebook/opt-30b"
  model_path: "/path/to/model/checkpoints"
  max_length: 80
  max_new_tokens: 50
  manual_seed: 40
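deepspeed.config names a DeepSpeed inference config JSON packaged next to this file. Below is a minimal sketch that writes such a file from Python; the keys are assumptions modeled on common DeepSpeed inference configs (exact fields vary by DeepSpeed version, so check the DeepSpeed documentation before relying on them):

import json

# Assumed keys: dtype and the tensor-parallel degree. The tp_size value
# should match torchrun nproc-per-node in model-config.yaml (4 here).
ds_config = {
    "dtype": "fp16",
    "replace_with_kernel_inject": True,
    "tensor_parallel": {"tp_size": 4},
}

with open("ds-config.json", "w") as f:
    json.dump(ds_config, f, indent=2)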
Key Parameters Reference
| Section | Parameter | Type | Description |
|---|---|---|---|
| Frontend | minWorkers | int | Minimum number of workers. Typically 1 for large models. |
| Frontend | maxWorkers | int | Maximum number of workers. Typically 1 for large models. |
| Frontend | maxBatchDelay | int | Max milliseconds to wait for a full batch (100-200 typical). |
| Frontend | responseTimeout | int | Max seconds to wait for a worker response (120-300 typical). |
| Frontend | parallelType | str | "pp", "tp", "pptp", or "custom". |
| Frontend | deviceType | str | "gpu" for GPU inference. |
| Torchrun | nproc-per-node | int | Processes per worker (equals number of GPUs). |
| Torchrun | OMP_NUMBER_THREADS | int | OpenMP threads per process (default 1). |
| PiPPy | pippy.rpc_timeout | int | RPC call timeout in seconds. |
| PiPPy | pippy.model_type | str | "HF" for HuggingFace models. |
| PiPPy | pippy.chunks | int | Number of microbatches (microbatch size = batch size / chunks). |
| PiPPy | pippy.input_names | list[str] | Input argument names for FX tracing. |
| PiPPy | pippy.num_worker_threads | int | RPC worker thread count. |
| DeepSpeed | deepspeed.config | str | DeepSpeed config JSON filename. |
| DeepSpeed | deepspeed.checkpoint | str | Checkpoint index filename (optional). |
| Handler | handler.model_path | str | Path to model checkpoints. |
| Handler | handler.max_length | int | Maximum input token length. |
| Handler | handler.max_new_tokens | int | Maximum number of tokens to generate. |
| Handler | handler.manual_seed | int | Random seed for reproducibility. |
| Handler | handler.dtype | str | Data type: "fp16", "fp32", "bf16". |
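The pippy.chunks row is worth making concrete: the frontend assembles a batch (bounded by maxBatchDelay and the configured batch size), and PiPPy splits that batch into chunks microbatches that flow through the pipeline stages. A quick worked check with illustrative numbers (not taken from the configs above):

# Illustrative values only.
batch_size = 8   # batch assembled by the frontend
chunks = 2       # pippy.chunks
microbatch_size = batch_size // chunks  # 4 samples per pipeline microbatch
assert microbatch_size == 4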
I/O Contract
Input: The model-config.yaml file is passed to torch-model-archiver via --config-file and is bundled into the MAR archive.
Output: At model load time, TorchServe parses the YAML and:
- Creates the specified number of workers.
- Launches each worker with torchrun (for pp/tp/pptp) or as a single process (for custom).
- Sets CUDA_VISIBLE_DEVICES based on GPU allocation.
- Passes the parsed YAML to the handler via ctx.model_yaml_config.
Handler access pattern:
def initialize(self, ctx):
    # Access frontend settings (informational)
    response_timeout = ctx.model_yaml_config.get("responseTimeout", 120)
    # Access strategy-specific settings
    model_type = ctx.model_yaml_config["pippy"]["model_type"]
    rpc_timeout = ctx.model_yaml_config["pippy"]["rpc_timeout"]
    # Access handler settings
    model_path = ctx.model_yaml_config["handler"]["model_path"]
    max_length = ctx.model_yaml_config["handler"]["max_length"]
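Because pp/tp/pptp workers are launched through torchrun, the standard torchrun environment variables (RANK, LOCAL_RANK, WORLD_SIZE) are also available inside initialize and are typically combined with the YAML settings. A minimal sketch:

import os

import torch

def initialize(self, ctx):
    # torchrun exports these for every spawned process.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    # Bind this rank to its GPU within the devices the frontend assigned
    # via CUDA_VISIBLE_DEVICES.
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank % torch.cuda.device_count())
    # world_size should agree with torchrun.nproc-per-node in the YAML.
    assert world_size == ctx.model_yaml_config["torchrun"]["nproc-per-node"]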
Usage Examples
Packaging with PiPPy config:
torch-model-archiver --model-name bloom \
--version 1.0 \
--handler pippy_handler.py \
--extra-files /path/to/checkpoints \
-r requirements.txt \
--config-file model-config.yaml \
--archive-format tgz
Starting TorchServe with increased startup timeout:
# In config.properties
model_store=/path/to/model_store
load_models=bloom.tar.gz
startup_timeout=600
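Alternatively, instead of load_models at startup, the archive can be registered at runtime through the management API (port 8081 by default). A minimal sketch using requests, assuming the archive is already in the model store:

import requests

# Register bloom.tar.gz from the model store and start one worker.
resp = requests.post(
    "http://localhost:8081/models",
    params={"url": "bloom.tar.gz", "initial_workers": 1, "synchronous": "true"},
    timeout=600,  # large models can take minutes to load
)
print(resp.status_code, resp.text)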
Related Pages
- Principle:Pytorch_Serve_Distributed_Configuration - Theory of declarative distributed configuration
- Pytorch_Serve_ParallelType_Config - ParallelType field options
- Pytorch_Serve_BasePippyHandler - PiPPy handler that reads pippy config
- Pytorch_Serve_BaseDeepSpeedHandler - DeepSpeed handler that reads deepspeed config
- Pytorch_Serve_TorchModelServiceWorker - Worker spawning based on torchrun config
- Environment:Pytorch_Serve_Distributed_Training_Environment - Distributed env vars for parallelism
- Environment:Pytorch_Serve_DeepSpeed_Environment - DeepSpeed env (when using DeepSpeed parallelism)
- Configuration
- Distributed_Computing