Implementation:Pytorch Serve Parallelism Model Config
| Field | Value |
|---|---|
| Page Type | Implementation (Pattern Doc) |
| Title | Parallelism Model Config |
| Implements | Principle:Pytorch_Serve_Distributed_Configuration |
| Source | examples/large_models/Huggingface_pippy/model-config.yaml, examples/large_models/tp_llama/model-config.yaml |
| Repository | TorchServe |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
The model-config.yaml files for large model inference define the complete deployment specification for distributed serving in TorchServe. This document presents the actual YAML configuration patterns used for pipeline parallelism (PiPPy) and tensor parallelism (PyTorch TP / DeepSpeed), including all available parameters, their types, and their effects on the serving infrastructure.
Description
Two representative configuration patterns are documented here:
1. PiPPy Pipeline Parallelism Config (examples/large_models/Huggingface_pippy/model-config.yaml): Configures pipeline parallelism with parallelType: "pp", specifying RPC settings, microbatching, and FX tracing parameters under the pippy: section.
2. PyTorch TP / Tensor Parallelism Config (examples/large_models/tp_llama/model-config.yaml): Configures tensor parallelism with parallelType: "tp", using a simpler configuration that focuses on handler-level parameters for model loading and generation.
A DeepSpeed tensor-parallel variant is also shown below for comparison. All three configs share common frontend settings (workers, batching, timeouts, device type, torchrun) but differ in their strategy-specific sections.
Usage
Code Reference
Source Location:
- examples/large_models/Huggingface_pippy/model-config.yaml (lines 1-26)
- examples/large_models/tp_llama/model-config.yaml (lines 1-20)
PiPPy Pipeline Parallelism Config
#frontend settings
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 200
responseTimeout: 300
parallelType: "pp"
deviceType: "gpu"
torchrun:
  nproc-per-node: 4

#backend settings
pippy:
  rpc_timeout: 1800
  model_type: "HF"
  chunks: 1
  input_names: ["input_ids"]
  num_worker_threads: 128

handler:
  model_path: "/path/to/model/checkpoints"
  index_filename: 'pytorch_model.bin.index.json'
  max_length: 50
  max_new_tokens: 60
  manual_seed: 40
  dtype: fp16
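Before packaging, it can be worth sanity-checking a config like the one above. The following is a minimal sketch, assuming PyYAML is installed; the required-key list and the pp-specific checks are illustrative assumptions, not TorchServe's own validation rules:

import yaml  # pip install pyyaml

# The required keys below are an illustrative assumption; the TorchServe
# frontend applies defaults for most omitted fields.
REQUIRED_KEYS = {"minWorkers", "maxWorkers", "parallelType", "deviceType"}

with open("model-config.yaml") as f:
    cfg = yaml.safe_load(f)

missing = REQUIRED_KEYS - cfg.keys()
if missing:
    raise ValueError(f"model-config.yaml missing keys: {sorted(missing)}")

# Pipeline parallelism needs a pippy: section and multiple torchrun ranks.
if cfg["parallelType"] == "pp":
    assert "pippy" in cfg, "parallelType 'pp' expects a pippy: section"
    assert cfg["torchrun"]["nproc-per-node"] >= 2, "pp needs at least 2 ranks"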
Tensor Parallelism Config (Llama)
#frontend settings
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 200
responseTimeout: 300
parallelType: "tp"
deviceType: "gpu"
torchrun:
  nproc-per-node: 1

handler:
  converted_ckpt_dir: "converted_checkpoints"
  tokenizer_path: "tokenizer.model"
  model_args_path: "model_args.json"
  max_new_tokens: 50
  temperature: 0.6
  top_p: 0.9
  manual_seed: 40
  mode: "chat"  # choices are text_completion, chat
DeepSpeed Tensor Parallelism Config
#frontend settings
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
responseTimeout: 120
parallelType: "tp"
deviceType: "gpu"
torchrun:
  nproc-per-node: 4

#backend settings
deepspeed:
  config: ds-config.json

handler:
  model_name: "facebook/opt-30b"
  model_path: "/path/to/model/checkpoints"
  max_length: 80
  max_new_tokens: 50
  manual_seed: 40
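deepspeed.config names a DeepSpeed inference config JSON packaged next to this file. Below is a minimal sketch that writes such a file from Python; the keys are assumptions modeled on common DeepSpeed inference configs (exact fields vary by DeepSpeed version, so check the DeepSpeed documentation before relying on them):

import json

# Assumed keys: dtype and the tensor-parallel degree. The tp_size value
# should match torchrun nproc-per-node in model-config.yaml (4 here).
ds_config = {
    "dtype": "fp16",
    "replace_with_kernel_inject": True,
    "tensor_parallel": {"tp_size": 4},
}

with open("ds-config.json", "w") as f:
    json.dump(ds_config, f, indent=2)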
Key Parameters Reference
| Section | Parameter | Type | Description |
|---|---|---|---|
| Frontend | minWorkers | int | Minimum number of workers. Typically 1 for large models. |
| Frontend | maxWorkers | int | Maximum number of workers. Typically 1 for large models. |
| Frontend | maxBatchDelay | int | Max milliseconds to wait for a full batch (100-200 typical). |
| Frontend | responseTimeout | int | Max seconds to wait for a worker response (120-300 typical). |
| Frontend | parallelType | str | "pp", "tp", "pptp", or "custom". |
| Frontend | deviceType | str | "gpu" for GPU inference. |
| Torchrun | nproc-per-node | int | Processes per worker (equals number of GPUs). |
| Torchrun | OMP_NUMBER_THREADS | int | OpenMP threads per process (default 1). |
| PiPPy | pippy.rpc_timeout | int | RPC call timeout in seconds. |
| PiPPy | pippy.model_type | str | "HF" for HuggingFace models. |
| PiPPy | pippy.chunks | int | Number of microbatches (microbatch size = batch size / chunks). |
| PiPPy | pippy.input_names | list[str] | Input argument names for FX tracing. |
| PiPPy | pippy.num_worker_threads | int | RPC worker thread count. |
| DeepSpeed | deepspeed.config | str | DeepSpeed config JSON filename. |
| DeepSpeed | deepspeed.checkpoint | str | Checkpoint index filename (optional). |
| Handler | handler.model_path | str | Path to model checkpoints. |
| Handler | handler.max_length | int | Maximum input token length. |
| Handler | handler.max_new_tokens | int | Maximum number of tokens to generate. |
| Handler | handler.manual_seed | int | Random seed for reproducibility. |
| Handler | handler.dtype | str | Data type: "fp16", "fp32", "bf16". |
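The pippy.chunks row is worth making concrete: the frontend assembles a batch (bounded by maxBatchDelay and the configured batch size), and PiPPy splits that batch into chunks microbatches that flow through the pipeline stages. A quick worked check with illustrative numbers (not taken from the configs above):

# Illustrative values only.
batch_size = 8   # batch assembled by the frontend
chunks = 2       # pippy.chunks
microbatch_size = batch_size // chunks  # 4 samples per pipeline microbatch
assert microbatch_size == 4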
I/O Contract
Input: The model-config.yaml file is passed to torch-model-archiver via --config-file and is bundled into the MAR archive.
Output: At model load time, TorchServe parses the YAML and:
- Creates the specified number of workers.
- Launches each worker with torchrun (for pp/tp/pptp) or as a single process (for custom).
- Sets CUDA_VISIBLE_DEVICES based on GPU allocation.
- Passes the parsed YAML to the handler via ctx.model_yaml_config.
Handler access pattern:
def initialize(self, ctx):
    # Access frontend settings (informational)
    response_timeout = ctx.model_yaml_config.get("responseTimeout", 120)
    # Access strategy-specific settings
    model_type = ctx.model_yaml_config["pippy"]["model_type"]
    rpc_timeout = ctx.model_yaml_config["pippy"]["rpc_timeout"]
    # Access handler settings
    model_path = ctx.model_yaml_config["handler"]["model_path"]
    max_length = ctx.model_yaml_config["handler"]["max_length"]
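Because pp/tp/pptp workers are launched through torchrun, the standard torchrun environment variables (RANK, LOCAL_RANK, WORLD_SIZE) are also available inside initialize and are typically combined with the YAML settings. A minimal sketch:

import os

import torch

def initialize(self, ctx):
    # torchrun exports these for every spawned process.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    # Bind this rank to its GPU within the devices the frontend assigned
    # via CUDA_VISIBLE_DEVICES.
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank % torch.cuda.device_count())
    # world_size should agree with torchrun.nproc-per-node in the YAML.
    assert world_size == ctx.model_yaml_config["torchrun"]["nproc-per-node"]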
Usage Examples
Packaging with PiPPy config:
torch-model-archiver --model-name bloom \
--version 1.0 \
--handler pippy_handler.py \
--extra-files /path/to/checkpoints \
-r requirements.txt \
--config-file model-config.yaml \
--archive-format tgz
Starting TorchServe with increased startup timeout:
# In config.properties
model_store=/path/to/model_store
load_models=bloom.tar.gz
startup_timeout=600
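Alternatively, instead of load_models at startup, the archive can be registered at runtime through the management API (port 8081 by default). A minimal sketch using requests, assuming the archive is already in the model store:

import requests

# Register bloom.tar.gz from the model store and start one worker.
resp = requests.post(
    "http://localhost:8081/models",
    params={"url": "bloom.tar.gz", "initial_workers": 1, "synchronous": "true"},
    timeout=600,  # large models can take minutes to load
)
print(resp.status_code, resp.text)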
Related Pages
- Principle:Pytorch_Serve_Distributed_Configuration - Theory of declarative distributed configuration
- Pytorch_Serve_ParallelType_Config - ParallelType field options
- Pytorch_Serve_BasePippyHandler - PiPPy handler that reads pippy config
- Pytorch_Serve_BaseDeepSpeedHandler - DeepSpeed handler that reads deepspeed config
- Pytorch_Serve_TorchModelServiceWorker - Worker spawning based on torchrun config
- Environment:Pytorch_Serve_Distributed_Training_Environment - Distributed env vars for parallelism
- Environment:Pytorch_Serve_DeepSpeed_Environment - DeepSpeed env (when using DeepSpeed parallelism)
- Configuration
- Distributed_Computing