
Implementation:Pytorch Serve Parallelism Model Config

From Leeroopedia
Field Value
Page Type Implementation (Pattern Doc)
Title Parallelism Model Config
Implements Principle:Pytorch_Serve_Distributed_Configuration
Source examples/large_models/Huggingface_pippy/model-config.yaml, examples/large_models/tp_llama/model-config.yaml
Repository TorchServe
Last Updated 2026-02-13 00:00 GMT

Overview

The model-config.yaml files for large model inference define the complete deployment specification for distributed serving in TorchServe. This document presents the actual YAML configuration patterns used for pipeline parallelism (PiPPy) and tensor parallelism (PyTorch TP / DeepSpeed), including all available parameters, their types, and their effects on the serving infrastructure.

Description

Two representative configuration patterns are documented here:

1. PiPPy Pipeline Parallelism Config (examples/large_models/Huggingface_pippy/model-config.yaml): Configures pipeline parallelism with parallelType: "pp", specifying RPC settings, microbatching, and FX tracing parameters under the pippy: section.

2. PyTorch TP / Tensor Parallelism Config (examples/large_models/tp_llama/model-config.yaml): Configures tensor parallelism with parallelType: "tp", using a simpler configuration that focuses on handler-level parameters for model loading and generation.

Both patterns share common frontend settings (workers, batching, timeouts, device type, torchrun) but differ in their strategy-specific sections.
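To make the dispatch concrete, the sketch below (illustrative, not TorchServe source) takes the parsed form of the PiPPy example config, shown as the equivalent Python dict, and applies the rule described above: pp/tp/pptp configs launch workers through torchrun, while custom runs a single process.

```python
# Equivalent of the PiPPy example config after YAML parsing.
config = {
    "minWorkers": 1,
    "maxWorkers": 1,
    "maxBatchDelay": 200,
    "responseTimeout": 300,
    "parallelType": "pp",
    "deviceType": "gpu",
    "torchrun": {"nproc-per-node": 4},
}

def launch_plan(cfg):
    """Describe the launch decision implied by a parsed model-config.yaml."""
    parallel_type = cfg.get("parallelType", "custom")
    if parallel_type in ("pp", "tp", "pptp"):
        # Distributed strategies are launched via torchrun with one
        # process per GPU, taken from the torchrun section.
        nproc = cfg.get("torchrun", {}).get("nproc-per-node", 1)
        return "torchrun --nproc-per-node=%d (%s)" % (nproc, parallel_type)
    return "single process (custom)"

print(launch_plan(config))  # torchrun --nproc-per-node=4 (pp)
```

Changing parallelType to "tp" with nproc-per-node: 1, as in the Llama example below, would yield a single-process torchrun launch instead.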

Usage

Code Reference

Source Location:

  • examples/large_models/Huggingface_pippy/model-config.yaml (lines 1-26)
  • examples/large_models/tp_llama/model-config.yaml (lines 1-20)

PiPPy Pipeline Parallelism Config

#frontend settings
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 200
responseTimeout: 300
parallelType: "pp"
deviceType: "gpu"
torchrun:
    nproc-per-node: 4

#backend settings
pippy:
    rpc_timeout: 1800
    model_type: "HF"
    chunks: 1
    input_names: ["input_ids"]
    num_worker_threads: 128

handler:
    model_path: "/path/to/model/checkpoints"
    index_filename: 'pytorch_model.bin.index.json'
    max_length: 50
    max_new_tokens: 60
    manual_seed: 40
    dtype: fp16

Tensor Parallelism Config (Llama)

#frontend settings
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 200
responseTimeout: 300
parallelType: "tp"
deviceType: "gpu"

torchrun:
    nproc-per-node: 1

handler:
    converted_ckpt_dir: "converted_checkpoints"
    tokenizer_path: "tokenizer.model"
    model_args_path: "model_args.json"
    max_new_tokens: 50
    temperature: 0.6
    top_p: 0.9
    manual_seed: 40
    mode: "chat"  # choices are text_completion, chat

DeepSpeed Tensor Parallelism Config

#frontend settings
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
responseTimeout: 120
parallelType: "tp"
deviceType: "gpu"
torchrun:
    nproc-per-node: 4

#backend settings
deepspeed:
    config: ds-config.json

handler:
    model_name: "facebook/opt-30b"
    model_path: "/path/to/model/checkpoints"
    max_length: 80
    max_new_tokens: 50
    manual_seed: 40
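The deepspeed.config entry points at a separate DeepSpeed inference config file shipped alongside the model config. A minimal sketch of what ds-config.json might contain, assuming DeepSpeed's inference config keys (dtype, kernel injection, tensor-parallel degree); the values here are illustrative, not taken from the source example:

```json
{
  "dtype": "torch.float16",
  "replace_with_kernel_inject": true,
  "tensor_parallel": {
    "tp_size": 4
  }
}
```

The tensor-parallel degree in this file should match torchrun's nproc-per-node so that DeepSpeed shards the model across exactly the processes TorchServe launches.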

Key Parameters Reference

Section Parameter Type Description
Frontend minWorkers int Minimum number of workers. Typically 1 for large models.
Frontend maxWorkers int Maximum number of workers. Typically 1 for large models.
Frontend maxBatchDelay int Max milliseconds to wait for a full batch (100-200 typical).
Frontend responseTimeout int Max seconds to wait for worker response (120-300 typical).
Frontend parallelType str "pp", "tp", "pptp", or "custom".
Frontend deviceType str "gpu" for GPU inference.
Torchrun nproc-per-node int Processes per worker (equals number of GPUs).
Torchrun OMP_NUMBER_THREADS int OpenMP threads per process (default 1).
PiPPy pippy.rpc_timeout int RPC call timeout in seconds.
PiPPy pippy.model_type str "HF" for HuggingFace models.
PiPPy pippy.chunks int Number of microbatches (microbatch = batch / chunks).
PiPPy pippy.input_names list[str] Input argument names for FX tracing.
PiPPy pippy.num_worker_threads int RPC worker thread count.
DeepSpeed deepspeed.config str DeepSpeed config JSON filename.
DeepSpeed deepspeed.checkpoint str Checkpoint index filename (optional).
Handler handler.model_path str Path to model checkpoints.
Handler handler.max_length int Maximum input token length.
Handler handler.max_new_tokens int Maximum tokens to generate.
Handler handler.manual_seed int Random seed for reproducibility.
Handler handler.dtype str Data type: "fp16", "fp32", "bf16".
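The microbatch relationship in the table (microbatch = batch / chunks) can be sketched directly; this is an illustration of the arithmetic, not TorchServe code, and it assumes chunks must evenly divide the batch size:

```python
def microbatch_size(batch_size, chunks):
    """Size of each pipeline microbatch given pippy.chunks."""
    if batch_size % chunks != 0:
        # An uneven split would leave a ragged final microbatch.
        raise ValueError("chunks must evenly divide the batch size")
    return batch_size // chunks

# With chunks: 1 (as in the PiPPy config above) the whole batch is a
# single microbatch; chunks: 4 splits a batch of 8 into microbatches of 2.
print(microbatch_size(8, 1))  # 8
print(microbatch_size(8, 4))  # 2
```

Larger chunks values keep more pipeline stages busy concurrently at the cost of smaller per-stage batches.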

I/O Contract

Input: The model-config.yaml file is passed to torch-model-archiver via --config-file and is bundled into the MAR archive.

Output: At model load time, TorchServe parses the YAML and:

  • Creates the specified number of workers.
  • Launches each worker with torchrun (for pp/tp/pptp) or as a single process (for custom).
  • Sets CUDA_VISIBLE_DEVICES based on GPU allocation.
  • Passes the parsed YAML to the handler via ctx.model_yaml_config.

Handler access pattern:

def initialize(self, ctx):
    # Frontend settings (informational; enforced by the frontend)
    response_timeout = ctx.model_yaml_config.get("responseTimeout", 120)

    # Strategy-specific settings; the "pippy" section is present only
    # when parallelType is "pp", so guard access for other strategies
    model_type = ctx.model_yaml_config["pippy"]["model_type"]
    rpc_timeout = ctx.model_yaml_config["pippy"]["rpc_timeout"]

    # Handler settings
    model_path = ctx.model_yaml_config["handler"]["model_path"]
    max_length = ctx.model_yaml_config["handler"]["max_length"]

Usage Examples

Packaging with PiPPy config:

torch-model-archiver --model-name bloom \
    --version 1.0 \
    --handler pippy_handler.py \
    --extra-files /path/to/checkpoints \
    -r requirements.txt \
    --config-file model-config.yaml \
    --archive-format tgz

Starting TorchServe with increased startup timeout:

# In config.properties
model_store=/path/to/model_store
load_models=bloom.tar.gz
startup_timeout=600
