Principle:Pytorch Serve Distributed Configuration

From Leeroopedia
Page Type: Principle
Title: Distributed Configuration
Domains: Configuration, Distributed_Computing
Knowledge Sources: TorchServe
Last Updated: 2026-02-13 00:00 GMT

Overview

Declarative configuration of distributed serving in TorchServe is achieved through the model-config.yaml file. This YAML file specifies the parallelism type, GPU allocation, torchrun settings, and strategy-specific parameters in a single configuration artifact that is bundled with the model archive. The configuration-driven approach separates the deployment topology from the handler code, allowing the same handler to be deployed with different parallelism settings without code changes.

Description

The model-config.yaml serves as the central configuration point for large model inference in TorchServe. It is organized into several sections:

Frontend Settings control how TorchServe manages the model workers:

  • minWorkers / maxWorkers: Number of worker instances. For large models these are typically both set to 1, since each worker already consumes multiple GPUs.
  • maxBatchDelay: Maximum time (in milliseconds) to wait for a full batch before processing.
  • responseTimeout: Maximum time (in seconds) to wait for a worker response before timing out.
  • parallelType: The parallelism strategy ("pp", "tp", "pptp", "custom").
  • deviceType: Device type, typically "gpu".
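As an illustrative sketch (the values below are assumptions for a 4-GPU pipeline-parallel deployment, not defaults), the frontend section of a model-config.yaml might look like:

```yaml
# Frontend settings (illustrative values)
minWorkers: 1          # one worker, since it will claim multiple GPUs
maxWorkers: 1
maxBatchDelay: 100     # milliseconds to wait for a full batch
responseTimeout: 120   # seconds before a worker response times out
parallelType: "pp"     # pipeline parallelism
deviceType: "gpu"
```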

Torchrun Settings control the distributed process launcher:

  • nproc-per-node: Number of processes (GPUs) per worker.
  • OMP_NUMBER_THREADS: Number of OpenMP threads per process (defaults to 1).
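A minimal sketch of the torchrun section, assuming one process per GPU on a 4-GPU node (values are illustrative):

```yaml
# Torchrun settings (illustrative)
torchrun:
  nproc-per-node: 4        # one process per GPU
  OMP_NUMBER_THREADS: 1    # OpenMP threads per process
```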

Strategy-Specific Settings provide parameters for the chosen parallelism framework:

  • pippy section: RPC timeout, model type, chunks, input names, worker threads.
  • deepspeed section: Config file path, checkpoint path.
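Hedged sketches of the two strategy-specific sections described above (key names and values are illustrative; a real deployment would include only the section matching its parallelType):

```yaml
# PiPPy section (illustrative values)
pippy:
  rpc_timeout: 1800            # RPC timeout in seconds
  model_type: "HF"             # model type hint
  chunks: 1                    # micro-batch chunks for pipelining
  input_names: ["input_ids"]   # names of the model inputs
  num_worker_threads: 512      # RPC worker threads

# DeepSpeed section (illustrative values)
deepspeed:
  config: ds-config.json       # path to the DeepSpeed config file
  checkpoint: checkpoints.json # path to the checkpoint description
```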

Handler Settings provide model-specific parameters:

  • model_path: Path to model checkpoints.
  • max_length: Maximum token length for tokenizer.
  • Additional handler-specific parameters (seed, dtype, temperature, etc.).
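A sketch of the handler section, with hypothetical paths and parameter values chosen for illustration:

```yaml
# Handler settings (illustrative values; paths are hypothetical)
handler:
  model_path: "/home/model-server/model-weights"
  max_length: 80       # maximum token length for the tokenizer
  seed: 42
  dtype: fp16
  temperature: 0.8
```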

The principle of declarative configuration means that the operational topology (how many GPUs, which parallelism strategy, what timeouts) is specified separately from the model serving logic (how to preprocess, infer, and postprocess). This separation enables:

  • Redeploying the same model with different GPU counts without code changes.
  • Switching parallelism strategies by changing a single YAML field.
  • Tuning performance parameters (batch delay, timeouts, threads) independently of model logic.

Usage

When creating a model-config.yaml for large model inference:

  1. Set minWorkers: 1 and maxWorkers: 1 unless the node has enough GPUs for multiple workers.
  2. Choose parallelType based on the model and framework ("pp" for PiPPy, "tp" for DeepSpeed/PyTorch TP, "custom" for Accelerate).
  3. Set torchrun.nproc-per-node to the number of GPUs per worker (do not set parallelLevel simultaneously).
  4. Add the strategy-specific section (pippy:, deepspeed:, or neither for custom).
  5. Set responseTimeout high enough for large model inference (120-300 seconds is common).
  6. Increase startupTimeout in TorchServe's config.properties if model loading is slow.
  7. Bundle the YAML file with the model archive via torch-model-archiver --config-file model-config.yaml.
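Putting the steps above together, a complete model-config.yaml for a hypothetical 4-GPU DeepSpeed tensor-parallel deployment might look like this (all values are illustrative, not defaults):

```yaml
# Complete model-config.yaml sketch (illustrative values throughout)

# Steps 1-2: one worker, tensor parallelism
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
responseTimeout: 300         # step 5: generous timeout for large-model inference
parallelType: "tp"
deviceType: "gpu"

# Step 3: one process per GPU
torchrun:
  nproc-per-node: 4

# Step 4: strategy-specific section for the chosen framework
deepspeed:
  config: ds-config.json
  checkpoint: checkpoints.json

# Handler parameters consumed by the custom handler
handler:
  model_path: "/home/model-server/model-weights"   # hypothetical path
  max_length: 80
```

Per step 7, this file is then bundled into the model archive with torch-model-archiver --config-file model-config.yaml.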

Theoretical Basis

The declarative configuration principle is rooted in the separation of concerns design pattern. In distributed systems, the deployment topology (how computation is distributed across nodes and devices) should be orthogonal to the application logic (what computation is performed).

This approach aligns with the Infrastructure as Code paradigm where deployment configurations are versioned artifacts. By including the configuration in the model archive (MAR file), each model version carries its own deployment specification, ensuring reproducibility.

The YAML format provides a hierarchical structure that naturally maps to the layered architecture of TorchServe:

  • Frontend layer: Worker management, batching, timeouts
  • Process layer: torchrun settings, process count
  • Framework layer: PiPPy/DeepSpeed-specific parameters
  • Handler layer: Model-specific settings

This hierarchy ensures that changes at one layer (e.g., increasing GPU count) do not require changes at other layers (e.g., handler logic), promoting modularity and maintainability.
