Principle:Pytorch Serve Distributed Configuration
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | Distributed Configuration |
| Domains | Configuration, Distributed_Computing |
| Knowledge Sources | TorchServe |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Declarative configuration of distributed serving in TorchServe is achieved through the model-config.yaml file. This YAML file specifies the parallelism type, GPU allocation, torchrun settings, and strategy-specific parameters in a single configuration artifact that is bundled with the model archive. The configuration-driven approach separates the deployment topology from the handler code, allowing the same handler to be deployed with different parallelism settings without code changes.
Description
The model-config.yaml serves as the central configuration point for large model inference in TorchServe. It is organized into several sections:
Frontend Settings control how TorchServe manages the model workers:
- minWorkers / maxWorkers: Number of worker instances. For large models, typically set to 1, since each worker consumes multiple GPUs.
- maxBatchDelay: Maximum time (in milliseconds) to wait for a full batch before processing.
- responseTimeout: Maximum time (in seconds) to wait for a worker response before timing out.
- parallelType: The parallelism strategy ("pp", "tp", "pptp", or "custom").
- deviceType: Device type, typically "gpu".
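As a sketch, the frontend settings sit at the top level of model-config.yaml; the values below are illustrative, not defaults:

```yaml
# Frontend settings (top level of model-config.yaml); values are illustrative
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100     # milliseconds to wait for a full batch
responseTimeout: 120   # seconds before a worker response times out
parallelType: "tp"     # one of "pp", "tp", "pptp", "custom"
deviceType: "gpu"
```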
Torchrun Settings control the distributed process launcher:
- nproc-per-node: Number of processes (GPUs) per worker.
- OMP_NUMBER_THREADS: Number of OpenMP threads per process (defaults to 1).
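A minimal sketch of the torchrun section, assuming one worker spanning four GPUs:

```yaml
# Torchrun settings: one process is launched per GPU
torchrun:
  nproc-per-node: 4
  OMP_NUMBER_THREADS: 1
```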
Strategy-Specific Settings provide parameters for the chosen parallelism framework:
- pippy section: RPC timeout, model type, chunks, input names, worker threads.
- deepspeed section: Config file path, checkpoint path.
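The two strategy sections might look like the following sketch; the field names mirror TorchServe's large-model examples, but the values and file paths here are placeholders that should be checked against the handler in use:

```yaml
# PiPPy-specific section (values illustrative)
pippy:
  rpc_timeout: 1800
  model_type: "HF"
  chunks: 1
  input_names: ["input_ids"]
  num_worker_threads: 512

# DeepSpeed-specific section (paths are placeholders)
deepspeed:
  config: ds-config.json
  checkpoint: checkpoints.json
```

Only one of the two sections would appear in a given config, matching the chosen parallelType.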
Handler Settings provide model-specific parameters:
- model_path: Path to model checkpoints.
- max_length: Maximum token length for the tokenizer.
- Additional handler-specific parameters (seed, dtype, temperature, etc.).
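Because these keys are read by the custom handler rather than the frontend, their names vary per handler; a sketch with placeholder values:

```yaml
# Handler settings (keys are interpreted by the custom handler)
handler:
  model_path: "model/checkpoints"   # placeholder path
  max_length: 80
  manual_seed: 40
```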
The principle of declarative configuration means that the operational topology (how many GPUs, which parallelism strategy, what timeouts) is specified separately from the model serving logic (how to preprocess, infer, and postprocess). This separation enables:
- Redeploying the same model with different GPU counts without code changes.
- Switching parallelism strategies by changing a single YAML field.
- Tuning performance parameters (batch delay, timeouts, threads) independently of model logic.
Usage
When creating a model-config.yaml for large model inference:
- Set minWorkers: 1 and maxWorkers: 1 unless the node has enough GPUs for multiple workers.
- Choose parallelType based on the model and framework ("pp" for PiPPy, "tp" for DeepSpeed/PyTorch TP, "custom" for Accelerate).
- Set torchrun.nproc-per-node to the number of GPUs per worker (do not set parallelLevel simultaneously).
- Add the strategy-specific section (pippy:, deepspeed:, or neither for custom).
- Set responseTimeout high enough for large model inference (120-300 seconds is common).
- Increase startupTimeout in TorchServe's config.properties if model loading is slow.
- Bundle the YAML file with the model archive via torch-model-archiver --config-file model-config.yaml.
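Putting the checklist together, a complete model-config.yaml for a four-GPU tensor-parallel DeepSpeed deployment might look like this sketch (model name, paths, and numeric values are placeholders):

```yaml
# Frontend
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
responseTimeout: 300      # generous timeout for large-model inference
parallelType: "tp"
deviceType: "gpu"

# Process launcher
torchrun:
  nproc-per-node: 4       # matches the number of GPUs per worker

# Strategy-specific
deepspeed:
  config: ds-config.json
  checkpoint: checkpoints.json

# Handler-specific (keys depend on the custom handler)
handler:
  model_path: "model/checkpoints"
  max_length: 80
```

The file is then bundled at archive time, e.g. `torch-model-archiver --model-name my-model --version 1.0 --handler custom_handler.py --config-file model-config.yaml`, where the model name and handler file are illustrative.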
Theoretical Basis
The declarative configuration principle is rooted in the separation of concerns design pattern. In distributed systems, the deployment topology (how computation is distributed across nodes and devices) should be orthogonal to the application logic (what computation is performed).
This approach aligns with the Infrastructure as Code paradigm where deployment configurations are versioned artifacts. By including the configuration in the model archive (MAR file), each model version carries its own deployment specification, ensuring reproducibility.
The YAML format provides a hierarchical structure that naturally maps to the layered architecture of TorchServe:
- Frontend layer: Worker management, batching, timeouts
- Process layer: torchrun settings, process count
- Framework layer: PiPPy/DeepSpeed-specific parameters
- Handler layer: Model-specific settings
This hierarchy ensures that changes at one layer (e.g., increasing GPU count) do not require changes at other layers (e.g., handler logic), promoting modularity and maintainability.
Related Pages
- Implementation:Pytorch_Serve_Parallelism_Model_Config - Concrete model-config.yaml examples
- Pytorch_Serve_Parallelism_Strategy - Choosing the parallelism strategy to configure
- Pytorch_Serve_Distributed_Worker - How configuration affects worker process management
- Pytorch_Serve_Pipeline_Parallelism - PiPPy-specific configuration parameters
- Pytorch_Serve_DeepSpeed_Inference - DeepSpeed-specific configuration parameters