Principle:Pytorch Serve Distributed Worker
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | Distributed Worker Management |
| Domains | Distributed_Computing, Infrastructure |
| Knowledge Sources | TorchServe |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Distributed worker management in TorchServe handles the coordination of multiple worker processes spawned by torchrun for large model inference. When a model requires multiple GPUs, TorchServe uses torchrun to launch one process per GPU. Each process connects to the TorchServe frontend via a unique socket (derived from the base socket name plus the process's local rank), processes inference requests in parallel, and only the rank 0 process sends responses back to the frontend. This design ensures that all GPU processes participate in distributed computation while avoiding duplicate responses.
Description
When TorchServe serves a model with `parallelType` set to `pp`, `tp`, or `pptp`, the following worker management process occurs:
1. Process Spawning: TorchServe's frontend launches the backend worker via torchrun instead of a direct Python process. torchrun spawns N processes (one per GPU, as specified by nproc-per-node). Each process receives environment variables:
- `LOCAL_RANK`: The rank of this process on the current node (0 to N-1).
- `WORLD_SIZE`: Total number of processes across all nodes.
- `RANK`: Global rank of this process.
- `LOCAL_WORLD_SIZE`: Number of processes on the current node.
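A handler can read these torchrun-provided variables directly. A minimal sketch (the helper name and single-process defaults are illustrative, not part of TorchServe's API):

```python
import os

def distributed_context():
    """Read the torchrun-provided environment variables, falling back to a
    single-process layout when they are absent (e.g. non-distributed runs)."""
    return {
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
        "rank": int(os.environ.get("RANK", 0)),
        "local_world_size": int(os.environ.get("LOCAL_WORLD_SIZE", 1)),
    }
```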
2. Socket Assignment: Each spawned process creates its own socket connection to the TorchServe frontend. The socket name is derived from the base socket name by offsetting with the local rank:
- For Unix sockets: If the base name is `worker.0`, rank 0 uses `worker.0`, rank 1 uses `worker.1`, etc. The formula is `s_name_parts[0] + "." + str(int(s_name_parts[1]) + LOCAL_RANK)`.
- For TCP sockets: The port number is offset by the local rank: `int(port_num) + LOCAL_RANK`.
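The rank offset described above can be sketched as a small helper (the function name is illustrative; the actual derivation lives in TorchServe's Python backend worker):

```python
import os

def rank_socket_name(sock_type, sock_name, port=None):
    """Derive this process's socket endpoint from the base name plus LOCAL_RANK."""
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    if sock_type == "unix":
        # "worker.0" -> ("worker", "0"); rank 1 then yields "worker.1"
        base, index = sock_name.rsplit(".", 1)
        return base + "." + str(int(index) + local_rank)
    # TCP: offset the port number by the local rank
    return str(int(port) + local_rank)
```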
3. Model Loading: Each process receives the load model command over its socket, loads the model, and initializes its portion of the distributed computation (e.g., its pipeline stage or tensor shard).
4. Inference Routing: When an inference request arrives:
- The frontend sends the request to all worker processes (one per socket).
- Each process receives the request, runs its part of the distributed inference.
- Only rank 0 sends the response back to the frontend. All other ranks log a skip message: `"skip sending response at rank %d"`.
This rank-0-only response pattern prevents duplicate responses while ensuring all processes participate in the distributed computation needed to produce the result.
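The rank-0-only response pattern amounts to a guard before the socket write. A simplified sketch (the function and `send_fn` parameter are hypothetical stand-ins; the real check lives in `handle_connection()`):

```python
import logging
import os

logger = logging.getLogger(__name__)

def maybe_send_response(send_fn, response):
    """Send the response only from rank 0; other ranks log and skip.

    `send_fn` stands in for the frontend-socket write. Every rank still ran
    its share of the distributed inference before reaching this point.
    """
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    if local_rank == 0:
        send_fn(response)
        return True
    logger.info("skip sending response at rank %d", local_rank)
    return False
```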
5. Rank-Aware Execution: The handler code is rank-aware through the environment variables. For example, BasePippyHandler reads LOCAL_RANK and WORLD_SIZE to determine its position in the pipeline and initialize the correct RPC worker.
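A rank-aware `initialize()` in the spirit of BasePippyHandler might look like the following sketch (the class and the stage-flag attributes are illustrative; a real handler would also load its pipeline stage or tensor shard onto the GPU matching `LOCAL_RANK`):

```python
import os

class RankAwareHandler:
    """Minimal sketch of a handler that records its distributed position."""

    def initialize(self, context=None):
        # Position of this process, as set by torchrun
        self.local_rank = int(os.environ.get("LOCAL_RANK", 0))
        self.world_size = int(os.environ.get("WORLD_SIZE", 1))
        # e.g. first pipeline stage on rank 0, last on world_size - 1
        self.is_first_stage = self.local_rank == 0
        self.is_last_stage = self.local_rank == self.world_size - 1
```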
Usage
The distributed worker management is transparent to handler developers in most cases. Key points to be aware of:
- Socket uniqueness: Each torchrun process gets its own socket. The frontend expects exactly `nproc-per-node` socket connections per worker.
- Rank 0 response: Only the rank 0 process should send responses. This is enforced in `handle_connection()` at the worker level and in `send_intermediate_predict_response()` for streaming.
- Single worker per model: Large models typically require `minWorkers: 1, maxWorkers: 1` because each worker uses multiple GPUs.
- Timeout considerations: The `SOCKET_ACCEPT_TIMEOUT` is 30 seconds. For very large models with slow initialization, the startup timeout in `config.properties` may need to be increased.
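A model-config fragment tying these settings together might look like the following sketch (key names follow TorchServe's model-config.yaml conventions; the values are illustrative, not recommendations):

```yaml
# model-config.yaml (illustrative values)
minWorkers: 1          # a single worker, since it spans multiple GPUs
maxWorkers: 1
parallelType: "pp"     # pipeline parallelism; torchrun will be used
torchrun:
  nproc-per-node: 4    # one process (and one socket) per GPU
responseTimeout: 1200  # generous timeout for slow large-model startup
```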
When a handler needs to perform rank-specific logic (e.g., only rank 0 should log results or save state), it should check `os.environ["LOCAL_RANK"]` or the `self.local_rank` attribute set by the base handler.
Theoretical Basis
The distributed worker architecture in TorchServe follows the SPMD (Single Program, Multiple Data) execution model. All processes run the same handler code, but each process operates on its assigned portion of the model (determined by its rank) and its assigned GPU.
The rank-0-as-coordinator pattern is a common approach in distributed systems where one process is designated as the leader for external communication. In TorchServe's case:
- All ranks receive input (broadcast pattern).
- All ranks participate in distributed computation (collective pattern).
- Only rank 0 produces the final output (gather/reduce pattern).
The socket-per-rank design provides independent communication channels between the frontend and each backend process. This avoids the need for a separate inter-process communication layer for input distribution and allows the frontend to independently manage the lifecycle of each process.
The torchrun launcher handles process lifecycle management, including:
- Elastic training/inference: Processes can be restarted on failure.
- Environment setup: Automatically configures distributed environment variables.
- Rendezvous: Coordinates process startup to ensure all ranks are ready before computation begins.
Related Pages
- Implementation:Pytorch_Serve_TorchModelServiceWorker - Worker process implementation with rank-aware sockets
- Pytorch_Serve_Parallelism_Strategy - How parallelType determines whether torchrun is used
- Pytorch_Serve_Distributed_Configuration - Configuration of torchrun parameters
- Pytorch_Serve_Streaming_Inference - Streaming responses with rank-aware sending