Principle:Pytorch Serve Distributed Worker
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | Distributed Worker Management |
| Domains | Distributed_Computing, Infrastructure |
| Knowledge Sources | TorchServe |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Distributed worker management in TorchServe handles the coordination of multiple worker processes spawned by torchrun for large model inference. When a model requires multiple GPUs, TorchServe uses torchrun to launch one process per GPU. Each process connects to the TorchServe frontend via a unique socket (derived from the base socket name plus the process's local rank), processes inference requests in parallel, and only the rank 0 process sends responses back to the frontend. This design ensures that all GPU processes participate in distributed computation while avoiding duplicate responses.
Description
When TorchServe serves a model with `parallelType` set to `pp`, `tp`, or `pptp`, the following worker management process occurs:
1. Process Spawning: TorchServe's frontend launches the backend worker via torchrun instead of a direct Python process. torchrun spawns N processes (one per GPU, as specified by nproc-per-node). Each process receives environment variables:
- `LOCAL_RANK`: The rank of this process on the current node (0 to N-1).
- `WORLD_SIZE`: Total number of processes across all nodes.
- `RANK`: Global rank of this process.
- `LOCAL_WORLD_SIZE`: Number of processes on the current node.
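A handler can read these torchrun-provided variables directly. A minimal sketch (the helper name and single-process defaults are illustrative, not part of TorchServe's API):

```python
import os

def distributed_context():
    """Read the torchrun-provided environment variables, falling back to a
    single-process layout when they are absent (e.g. non-distributed runs)."""
    return {
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
        "rank": int(os.environ.get("RANK", 0)),
        "local_world_size": int(os.environ.get("LOCAL_WORLD_SIZE", 1)),
    }
```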
2. Socket Assignment: Each spawned process creates its own socket connection to the TorchServe frontend. The socket name is derived from the base socket name by offsetting with the local rank:
- For Unix sockets: If the base name is `worker.0`, rank 0 uses `worker.0`, rank 1 uses `worker.1`, etc. The formula is `s_name_parts[0] + "." + str(int(s_name_parts[1]) + LOCAL_RANK)`.
- For TCP sockets: The port number is offset by the local rank: `int(port_num) + LOCAL_RANK`.
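The rank offset described above can be sketched as a small helper (the function name is illustrative; the actual derivation lives in TorchServe's Python backend worker):

```python
import os

def rank_socket_name(sock_type, sock_name, port=None):
    """Derive this process's socket endpoint from the base name plus LOCAL_RANK."""
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    if sock_type == "unix":
        # "worker.0" -> ("worker", "0"); rank 1 then yields "worker.1"
        base, index = sock_name.rsplit(".", 1)
        return base + "." + str(int(index) + local_rank)
    # TCP: offset the port number by the local rank
    return str(int(port) + local_rank)
```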
3. Model Loading: Each process receives the load model command over its socket, loads the model, and initializes its portion of the distributed computation (e.g., its pipeline stage or tensor shard).
4. Inference Routing: When an inference request arrives:
- The frontend sends the request to all worker processes (one per socket).
- Each process receives the request, runs its part of the distributed inference.
- Only rank 0 sends the response back to the frontend. All other ranks log a skip message: `"skip sending response at rank %d"`.
This rank-0-only response pattern prevents duplicate responses while ensuring all processes participate in the distributed computation needed to produce the result.
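The rank-0-only response pattern amounts to a guard before the socket write. A simplified sketch (the function and `send_fn` parameter are hypothetical stand-ins; the real check lives in `handle_connection()`):

```python
import logging
import os

logger = logging.getLogger(__name__)

def maybe_send_response(send_fn, response):
    """Send the response only from rank 0; other ranks log and skip.

    `send_fn` stands in for the frontend-socket write. Every rank still ran
    its share of the distributed inference before reaching this point.
    """
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    if local_rank == 0:
        send_fn(response)
        return True
    logger.info("skip sending response at rank %d", local_rank)
    return False
```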
5. Rank-Aware Execution: The handler code is rank-aware through the environment variables. For example, BasePippyHandler reads LOCAL_RANK and WORLD_SIZE to determine its position in the pipeline and initialize the correct RPC worker.
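A rank-aware `initialize()` in the spirit of BasePippyHandler might look like the following sketch (the class and the stage-flag attributes are illustrative; a real handler would also load its pipeline stage or tensor shard onto the GPU matching `LOCAL_RANK`):

```python
import os

class RankAwareHandler:
    """Minimal sketch of a handler that records its distributed position."""

    def initialize(self, context=None):
        # Position of this process, as set by torchrun
        self.local_rank = int(os.environ.get("LOCAL_RANK", 0))
        self.world_size = int(os.environ.get("WORLD_SIZE", 1))
        # e.g. first pipeline stage on rank 0, last on world_size - 1
        self.is_first_stage = self.local_rank == 0
        self.is_last_stage = self.local_rank == self.world_size - 1
```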
Usage
The distributed worker management is transparent to handler developers in most cases. Key points to be aware of:
- Socket uniqueness: Each torchrun process gets its own socket. The frontend expects exactly `nproc-per-node` socket connections per worker.
- Rank 0 response: Only the rank 0 process should send responses. This is enforced in `handle_connection()` at the worker level and in `send_intermediate_predict_response()` for streaming.
- Single worker per model: Large models typically require `minWorkers: 1, maxWorkers: 1` because each worker uses multiple GPUs.
- Timeout considerations: The `SOCKET_ACCEPT_TIMEOUT` is 30 seconds. For very large models with slow initialization, the startup timeout in `config.properties` may need to be increased.
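A model-config fragment tying these settings together might look like the following sketch (key names follow TorchServe's model-config.yaml conventions; the values are illustrative, not recommendations):

```yaml
# model-config.yaml (illustrative values)
minWorkers: 1          # a single worker, since it spans multiple GPUs
maxWorkers: 1
parallelType: "pp"     # pipeline parallelism; torchrun will be used
torchrun:
  nproc-per-node: 4    # one process (and one socket) per GPU
responseTimeout: 1200  # generous timeout for slow large-model startup
```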
When a handler needs to perform rank-specific logic (e.g., only rank 0 should log results or save state), it should check `os.environ["LOCAL_RANK"]` or the `self.local_rank` attribute set by the base handler.
Theoretical Basis
The distributed worker architecture in TorchServe follows the SPMD (Single Program, Multiple Data) execution model. All processes run the same handler code, but each process operates on its assigned portion of the model (determined by its rank) and its assigned GPU.
The rank-0-as-coordinator pattern is a common approach in distributed systems where one process is designated as the leader for external communication. In TorchServe's case:
- All ranks receive input (broadcast pattern).
- All ranks participate in distributed computation (collective pattern).
- Only rank 0 produces the final output (gather/reduce pattern).
The socket-per-rank design provides independent communication channels between the frontend and each backend process. This avoids the need for a separate inter-process communication layer for input distribution and allows the frontend to independently manage the lifecycle of each process.
The torchrun launcher handles process lifecycle management, including:
- Elastic training/inference: Processes can be restarted on failure.
- Environment setup: Automatically configures distributed environment variables.
- Rendezvous: Coordinates process startup to ensure all ranks are ready before computation begins.
Related Pages
- Implementation:Pytorch_Serve_TorchModelServiceWorker - Worker process implementation with rank-aware sockets
- Pytorch_Serve_Parallelism_Strategy - How parallelType determines whether torchrun is used
- Pytorch_Serve_Distributed_Configuration - Configuration of torchrun parameters
- Pytorch_Serve_Streaming_Inference - Streaming responses with rank-aware sending