
Implementation:Pytorch Serve TorchModelServiceWorker

From Leeroopedia
Field          Value
Page Type      Implementation (API Doc)
Title          TorchModelServiceWorker
Implements     Principle:Pytorch_Serve_Distributed_Worker
Source         ts/model_service_worker.py
Repository     TorchServe
Last Updated   2026-02-13 00:00 GMT

Overview

TorchModelServiceWorker is the backend worker process that handles model loading and inference in TorchServe. For distributed large model inference, each torchrun-spawned process creates its own TorchModelServiceWorker instance with a rank-specific socket. The worker reads the LOCAL_RANK, WORLD_SIZE, RANK, and LOCAL_WORLD_SIZE environment variables to determine its position in the distributed topology. The critical design feature is that only the rank 0 worker sends inference responses back to the frontend, while all ranks participate in computation.

Description

The TorchModelServiceWorker class manages the lifecycle of a single backend worker process (a minimal sketch follows the list below):

1. Initialization: The constructor accepts socket type, name, host, port, and metrics config. For distributed inference, the socket name or port is offset by LOCAL_RANK:

  • Unix sockets: s_name_parts[0] + "." + str(int(s_name_parts[1]) + LOCAL_RANK)
  • TCP sockets: int(port_num) + LOCAL_RANK

2. Model Loading: The load_model() method receives a load request from the frontend containing model path, name, handler, GPU ID, batch size, and other parameters. It uses ModelLoaderFactory to load the model and returns a service object.

3. Connection Handling: The handle_connection() method runs a loop that receives commands from the frontend:

  • Command b"I": Inference request. Calls service.predict(msg) and sends the response only if LOCAL_RANK == 0.
  • Command b"L": Load model request. Calls self.load_model(msg) and sends the result to the frontend.

4. Async Mode: The handle_connection_async() method provides an alternative connection mode that loads the model and then delegates to AsyncService for asynchronous request handling.

5. Server Loop: The run_server() method binds the socket, listens for connections, and dispatches to the appropriate connection handler.
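
Taken together, steps 1-5 have roughly the following shape. This is a minimal illustrative sketch, not the real implementation: recv_command and load_model are hypothetical stand-ins for TorchServe's actual wire codec and ModelLoaderFactory path, and error handling, timeouts, and the async mode are omitted.

import logging
import os
import socket

LOCAL_RANK = int(os.getenv("LOCAL_RANK", 0))

def run_worker(base_name="/tmp/ts_worker", base_suffix=0):
    # Step 1: derive the rank-specific Unix socket name.
    sock_name = f"{base_name}.{base_suffix + LOCAL_RANK}"

    # Step 5: bind, listen, and accept the frontend connection.
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(sock_name)
    server.listen(1)
    conn, _ = server.accept()

    service = None
    while True:
        cmd, msg = recv_command(conn)               # hypothetical codec helper
        if cmd == b"L":
            # Step 2: load the model; every rank replies to a load request.
            service, resp, code = load_model(msg)   # hypothetical loader
            conn.sendall(resp)
        elif cmd == b"I":
            # Step 3: all ranks run predict, but only rank 0 responds.
            resp = service.predict(msg)
            if LOCAL_RANK == 0:
                conn.sendall(resp)
            else:
                logging.info("skip sending response at rank %d", LOCAL_RANK)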

Usage

Code Reference

Source Location: ts/model_service_worker.py (lines 30-268)

Signature:

class TorchModelServiceWorker(object):
    """
    Backend worker to handle Model Server's python service code
    """

    def __init__(
        self,
        s_type: Optional[str] = None,
        s_name: Optional[str] = None,
        host_addr: Optional[str] = None,
        port_num: Optional[int] = None,
        metrics_config: Optional[str] = None,
        async_comm: Optional[bool] = False,
    ):

Environment Variables (module-level constants):

LOCAL_RANK = int(os.getenv("LOCAL_RANK", 0))
WORLD_SIZE = int(os.getenv("WORLD_SIZE", 0))
WORLD_RANK = int(os.getenv("RANK", 0))
LOCAL_WORLD_SIZE = int(os.getenv("LOCAL_WORLD_SIZE", 0))
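
torchrun sets all four of these for each process it spawns. On a single node launched with nproc-per-node: 4, for example, the third worker process sees LOCAL_RANK=2, RANK=2, WORLD_SIZE=4, and LOCAL_WORLD_SIZE=4 (RANK coincides with LOCAL_RANK only in single-node launches).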

Socket Name Derivation (Unix, line 56):

s_name_new = s_name_parts[0] + "." + str(int(s_name_parts[1]) + LOCAL_RANK)
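
For example, with s_name = "/tmp/ts_worker.0" (split at its final dot) and LOCAL_RANK = 2:

s_name_parts == ["/tmp/ts_worker", "0"]
s_name_new == "/tmp/ts_worker.2"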

Port Offset (TCP, line 70):

self.port = int(port_num) + LOCAL_RANK

Rank 0 Response Guard (handle_connection, line 186):

if cmd == b"I":
    if service is not None:
        resp = service.predict(msg)
        if LOCAL_RANK == 0:
            cl_socket.sendall(resp)
        else:
            logging.info("skip sending response at rank %d", LOCAL_RANK)
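
Note that every rank still executes service.predict(msg): in tensor-parallel inference all ranks must run the forward pass for its collective operations to complete. Skipping sendall on non-zero ranks only prevents the frontend from receiving duplicate copies of the same response.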

Import:

from ts.model_service_worker import TorchModelServiceWorker

Module-level Constants:

MAX_FAILURE_THRESHOLD = 5
SOCKET_ACCEPT_TIMEOUT = 30.0
DEBUG = False
BENCHMARK = os.getenv("TS_BENCHMARK") in ["True", "true", "TRUE"]

I/O Contract

Constructor Inputs:

Parameter        Type             Description
s_type           Optional[str]    Socket type: "unix" or "tcp"
s_name           Optional[str]    Socket name (for Unix sockets), e.g. "/tmp/ts_worker.0"
host_addr        Optional[str]    Host address (for TCP sockets); defaults to "127.0.0.1"
port_num         Optional[int]    Base port number (for TCP sockets), offset by LOCAL_RANK
metrics_config   Optional[str]    Path to the metrics configuration YAML file
async_comm       Optional[bool]   Whether to use async communication mode (default: False)

load_model() Input (binary protocol message):

Field                 Type          Description
command               str           "load"
modelPath             str           Path to the model files
modelName             str           Model name
gpu                   int or None   GPU ID (None for CPU)
handler               str           Handler entry point
envelope              str           Request data wrapper name (optional)
batchSize             int           Batch size
limitMaxImagePixels   bool          Whether to limit PIL's maximum image pixels
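
Decoded, a load request is roughly the following dictionary (all values illustrative):

load_msg = {
    "command": "load",
    "modelPath": "/tmp/models/abc123",
    "modelName": "llama2-13b",
    "gpu": 0,
    "handler": "custom_handler.py",
    "envelope": None,
    "batchSize": 1,
    "limitMaxImagePixels": True,
}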

load_model() Output:

  • Tuple of (service, message, code):
    • service: The loaded model service object (or None on failure)
    • message: Status message string
    • code: HTTP status code (200 for success, 507 for OOM, 500 for other errors)
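
Mapping the contract above onto concrete outcomes:

# (service, message, code) by outcome:
#   successful load       -> (service, <status message>, 200)
#   out of memory on load -> (None, <error message>, 507)
#   any other failure     -> (None, <error message>, 500)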

handle_connection() Protocol:

  • Receives binary commands via socket: b"I" (inference) or b"L" (load)
  • For inference: calls service.predict(msg), sends response only from rank 0
  • For load: calls self.load_model(msg), sends load response from all ranks

Usage Examples

Worker startup (from __main__):

# This is how TorchServe (via torchrun) starts each worker process
args = ArgParser.model_service_worker_args().parse_args()
worker = TorchModelServiceWorker(
    sock_type,        # "unix" or "tcp"
    socket_name,      # e.g., "/tmp/ts_worker.0"
    host,             # e.g., "127.0.0.1"
    port,             # e.g., 9000 (will become 9000+LOCAL_RANK)
    metrics_config,   # path to metrics YAML
    async_comm,       # True/False
)
worker.run_server()
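
In normal operation this entry point is not invoked by hand: the TorchServe frontend launches it for each worker, and for distributed models it does so through torchrun, which supplies the rank environment variables described above.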

Example: 4-GPU distributed inference socket layout:

For a model with nproc-per-node: 4 and base socket name /tmp/ts_worker.0:

Process        LOCAL_RANK   Socket Name        Sends Response?
Worker-rank0   0            /tmp/ts_worker.0   Yes
Worker-rank1   1            /tmp/ts_worker.1   No (skipped)
Worker-rank2   2            /tmp/ts_worker.2   No (skipped)
Worker-rank3   3            /tmp/ts_worker.3   No (skipped)
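
A conceptually equivalent manual launch (the torchrun form follows the table above; the worker flag names are illustrative, see ts/arg_parser.py for the exact CLI):

torchrun --nproc-per-node=4 ts/model_service_worker.py \
    --sock-type unix --sock-name /tmp/ts_worker.0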

Example: TCP socket layout with base port 9000:

Process        LOCAL_RANK   Address:Port     Sends Response?
Worker-rank0   0            127.0.0.1:9000   Yes
Worker-rank1   1            127.0.0.1:9001   No (skipped)
Worker-rank2   2            127.0.0.1:9002   No (skipped)
Worker-rank3   3            127.0.0.1:9003   No (skipped)
