Principle:Spotify Luigi Worker Scheduler Communication
Overview
Worker-Scheduler Communication is the protocol by which distributed task executors register work, report status, and receive assignments from a central scheduling service over a network interface.
Description
In a centralized pipeline orchestration architecture, the scheduler and workers run as separate processes -- often on different machines. The communication protocol between them must handle several concerns:
- Task registration: Workers must declare which tasks they can execute, along with each task's dependencies, priority, and resource requirements. The scheduler uses this information to build and maintain the global task dependency graph.
- Work assignment: Workers poll the scheduler for tasks that are ready to execute. The scheduler selects tasks whose dependencies have been satisfied and whose resource requirements can be met, then assigns them to the requesting worker.
- Status reporting: As workers execute tasks, they report state transitions (started, completed, failed) back to the scheduler so it can update the global view and unblock downstream tasks.
- Heartbeat and liveness: Workers periodically send heartbeat messages to prove they are still active. The scheduler uses these signals to detect worker failures and reassign tasks that were running on disconnected workers.
- Transparent proxying: The communication layer should present the same interface regardless of whether the scheduler is local (in-process) or remote (over the network), allowing code to be tested locally and deployed to a distributed environment without changes.
This separation of concerns enables horizontal scaling: any number of workers can connect to a single scheduler, each pulling and executing tasks independently. The protocol is stateless from the worker's perspective -- each RPC call is self-contained, and the scheduler maintains all persistent state.
Usage
Use a Worker-Scheduler Communication protocol when:
- Pipeline workers run on multiple machines and need to coordinate through a central scheduler.
- You need reliable task handoff with retry logic to handle transient network failures.
- The same task code must run against both local (in-process) and remote (network) schedulers.
- Workers must register dependencies dynamically at runtime as they discover the task graph.
- Liveness detection is required to automatically recover from worker failures.
Theoretical Basis
The Worker-Scheduler Communication protocol follows a pull-based work distribution model with RPC-over-HTTP transport:
- Proxy pattern: The
RemoteSchedulerclass implements the same interface as the localSchedulerclass. Methods decorated with@rpc_method()on theSchedulerare automatically mirrored ontoRemoteScheduler, where each method call is translated into an HTTP POST to/api/{method_name}with JSON-encoded arguments. - Registration phase: When a worker's
add()method is called with a root task, it traverses the task's dependency tree. For each task, it callsscheduler.add_task()to register the task, its dependencies, status, and worker identity. This builds the scheduler's view of the DAG incrementally. - Execution loop: The worker's
run()method enters a loop that repeatedly callsscheduler.get_work(). The scheduler returns either a task ID to execute orNoneif no work is available. If no work is available and no tasks are running, the worker enters an idle state and eventually exits. - Retry with backoff: Network calls to the scheduler are wrapped in a retry mechanism using the
tenacitylibrary. Failed requests are retried up torpc-retry-attemptstimes (default 3) with a fixed wait ofrpc-retry-waitseconds (default 30) between attempts. - Connection management: The transport layer supports multiple HTTP backends --
requests(with connection pooling via sessions) when available, orurllibas a fallback. Unix domain sockets are supported viarequests-unixsocketfor co-located deployments. - Heartbeat mechanism: A dedicated
KeepAliveThreadruns in the background, periodically callingscheduler.ping()to maintain the worker's active registration. The scheduler uses these pings to track worker liveness and detect disconnections.