Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Spotify Luigi Worker Scheduler Communication

From Leeroopedia
Revision as of 18:19, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Spotify_Luigi_Worker_Scheduler_Communication.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)


Template:Metadata

Overview

Worker-Scheduler Communication is the protocol by which distributed task executors register work, report status, and receive assignments from a central scheduling service over a network interface.

Description

In a centralized pipeline orchestration architecture, the scheduler and workers run as separate processes -- often on different machines. The communication protocol between them must handle several concerns:

  • Task registration: Workers must declare which tasks they can execute, along with each task's dependencies, priority, and resource requirements. The scheduler uses this information to build and maintain the global task dependency graph.
  • Work assignment: Workers poll the scheduler for tasks that are ready to execute. The scheduler selects tasks whose dependencies have been satisfied and whose resource requirements can be met, then assigns them to the requesting worker.
  • Status reporting: As workers execute tasks, they report state transitions (started, completed, failed) back to the scheduler so it can update the global view and unblock downstream tasks.
  • Heartbeat and liveness: Workers periodically send heartbeat messages to prove they are still active. The scheduler uses these signals to detect worker failures and reassign tasks that were running on disconnected workers.
  • Transparent proxying: The communication layer should present the same interface regardless of whether the scheduler is local (in-process) or remote (over the network), allowing code to be tested locally and deployed to a distributed environment without changes.

This separation of concerns enables horizontal scaling: any number of workers can connect to a single scheduler, each pulling and executing tasks independently. The protocol is stateless from the worker's perspective -- each RPC call is self-contained, and the scheduler maintains all persistent state.

Usage

Use a Worker-Scheduler Communication protocol when:

  • Pipeline workers run on multiple machines and need to coordinate through a central scheduler.
  • You need reliable task handoff with retry logic to handle transient network failures.
  • The same task code must run against both local (in-process) and remote (network) schedulers.
  • Workers must register dependencies dynamically at runtime as they discover the task graph.
  • Liveness detection is required to automatically recover from worker failures.

Theoretical Basis

The Worker-Scheduler Communication protocol follows a pull-based work distribution model with RPC-over-HTTP transport:

  1. Proxy pattern: The RemoteScheduler class implements the same interface as the local Scheduler class. Methods decorated with @rpc_method() on the Scheduler are automatically mirrored onto RemoteScheduler, where each method call is translated into an HTTP POST to /api/{method_name} with JSON-encoded arguments.
  2. Registration phase: When a worker's add() method is called with a root task, it traverses the task's dependency tree. For each task, it calls scheduler.add_task() to register the task, its dependencies, status, and worker identity. This builds the scheduler's view of the DAG incrementally.
  3. Execution loop: The worker's run() method enters a loop that repeatedly calls scheduler.get_work(). The scheduler returns either a task ID to execute or None if no work is available. If no work is available and no tasks are running, the worker enters an idle state and eventually exits.
  4. Retry with backoff: Network calls to the scheduler are wrapped in a retry mechanism using the tenacity library. Failed requests are retried up to rpc-retry-attempts times (default 3) with a fixed wait of rpc-retry-wait seconds (default 30) between attempts.
  5. Connection management: The transport layer supports multiple HTTP backends -- requests (with connection pooling via sessions) when available, or urllib as a fallback. Unix domain sockets are supported via requests-unixsocket for co-located deployments.
  6. Heartbeat mechanism: A dedicated KeepAliveThread runs in the background, periodically calling scheduler.ping() to maintain the worker's active registration. The scheduler uses these pings to track worker liveness and detect disconnections.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment