Principle:Spotify Luigi Scheduler Daemon

Overview

A Scheduler Daemon is a long-running central service that coordinates task execution across distributed workers by maintaining a global view of the task dependency graph, assigning work, and persisting state across restarts.

Description

In production pipeline orchestration, tasks are submitted by multiple workers running on different machines. Without a central coordinator, each worker would have an incomplete picture of the overall pipeline state, leading to duplicated work, missed dependencies, and inconsistent failure handling.

A Scheduler Daemon solves this by running as a persistent background process (daemon) that:

Maintains a global task graph: The scheduler holds the complete directed acyclic graph (DAG) of all registered tasks and their dependencies, enabling it to determine which tasks are ready to execute.
Assigns work to workers: When a worker requests a task, the scheduler selects the highest-priority task whose dependencies have all been satisfied and whose required resources are available.
Tracks task lifecycle: Every task progresses through a defined set of states (PENDING, RUNNING, DONE, FAILED, DISABLED), and the scheduler records each transition.
Persists state to disk: The scheduler periodically serializes its in-memory state to a pickle file, allowing it to recover the task graph after a process restart without losing knowledge of completed or in-progress tasks.
Prunes stale entries: A periodic pruning loop removes completed tasks that have exceeded their retention period, disconnected workers that have stopped sending heartbeats, and expired failure records.
Exposes a REST API: The scheduler listens on a network port and accepts JSON-encoded RPC calls over HTTP, enabling language-agnostic communication with workers and monitoring tools.

This pattern separates the decision-making (what to run next) from the execution (actually running the task), enabling horizontal scaling of workers while keeping scheduling logic centralized and consistent.

Usage

Deploy a Scheduler Daemon when:

Pipelines run across multiple machines or processes and need a single source of truth for task state.
Tasks have complex dependency chains that require global coordination to resolve correctly.
Production reliability demands that scheduler state survives process restarts and system reboots.
Operations teams need visibility into pipeline state through a web interface or API.
Resource constraints (e.g., limited database connections) require centralized allocation across workers.

Theoretical Basis

The Scheduler Daemon implements a centralized work-stealing scheduler with the following algorithmic properties:

Task registration: Workers submit tasks to the scheduler via add_task RPC calls. Each task declaration includes the task identifier, its dependencies, priority, resource requirements, and the declaring worker. The scheduler inserts or updates the task in its internal graph.
Dependency resolution: When a worker calls get_work, the scheduler traverses the task graph to find tasks in PENDING state whose dependencies are all in DONE state. Among eligible tasks, it selects the one with the highest priority.
Resource gating: Before assigning a task, the scheduler checks that the task's resource requirements can be satisfied given currently running tasks. If resources are insufficient, the task is skipped and the next eligible candidate is considered.
Failure management: Tasks that fail are tracked against a configurable retry policy defined by three parameters: retry_count (maximum failures before disabling), disable_window (time window for counting failures), and disable_hard_timeout (maximum wall-clock time before forced disable). When a task exceeds its retry budget, the scheduler moves it to DISABLED state.
State persistence: The scheduler's in-memory state (all tasks, workers, and their relationships) is serialized to a pickle file at state_path. On startup, the scheduler loads this file to restore the previous state. On shutdown (via SIGINT, SIGTERM, or SIGQUIT), the state is dumped before the process exits.
Periodic pruning: A timer (default every 60 seconds) triggers a pruning cycle that removes workers that have not sent a heartbeat within worker_disconnect_delay seconds, tasks that have been completed for longer than remove_delay seconds, and expired batch email records.
Daemonization: For production deployments, the scheduler process can detach from the terminal using the python-daemon library, redirecting stdout/stderr to log files and writing a PID file for process management.

Related Pages

Implementation:Spotify_Luigi_Luigid_Server

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment