Principle:Spotify Luigi Task Persistence History
Overview
Task Persistence and History is the practice of durably recording every task execution event -- scheduling, start, completion, and failure -- to a persistent storage backend for auditing, debugging, and operational analysis.
Description
Pipeline schedulers maintain task state in memory for real-time scheduling decisions. However, in-memory state is volatile: it is lost on process restart and cannot answer historical questions such as "How many times did this task fail last week?" or "What was the average runtime of this task over the past month?"
Task Persistence and History addresses this by introducing two complementary persistence mechanisms:
- Scheduler state persistence: The scheduler's complete in-memory state -- all tasks, workers, and their relationships -- is periodically serialized to a file (typically a pickle file). On startup, the scheduler loads this file to restore its state, enabling seamless recovery after restarts without losing knowledge of in-progress or completed tasks.
- Task event history: A separate database-backed system records discrete events in the task lifecycle (PENDING, RUNNING, DONE, FAILED) with timestamps, task parameters, and host information. This creates an append-only audit trail that survives scheduler restarts and supports historical queries.
Together, these mechanisms serve different purposes:
- State persistence ensures operational continuity -- the scheduler can resume where it left off.
- Event history enables observability and auditing -- operators can query what happened, when, and why.
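This separation borrows from the Event Sourcing pattern: the current state lives in the in-memory model (optimized for scheduling decisions), while the historical record is stored as an append-only event log (optimized for queries and analysis). Unlike strict event sourcing, the scheduler does not rebuild its state by replaying the log; recovery relies on the state snapshot.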
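To make the append-only idea concrete, here is a minimal sketch of an event log that can answer a question like "How many times did this task fail last week?". It uses Python's built-in sqlite3 module and a single hypothetical table (`task_event_log`); the table layout, the helper names `record_event` and `count_failures_since`, and the sample task name are assumptions made for illustration, not the schema or API that Luigi uses.

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical single-table event log, for illustration only.
SCHEMA = """
CREATE TABLE task_event_log (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    task_id TEXT NOT NULL,
    event   TEXT NOT NULL,
    host    TEXT,
    ts      TEXT NOT NULL
)
"""


def record_event(conn, task_id, event, ts, host="worker-1"):
    """Append one lifecycle event; existing rows are never updated or deleted."""
    conn.execute(
        "INSERT INTO task_event_log (task_id, event, host, ts) VALUES (?, ?, ?, ?)",
        # Fixed-width ISO-8601 strings sort in chronological order.
        (task_id, event, host, ts.isoformat(timespec="seconds")),
    )


def count_failures_since(conn, task_id, since):
    """Count FAILED events for one task at or after the given time."""
    row = conn.execute(
        "SELECT COUNT(*) FROM task_event_log "
        "WHERE task_id = ? AND event = 'FAILED' AND ts >= ?",
        (task_id, since.isoformat(timespec="seconds")),
    ).fetchone()
    return row[0]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    now = datetime(2024, 1, 15, 12, 0, 0)
    # One old failure, then a recent run that failed, then a successful retry.
    record_event(conn, "DailyReport", "FAILED", now - timedelta(days=10))
    record_event(conn, "DailyReport", "RUNNING", now - timedelta(days=2))
    record_event(conn, "DailyReport", "FAILED", now - timedelta(days=1, hours=23))
    record_event(conn, "DailyReport", "DONE", now - timedelta(hours=3))
    # Only failures from the last seven days are counted.
    return count_failures_since(conn, "DailyReport", now - timedelta(days=7))
```

Because rows are only ever inserted, the log keeps the failed attempt even after the task later reaches DONE, which is exactly the information an in-memory "current status" field would overwrite.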
Usage
Use task persistence and history when:
- Production schedulers must survive process restarts without losing track of completed or in-progress tasks.
- Compliance or auditing requirements mandate a record of all task executions and their outcomes.
- Operators need to investigate failures by examining the sequence of events leading to a task's current state.
- Performance monitoring requires historical runtime data to detect regressions or plan capacity.
- The web-based visualizer needs to display task execution trends over time.
Theoretical Basis
Task persistence and history operates through two independent subsystems:
Scheduler State Persistence
- Serialization format: The scheduler's `SimpleTaskState` object, containing all task records, worker registrations, and their relationships, is serialized using Python's `pickle` module to a file at a configurable `state_path` (default: `/var/lib/luigi-server/state.pickle`).
- Load on startup: When the scheduler starts, it calls `load()`, which reads and deserializes the pickle file if it exists. This restores the task graph to its previous state, including tasks in PENDING, RUNNING, DONE, FAILED, and DISABLED states.
- Dump on shutdown: When the scheduler receives a shutdown signal (SIGINT, SIGTERM, SIGQUIT) or the process exits via `atexit`, it calls `dump()` to serialize the current state to disk. This preserves state across signal-driven shutdowns as well as normal exits; a hard kill such as SIGKILL bypasses these handlers.
- Atomic safety: The pickle file represents a point-in-time snapshot. If the process crashes between dumps, the last successful dump is used, which may result in some recent state changes being lost. This is an acceptable trade-off for the simplicity and speed of the mechanism.
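The load/dump cycle described above can be sketched in a few lines. This is a simplified illustration, not Luigi's implementation: `StateStore` is a hypothetical stand-in for the scheduler state object, and writing to a temporary file followed by `os.replace` is an extra safeguard assumed here so that a crash in the middle of a dump cannot leave a truncated snapshot behind.

```python
import os
import pickle
import tempfile


class StateStore:
    """Hypothetical stand-in for a scheduler's in-memory state (not Luigi's class)."""

    def __init__(self, state_path):
        self.state_path = state_path
        self.tasks = {}  # task_id -> status string

    def dump(self):
        """Write a point-in-time snapshot, replacing the previous one in one step."""
        directory = os.path.dirname(os.path.abspath(self.state_path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        try:
            with os.fdopen(fd, "wb") as fobj:
                pickle.dump(self.tasks, fobj)
            # Assumption of this sketch: swap the finished file into place so
            # readers never observe a half-written snapshot.
            os.replace(tmp_path, self.state_path)
        except BaseException:
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
            raise

    def load(self):
        """Restore the last snapshot if one exists; return False when starting empty."""
        if not os.path.exists(self.state_path):
            return False
        with open(self.state_path, "rb") as fobj:
            self.tasks = pickle.load(fobj)
        return True


def demo():
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "state.pickle")

        first = StateStore(path)
        first.load()  # no snapshot yet, so the store starts empty
        first.tasks["DailyReport"] = "DONE"
        first.tasks["HourlyAgg"] = "RUNNING"
        first.dump()
        # This change happens after the last dump and is never persisted,
        # which models a crash between dumps.
        first.tasks["HourlyAgg"] = "DONE"

        restarted = StateStore(path)
        restarted.load()
        return restarted.tasks
```

After the simulated restart, `HourlyAgg` is still recorded as RUNNING: the snapshot only reflects state as of the last successful dump. In a long-running process, `dump` could additionally be registered with `atexit.register` and called from signal handlers. Note that unpickling executes code embedded in the file, so a state file should only be loaded from a trusted location.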
Task Event History
- Event recording: Each lifecycle transition (scheduled, started, finished/failed) generates a `TaskEvent` record with a timestamp and event name. These events are associated with a `TaskRecord` that stores the task's name, ID, parameters, and executing host.
- Relational storage: Events are stored in a relational database (any SQLAlchemy-supported backend) across three tables: `tasks` (task records), `task_events` (lifecycle events with timestamps), and `task_parameters` (key-value pairs for task parameters).
- Query capabilities: The history system supports queries by task name, task ID, record ID, parameter values, and time range (e.g., "all tasks updated in the past 24 hours").
- Opt-in activation: Task history recording is disabled by default (`record_task_history = False` in the `[scheduler]` configuration section) and must be explicitly enabled along with a database connection string in the `[task_history]` section.
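As an illustration of the opt-in settings, the sketch below embeds an example configuration and reads it with Python's standard configparser module. The section names, `record_task_history`, and `state_path` come from the text above. The connection-string option is assumed to be named `db_connection`, which should be checked against the configuration reference for the installed Luigi version. The file paths and the SQLite URL are placeholder values, and `read_history_settings` is a hypothetical helper, not part of Luigi.

```python
import configparser

# Example configuration text. Paths and the connection string are placeholders;
# the db_connection option name is an assumption to verify for your version.
EXAMPLE_CFG = """
[scheduler]
record_task_history = true
state_path = /var/lib/luigi-server/state.pickle

[task_history]
db_connection = sqlite:////var/lib/luigi-server/task-history.db
"""


def read_history_settings(text):
    """Return (history_enabled, db_connection) from configuration text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # History is off unless explicitly enabled, mirroring the default above.
    enabled = parser.getboolean("scheduler", "record_task_history", fallback=False)
    db_connection = parser.get("task_history", "db_connection", fallback=None)
    return enabled, db_connection
```

With an empty configuration the helper returns `(False, None)`, which reflects that history recording stays disabled until both settings are supplied.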