Principle:Spotify Luigi Task Persistence History

Overview

Task Persistence and History is the practice of durably recording every task execution event -- scheduling, start, completion, and failure -- to a persistent storage backend for auditing, debugging, and operational analysis.

Description

Pipeline schedulers maintain task state in memory for real-time scheduling decisions. However, in-memory state is volatile: it is lost on process restart and cannot answer historical questions such as "How many times did this task fail last week?" or "What was the average runtime of this task over the past month?"

Task Persistence and History addresses this by introducing two complementary persistence mechanisms:

Scheduler state persistence: The scheduler's complete in-memory state -- all tasks, workers, and their relationships -- is serialized to a file (a pickle file by default) when the scheduler shuts down. On startup, the scheduler loads this file to restore its state, so a restart does not discard knowledge of the in-progress or completed tasks captured in the last snapshot.
Task event history: A separate database-backed system records discrete events in the task lifecycle (PENDING, RUNNING, DONE, FAILED) with timestamps, task parameters, and host information. This creates an append-only audit trail that survives scheduler restarts and supports historical queries.

Together, these mechanisms serve different purposes:

State persistence ensures operational continuity -- the scheduler can resume where it left off.
Event history enables observability and auditing -- operators can query what happened, when, and why.

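This separation loosely resembles the Event Sourcing pattern, with one important difference: the scheduler does not rebuild its current state by replaying the event log. The current state lives in the in-memory model (optimized for scheduling decisions) and is restored from the pickle snapshot, while the historical record is stored as an append-only event log (optimized for queries and analysis).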
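As a rough illustration of why an append-only log suits historical questions, the sketch below answers the two questions posed earlier (failure count and average runtime) from a plain list of events. The event tuples, the task ID, and both helper functions are hypothetical and are not part of Luigi's API.

```python
from datetime import datetime

# Hypothetical append-only event log: (task_id, event_name, timestamp).
EVENTS = [
    ("LoadUsers_2024_01_01", "PENDING", datetime(2024, 1, 1, 8, 0, 0)),
    ("LoadUsers_2024_01_01", "RUNNING", datetime(2024, 1, 1, 8, 0, 5)),
    ("LoadUsers_2024_01_01", "FAILED", datetime(2024, 1, 1, 8, 1, 5)),
    ("LoadUsers_2024_01_01", "RUNNING", datetime(2024, 1, 1, 8, 5, 0)),
    ("LoadUsers_2024_01_01", "DONE", datetime(2024, 1, 1, 8, 7, 0)),
]


def failure_count(events, task_id, since):
    """Count FAILED events for one task at or after the `since` timestamp."""
    return sum(
        1
        for tid, name, ts in events
        if tid == task_id and name == "FAILED" and ts >= since
    )


def average_runtime(events, task_id):
    """Mean seconds between each RUNNING event and the next DONE or FAILED event.

    Returns None when the task has no completed run in the log.
    """
    started = None
    durations = []
    for tid, name, ts in sorted(events, key=lambda event: event[2]):
        if tid != task_id:
            continue
        if name == "RUNNING":
            started = ts
        elif name in ("DONE", "FAILED") and started is not None:
            durations.append((ts - started).total_seconds())
            started = None
    return sum(durations) / len(durations) if durations else None
```

Neither question can be answered from the scheduler's current state alone, because that state only holds the latest status of each task.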

Usage

Use task persistence and history when:

Production schedulers must survive process restarts without losing track of completed or in-progress tasks.
Compliance or auditing requirements mandate a record of all task executions and their outcomes.
Operators need to investigate failures by examining the sequence of events leading to a task's current state.
Performance monitoring requires historical runtime data to detect regressions or plan capacity.
The web-based visualizer needs to display task execution trends over time.

Theoretical Basis

Task persistence and history operates through two independent subsystems:

Scheduler State Persistence

Serialization format: The scheduler's SimpleTaskState object, containing all task records, worker registrations, and their relationships, is serialized using Python's pickle module to a file at a configurable state_path (default: /var/lib/luigi-server/state.pickle).
Load on startup: When the scheduler starts, it calls load(), which reads and deserializes the pickle file if it exists. This restores the task graph to its previous state, including tasks in PENDING, RUNNING, DONE, FAILED, and DISABLED states.
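Dump on shutdown: When the scheduler receives a shutdown signal (SIGINT, SIGTERM, SIGQUIT) or the process exits normally (via an atexit handler), it calls dump() to serialize the current state to disk. This covers signal-driven shutdowns and ordinary interpreter exits. A hard kill (SIGKILL) or an interpreter crash bypasses both paths, so no dump is written in those cases.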
Snapshot semantics: The pickle file represents a point-in-time snapshot. If the process crashes before writing a new dump, the next startup loads the last successful dump, and any state changes made since then are lost. This trade-off is accepted in exchange for the simplicity and speed of the mechanism.
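The load-on-startup and dump-on-shutdown cycle described above can be sketched as follows. This is a minimal sketch and not Luigi's implementation: SketchState and install_shutdown_hooks are hypothetical names, the state is reduced to a single dictionary, and the write-then-rename step in dump() is a design choice of this sketch rather than something the source describes.

```python
import atexit
import os
import pickle
import signal
import sys
import tempfile


class SketchState:
    """Hypothetical stand-in for the scheduler's in-memory state."""

    def __init__(self, state_path):
        self.state_path = state_path
        self.tasks = {}  # task_id -> status string, e.g. "PENDING" or "DONE"

    def dump(self):
        # Write to a temporary file in the same directory, then rename it
        # over the target, so an interrupted write cannot leave a truncated
        # snapshot behind. (Assumption of this sketch.)
        directory = os.path.dirname(self.state_path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "wb") as fobj:
            pickle.dump(self.tasks, fobj)
        os.replace(tmp_path, self.state_path)

    def load(self):
        # A missing file simply means there is no previous state to restore.
        if not os.path.exists(self.state_path):
            return
        # Only unpickle files from a trusted location: pickle can execute
        # arbitrary code while loading.
        with open(self.state_path, "rb") as fobj:
            self.tasks = pickle.load(fobj)


def install_shutdown_hooks(state):
    """Dump state on normal exit and on SIGINT, SIGTERM, and SIGQUIT."""
    atexit.register(state.dump)

    def handler(signum, frame):
        # sys.exit() raises SystemExit, which lets the atexit hook run.
        sys.exit(0)

    for name in ("SIGINT", "SIGTERM", "SIGQUIT"):
        sig = getattr(signal, name, None)  # SIGQUIT is absent on Windows
        if sig is not None:
            signal.signal(sig, handler)
```

A process would typically construct the state, call load(), and then call install_shutdown_hooks(state) once from its main thread before starting its event loop.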

Task Event History

Event recording: Each lifecycle transition (scheduled, started, finished/failed) generates a TaskEvent record with a timestamp and event name. These events are associated with a TaskRecord that stores the task's name, ID, parameters, and executing host.
Relational storage: Events are stored in a relational database (any SQLAlchemy-supported backend) across three tables: tasks (task records), task_events (lifecycle events with timestamps), and task_parameters (key-value pairs for task parameters).
Query capabilities: The history system supports queries by task name, task ID, record ID, parameter values, and time range (e.g., "all tasks updated in the past 24 hours").
Opt-in activation: Task history recording is disabled by default (record_task_history = False in the [scheduler] configuration section) and must be explicitly enabled along with a database connection string in the [task_history] section.
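The four points above can be tied together in a small sketch. It uses Python's built-in sqlite3 with an in-memory database instead of SQLAlchemy, so the db_connection value is parsed only for illustration and is a made-up example. The column names are a simplified assumption about the three tables, and history_enabled, open_history, record_event, and tasks_updated_since are hypothetical helpers, not Luigi functions.

```python
import configparser
import sqlite3

# Example configuration; the db_connection value is illustrative only.
CONFIG_TEXT = """
[scheduler]
record_task_history = True

[task_history]
db_connection = sqlite:///luigi-task-hist.db
"""

# Simplified, assumed layout of the three tables named above.
SCHEMA = """
CREATE TABLE tasks (
    id INTEGER PRIMARY KEY,
    task_id TEXT,
    name TEXT,
    host TEXT
);
CREATE TABLE task_events (
    id INTEGER PRIMARY KEY,
    task_id INTEGER REFERENCES tasks(id),
    event_name TEXT,
    ts REAL
);
CREATE TABLE task_parameters (
    task_id INTEGER REFERENCES tasks(id),
    name TEXT,
    value TEXT,
    PRIMARY KEY (task_id, name)
);
"""


def history_enabled(config_text):
    """Return the opt-in flag, defaulting to False when it is absent."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return parser.getboolean("scheduler", "record_task_history", fallback=False)


def open_history():
    """Create an in-memory database holding the three tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def record_event(conn, task_id, name, host, params, event_name, ts):
    """Append one lifecycle event, creating the task record on first sight."""
    row = conn.execute(
        "SELECT id FROM tasks WHERE task_id = ?", (task_id,)
    ).fetchone()
    if row is None:
        cur = conn.execute(
            "INSERT INTO tasks (task_id, name, host) VALUES (?, ?, ?)",
            (task_id, name, host),
        )
        record_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO task_parameters (task_id, name, value) VALUES (?, ?, ?)",
            [(record_id, key, str(value)) for key, value in params.items()],
        )
    else:
        record_id = row[0]
    conn.execute(
        "INSERT INTO task_events (task_id, event_name, ts) VALUES (?, ?, ?)",
        (record_id, event_name, ts),
    )
    return record_id


def tasks_updated_since(conn, cutoff_ts):
    """Return the IDs of tasks with at least one event at or after cutoff_ts."""
    rows = conn.execute(
        "SELECT DISTINCT t.task_id FROM tasks t "
        "JOIN task_events e ON e.task_id = t.id "
        "WHERE e.ts >= ? ORDER BY t.task_id",
        (cutoff_ts,),
    ).fetchall()
    return [row[0] for row in rows]
```

Passing time.time() - 24 * 3600 as cutoff_ts would express the "updated in the past 24 hours" query, assuming ts holds Unix timestamps.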

Related Pages

Implementation:Spotify_Luigi_DbTaskHistory
