Principle:Spotify Luigi Time Range Scheduling

Knowledge Sources	Spotify_Luigi Luigi Docs
Domains	Scheduling, Time_Series
Last Updated	2026-02-10 08:00 GMT

Overview

Time range scheduling produces contiguous, completed ranges of recurring time-parameterized tasks so that data is complete across a time interval.

Description

Time range scheduling is the practice of automatically identifying and executing all incomplete instances of a recurring time-parameterized task across a specified time interval. Many data pipelines operate on time-partitioned data: hourly log aggregation, daily report generation, weekly rollups. When a pipeline instance fails or is not run for a period, gaps appear in the data. Time range scheduling solves this by examining the full range of expected time intervals, identifying which intervals have not been successfully completed, and scheduling the missing tasks. This is commonly known as backfilling. Rather than requiring operators to manually identify and trigger missing intervals, the system automatically determines the contiguous range of completed data and works to extend it.

Usage

Use time range scheduling when the pipeline processes time-partitioned data on a recurring basis, when gaps in data coverage are unacceptable, when backfilling historical data is needed after deploying new pipeline logic, or when the pipeline must guarantee that all time intervals within a range have been processed before downstream consumers can rely on the data.

Theoretical Basis

Time range scheduling is based on interval completeness analysis over a discrete time domain:

1. Time Domain Discretization -- The continuous time axis is divided into discrete intervals of uniform duration (hourly, daily, weekly). Each interval is identified by its start timestamp:
   intervals = {t_start, t_start + delta, t_start + 2*delta, ..., t_end - delta}
   where delta is the interval duration. The range is half-open: the last interval starts at t_end - delta, and t_end itself is excluded.
2. Completeness Scan -- For each interval in the range, the system checks whether the corresponding task instance has been successfully completed. This produces a boolean vector:
   completeness[i] = EXISTS(output(task(interval[i])))
3. Gap Identification -- The system identifies the set of incomplete intervals:
   gaps = {interval[i] : completeness[i] = FALSE}
4. Scheduling Strategy -- Several strategies exist for ordering the execution of missing intervals:
   * Forward fill -- Process gaps from oldest to newest, ensuring temporal ordering of outputs
   * Reverse fill -- Process gaps from newest to oldest, prioritizing the most recent data
   * Contiguous extension -- Find the latest contiguous completed range and extend it forward, ensuring no internal gaps
5. Dependency Propagation -- Each time-parameterized task instance may have its own dependencies (for example, upstream data for the same time interval). The scheduler runs a task instance only once its specific dependencies are satisfied.
6. Boundary Management -- The range is bounded by:
   * Start boundary -- The earliest interval to consider (configured or derived from data availability)
   * End boundary -- Typically the current time minus a lateness allowance, accounting for the fact that data for the most recent intervals may not yet be available
7. Convergence -- On each scheduling cycle, the system processes a batch of missing intervals. Over successive cycles, the set of gaps shrinks until the entire range is complete, at which point only the newest interval (as time advances) needs processing.
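
Steps 1 through 4 and the end boundary from step 6 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Luigi's implementation: the function names (discretize, end_boundary, find_gaps, order_gaps) are hypothetical, and completion is modeled as membership in an in-memory set rather than a check for real task output.

```python
from datetime import datetime, timedelta


def discretize(start, end, delta):
    """Return interval start timestamps in the half-open range [start, end)."""
    intervals = []
    t = start
    while t < end:
        intervals.append(t)
        t += delta
    return intervals


def end_boundary(start, now, delta, lateness):
    """Exclusive end of the range: now minus the lateness allowance,
    rounded down to the interval grid anchored at start."""
    usable = now - lateness
    if usable <= start:
        return start
    return start + ((usable - start) // delta) * delta


def find_gaps(intervals, is_complete):
    """Completeness scan (boolean vector) followed by gap identification."""
    completeness = [is_complete(t) for t in intervals]
    return [t for t, ok in zip(intervals, completeness) if not ok]


def order_gaps(gaps, reverse=False):
    """Forward fill (oldest first) by default, reverse fill (newest first) if requested."""
    return sorted(gaps, reverse=reverse)


# Hypothetical state: hourly partitions whose output already exists.
start = datetime(2024, 1, 1, 0)
now = datetime(2024, 1, 1, 6, 40)
delta = timedelta(hours=1)
completed = {datetime(2024, 1, 1, h) for h in (0, 1, 3)}

end = end_boundary(start, now, delta, lateness=timedelta(hours=1))
intervals = discretize(start, end, delta)
gaps = find_gaps(intervals, lambda t: t in completed)
forward = order_gaps(gaps)
backward = order_gaps(gaps, reverse=True)
```

With a one-hour lateness allowance, the range ends at 05:00, so the 05:00 interval is deliberately left out of the scan; the gaps found are the 02:00 and 04:00 intervals.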

The key invariant is range contiguity: the system aims to maintain a contiguous block of completed intervals with no internal gaps, which is essential for downstream consumers that assume complete data coverage.
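
The contiguity invariant and the convergence behaviour can be illustrated with a small simulation. The sketch below makes simplifying assumptions that are not taken from the source: the contiguous range is measured as a prefix starting at the start boundary, "running" a task just adds its interval to a set, and the helper names (contiguous_prefix, run_cycle) and the batch size are hypothetical.

```python
from datetime import datetime, timedelta


def contiguous_prefix(intervals, completed):
    """Count the leading intervals that are all complete (no internal gaps)."""
    n = 0
    for t in intervals:
        if t not in completed:
            break
        n += 1
    return n


def run_cycle(intervals, completed, batch_size):
    """One scheduling cycle: run up to batch_size of the oldest missing
    intervals, which extends the contiguous range forward."""
    missing = [t for t in intervals if t not in completed]
    batch = missing[:batch_size]
    for t in batch:
        completed.add(t)  # stand-in for actually executing the task
    return len(batch)


# Hypothetical daily range of seven days with gaps on days 3, 5 and 7.
start = datetime(2024, 1, 1)
intervals = [start + timedelta(days=i) for i in range(7)]
completed = {intervals[i] for i in (0, 1, 3, 5)}

# Track how the contiguous block grows across successive cycles.
prefix_history = [contiguous_prefix(intervals, completed)]
while run_cycle(intervals, completed, batch_size=2) > 0:
    prefix_history.append(contiguous_prefix(intervals, completed))
```

Filling the oldest gaps first means each cycle can only lengthen the contiguous block, and intervals that were already complete beyond a gap are absorbed as soon as that gap closes. The loop stops once a cycle finds nothing left to run.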

Related Pages

Implementation:Spotify_Luigi_RangeTask
