Principle:ArroyoSystems Arroyo Checkpoint Triggering
Checkpoint Triggering
Principle: Timer-based triggering of distributed checkpoints. The controller periodically initiates checkpoints based on a configured interval, coordinating the start of a new checkpoint epoch across all operators.
Theoretical Basis
Periodic checkpointing in stream processing commonly builds on the Chandy-Lamport distributed snapshot algorithm. The trigger mechanism is the entry point into each snapshot cycle and must satisfy several invariants to maintain correctness and efficiency.
Core Invariants
- No concurrent checkpoints: Only one checkpoint may be in-flight at a time. Allowing multiple concurrent checkpoints would create ambiguity in barrier ordering and state serialization, potentially leading to inconsistent snapshots.
- No interference with cleanup tasks: Checkpoint triggering must be deferred while old checkpoint data is being cleaned up. Initiating a new checkpoint while garbage-collecting a previous epoch's state could cause data races on shared storage paths.
- Configurable interval: The checkpoint interval balances two competing concerns:
  - Shorter intervals reduce recovery time (less work to replay) but increase overhead (more frequent state serialization and I/O).
  - Longer intervals reduce overhead but increase recovery time and the amount of reprocessed data after failure.
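The three invariants above can be read as a single guard that the controller evaluates on each timer tick. The following is a minimal sketch under stated assumptions, not Arroyo's actual code: the `TriggerState` struct, its field names, and `should_trigger` are hypothetical names introduced for illustration.

```rust
use std::time::{Duration, Instant};

/// Hypothetical controller-side view of the trigger conditions.
/// Struct and field names are illustrative assumptions, not Arroyo's API.
#[derive(Debug, Clone)]
pub struct TriggerState {
    /// Configured checkpoint interval.
    pub interval: Duration,
    /// Time at which the previous checkpoint was triggered.
    pub last_checkpoint: Instant,
    /// True while a checkpoint is still in flight.
    pub checkpoint_in_flight: bool,
    /// True while old checkpoint data is being cleaned up.
    pub cleanup_running: bool,
}

impl TriggerState {
    /// Returns true only when all three invariants permit a new checkpoint:
    /// the interval has elapsed, no checkpoint is in flight, and no cleanup
    /// task is running.
    pub fn should_trigger(&self, now: Instant) -> bool {
        let interval_elapsed =
            now.saturating_duration_since(self.last_checkpoint) > self.interval;
        interval_elapsed && !self.checkpoint_in_flight && !self.cleanup_running
    }
}
```

Keeping the guard as a pure function of the current time and two flags makes the deferral behavior easy to test: a tick that arrives during cleanup or during an in-flight checkpoint simply returns false, and the next tick re-evaluates the same conditions.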
Checkpoint Lifecycle
The triggering mechanism initiates the following sequence:
- The controller checks whether the elapsed time since the last checkpoint exceeds the configured interval.
- If no checkpoint is currently in progress and no cleanup task is running, the controller initiates a new checkpoint.
- A new CheckpointBarrier is created with the current epoch number and broadcast to all source operators.
- Source operators inject the barrier into their output streams, beginning the barrier propagation phase.
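The sequence above can be sketched as a single `tick` method on a controller. This is an illustrative sketch only: apart from the `CheckpointBarrier` name, every type, field, and method here (`Controller`, `tick`, `checkpoint_finished`, the `epoch` field, the use of `std::sync::mpsc` channels to reach sources) is an assumption made for the example. It also assumes the interval is measured from the moment the previous checkpoint was triggered, which the text does not specify.

```rust
use std::sync::mpsc::Sender;
use std::time::{Duration, Instant};

/// Barrier carrying the epoch number. The field layout is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CheckpointBarrier {
    pub epoch: u32,
}

/// Hypothetical controller; channels stand in for the links to source operators.
pub struct Controller {
    interval: Duration,
    last_checkpoint: Instant,
    epoch: u32,
    in_flight: bool,
    cleanup_running: bool,
    sources: Vec<Sender<CheckpointBarrier>>,
}

impl Controller {
    pub fn new(interval: Duration, now: Instant, sources: Vec<Sender<CheckpointBarrier>>) -> Self {
        Controller {
            interval,
            last_checkpoint: now,
            epoch: 0,
            in_flight: false,
            cleanup_running: false,
            sources,
        }
    }

    /// One timer tick: check the interval, check the guards, then create a
    /// barrier for the next epoch and broadcast it to every source.
    pub fn tick(&mut self, now: Instant) -> Option<CheckpointBarrier> {
        let due = now.saturating_duration_since(self.last_checkpoint) > self.interval;
        if !due || self.in_flight || self.cleanup_running {
            return None;
        }
        self.epoch += 1;
        let barrier = CheckpointBarrier { epoch: self.epoch };
        for source in &self.sources {
            // A disconnected source is ignored in this sketch; a real
            // controller would need to treat it as a failure.
            let _ = source.send(barrier);
        }
        self.in_flight = true;
        self.last_checkpoint = now;
        Some(barrier)
    }

    /// Called once the checkpoint has completed, allowing the next trigger.
    pub fn checkpoint_finished(&mut self) {
        self.in_flight = false;
    }
}
```

Each source would receive the barrier on its channel and inject it into its output stream. Subsequent calls to `tick` return `None` until `checkpoint_finished` is called, which is how this sketch enforces the single-in-flight invariant.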
Relationship to Chandy-Lamport
In the classical Chandy-Lamport algorithm, any process can initiate a snapshot by recording its own state and sending marker messages on all outgoing channels. In Arroyo's adaptation:
- The controller acts as the initiator, ensuring centralized coordination.
- Checkpoint barriers serve as the marker messages.
- The epoch number provides a total ordering of snapshots.
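To make the mapping concrete, the sketch below shows a source operator treating a barrier the way a Chandy-Lamport process treats a marker: it records its own state, then forwards the marker on every outgoing channel. All names here (`Message`, `SourceOperator`, `emit`, `on_barrier`) are hypothetical, the recorded state is reduced to a single read offset, and records are sent to every output for simplicity; none of this is taken from Arroyo's code.

```rust
use std::sync::mpsc::Sender;

/// Items flowing on a channel: data records or a checkpoint barrier (marker).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Message {
    Record(String),
    Barrier { epoch: u32 },
}

/// Hypothetical source operator whose only state is a read offset.
pub struct SourceOperator {
    pub offset: u64,
    pub outputs: Vec<Sender<Message>>,
    /// (epoch, offset) pairs recorded at each barrier.
    pub snapshots: Vec<(u32, u64)>,
}

impl SourceOperator {
    /// Emit one record downstream and advance the offset.
    pub fn emit(&mut self, payload: &str) {
        self.offset += 1;
        for out in &self.outputs {
            let _ = out.send(Message::Record(payload.to_string()));
        }
    }

    /// Handle a barrier from the controller: record local state first,
    /// then send the marker on all outgoing channels.
    pub fn on_barrier(&mut self, epoch: u32) {
        self.snapshots.push((epoch, self.offset));
        for out in &self.outputs {
            let _ = out.send(Message::Barrier { epoch });
        }
    }
}
```

Because the barrier travels in the same channel as the records, everything a downstream operator receives before the barrier belongs to the snapshot for that epoch, and everything after it belongs to the next one.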
Domains
- Stream_Processing: Checkpointing is fundamental to fault-tolerant stream processing, enabling recovery without data loss.
- Fault_Tolerance: Periodic checkpoints bound the amount of reprocessing needed after failure.
- Checkpointing: The trigger mechanism is the first phase of the checkpoint protocol.
Related Implementation
- Implementation:ArroyoSystems_Arroyo_Checkpoint_Trigger
- Heuristic:ArroyoSystems_Arroyo_Checkpoint_Interval_Tuning