Principle:ArroyoSystems Arroyo Barrier Propagation
Barrier Propagation
Principle: Propagating checkpoint barriers through a dataflow graph. Barriers flow from sources to sinks, and each operator aligns barriers from all input channels before checkpointing its state.
Theoretical Basis
Barrier-based checkpointing derives from the Chandy-Lamport distributed snapshot algorithm, adapted for directed acyclic dataflow graphs. The core insight is that marker messages (called barriers in stream processing) can delineate consistent cuts across a distributed computation without pausing the entire pipeline.
Algorithm Overview
- Injection: The controller injects barrier messages at all source operators, marking the boundary between epoch n and epoch n+1.
- Propagation: Each operator receives barriers on its input channels. When a barrier arrives on one input, the operator may buffer records from that input while waiting for barriers on remaining inputs.
- Alignment: Once barriers have been received on all input channels, the operator has reached a consistent point: all records from epoch n have been processed, and no records from epoch n+1 have been consumed.
- Checkpoint: At the alignment point, the operator snapshots its state (state tables, timers, watermarks) and forwards the barrier to downstream operators.
- Completion: When all sink operators have processed the barrier, the distributed snapshot for epoch n is complete.
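The two steps that involve the controller, injection and completion, amount to simple bookkeeping. The following is a minimal Python sketch of that bookkeeping, not Arroyo's implementation: `CheckpointCoordinator`, `start_checkpoint`, and `sink_finished` are hypothetical names, and operators are identified by plain strings.

```python
class CheckpointCoordinator:
    """Hypothetical controller-side bookkeeping for barrier injection and completion."""

    def __init__(self, sources, sinks):
        self.sources = list(sources)
        self.sinks = set(sinks)
        self.epoch = 0
        self.pending = {}    # epoch -> sinks that have not yet processed the barrier
        self.completed = []  # epochs whose snapshot is complete, in order

    def start_checkpoint(self):
        """Injection: open a new epoch and return one barrier per source."""
        self.epoch += 1
        self.pending[self.epoch] = set(self.sinks)
        return [(source, self.epoch) for source in self.sources]

    def sink_finished(self, sink, epoch):
        """Completion: record that a sink processed the barrier for `epoch`.

        Returns True once every sink has reported for that epoch.
        """
        waiting = self.pending[epoch]
        waiting.discard(sink)
        if waiting:
            return False
        del self.pending[epoch]
        self.completed.append(epoch)
        return True
```

With two sources and two sinks, `start_checkpoint()` yields one barrier per source, and the epoch is marked complete only after both sinks have called `sink_finished`.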
Alignment Semantics
Barrier alignment ensures a consistent cut across the dataflow graph:
- Before alignment: Records arriving on channels that have already delivered a barrier are buffered. Records on channels that have not yet delivered a barrier continue to be processed normally.
- At alignment: All input channels have delivered the barrier. The operator's state reflects exactly the records from epoch n and earlier.
- After alignment: The operator checkpoints its state, forwards the barrier, and resumes normal processing of buffered and new records.
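The three phases above can be sketched for a single multi-input operator. This is an illustrative Python model under simplifying assumptions, not Arroyo's code: `Barrier` and `AligningSum` are hypothetical names, channels are integer indices, records are integers, and the operator's entire state is a running sum.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Barrier:
    epoch: int


@dataclass
class AligningSum:
    """Hypothetical multi-input operator that sums integer records."""
    num_inputs: int
    total: int = 0
    blocked: set = field(default_factory=set)       # channels that already delivered the barrier
    buffered: deque = field(default_factory=deque)  # messages held back during alignment
    snapshots: dict = field(default_factory=dict)   # epoch -> state at the consistent cut
    forwarded: list = field(default_factory=list)   # barriers sent downstream

    def on_message(self, channel, msg):
        if channel in self.blocked:
            # Before alignment: this channel is past the barrier, so hold its messages.
            self.buffered.append((channel, msg))
        elif isinstance(msg, Barrier):
            self.blocked.add(channel)
            if len(self.blocked) == self.num_inputs:
                self._checkpoint(msg)  # At alignment: every channel delivered the barrier.
        else:
            self.total += msg  # Channel has not delivered the barrier: process normally.

    def _checkpoint(self, barrier):
        # After alignment: snapshot, forward the barrier, then replay buffered messages.
        self.snapshots[barrier.epoch] = self.total
        self.forwarded.append(barrier)
        self.blocked.clear()
        pending, self.buffered = self.buffered, deque()
        for channel, msg in pending:
            self.on_message(channel, msg)


op = AligningSum(num_inputs=2)
op.on_message(0, 1)
op.on_message(1, 10)
op.on_message(0, Barrier(1))  # channel 0 is now blocked
op.on_message(0, 100)         # epoch 2 record on a blocked channel: buffered
op.on_message(1, 20)          # epoch 1 record on an open channel: processed
op.on_message(1, Barrier(1))  # alignment: snapshot, forward, drain the buffer
```

In this run the snapshot for epoch 1 holds 31 (the sum 1 + 10 + 20), and the buffered record 100 is applied only after the snapshot is taken, leaving a running total of 131.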
Operator Responsibilities
Each operator handles the barrier by:
- Flushing pending state: Any buffered or in-flight computations are completed and reflected in state tables.
- Allowing framework persistence: The operator signals readiness (returns true) to let the framework persist state tables to durable storage.
- Forwarding the barrier: After checkpointing, the barrier is sent to all output channels.
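The division of labor between operator and framework might look like the following Python sketch. All names here (`BufferedCounter`, `handle_checkpoint`, `run_checkpoint`) are hypothetical, a plain dict stands in for durable storage, and the handling of a `False` return value is an assumption made for the example rather than documented behavior.

```python
import json


class BufferedCounter:
    """Hypothetical operator that batches updates before applying them to its state table."""

    def __init__(self):
        self.table = {}    # state table that the framework persists
        self.pending = []  # updates not yet reflected in the table

    def process(self, key):
        self.pending.append(key)

    def handle_checkpoint(self, epoch):
        # Flush pending state so the table reflects every record seen before the barrier.
        for key in self.pending:
            self.table[key] = self.table.get(key, 0) + 1
        self.pending.clear()
        return True  # signal that the framework may persist the table


def run_checkpoint(operator, epoch, durable_store, outputs):
    """Framework side: ask the operator to prepare, persist its table, forward the barrier."""
    if not operator.handle_checkpoint(epoch):
        return False  # assumption: an unready operator is neither persisted nor forwarded
    durable_store[epoch] = json.dumps(operator.table, sort_keys=True)
    for out in outputs:
        out.append(("barrier", epoch))
    return True
```

The ordering is the point of the sketch: the flush happens before persistence, and the barrier reaches the output channels only after the state has been written.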
Consistency Guarantees
Barrier propagation with alignment guarantees that the resulting snapshot is a consistent distributed snapshot:
- No message sent after a barrier is included in the snapshot state of the receiver.
- All messages sent before a barrier are included in the snapshot state of the receiver.
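- Together, these two properties provide exactly-once processing semantics upon recovery: after a restore, each record is reflected in operator state exactly once, provided the sources can replay records from the position captured in the checkpoint.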
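A small worked example shows why a consistent snapshot gives the same result after recovery as a failure-free run. The Python sketch below is a deliberately reduced model, not Arroyo's recovery path: `run` is a hypothetical function, the pipeline is a single source feeding a single summing operator, and the source is assumed to be replayable from a stored offset.

```python
def run(records, start_offset=0, state=0, checkpoint_at=None, crash_at=None):
    """Sum records[start_offset:], optionally checkpointing and crashing along the way.

    Returns (final_state, snapshot); final_state is None if the run crashed.
    """
    snapshot = None
    for offset in range(start_offset, len(records)):
        if offset == checkpoint_at:
            # The barrier sits just before records[offset], so the snapshot
            # covers exactly records[:offset] and nothing after it.
            snapshot = {"offset": offset, "state": state}
        if offset == crash_at:
            return None, snapshot  # in-memory state is lost
        state += records[offset]
    return state, snapshot


records = [3, 1, 4, 1, 5, 9]

# Failure-free run.
expected, _ = run(records)

# Run that checkpoints before offset 2 and crashes before offset 4.
crashed, snapshot = run(records, checkpoint_at=2, crash_at=4)

# Recovery: restore the snapshot state and replay the source from its offset.
recovered, _ = run(records, start_offset=snapshot["offset"], state=snapshot["state"])
```

Here the records at offsets 2 and 3 are processed twice in wall-clock terms, once before the crash and once during replay, but their first effect was discarded with the lost in-memory state, so `recovered` equals `expected`.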
Domains
- Stream_Processing: Barriers are the mechanism that enables checkpointing without stopping the stream.
- Fault_Tolerance: Consistent snapshots via barrier alignment enable recovery to a well-defined state.
- Distributed_Systems: Barrier propagation is a practical adaptation of the Chandy-Lamport snapshot algorithm for dataflow systems.