Heuristic:ArroyoSystems Arroyo Checkpoint Interval Tuning
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Stream_Processing |
| Last Updated | 2026-02-08 08:00 GMT |
Overview
Checkpoint interval tuning heuristic: the default 10-second interval balances state consistency against I/O overhead, with compaction available to manage accumulated checkpoint files.
Description
Arroyo uses periodic checkpointing (Chandy-Lamport style barrier propagation) to capture consistent distributed snapshots of pipeline state. The checkpoint interval determines how frequently these snapshots are taken. Each checkpoint involves barrier propagation through the dataflow graph, state serialization to Parquet format, and upload to object storage. The interval directly affects recovery time (lower = less work to replay), storage costs (lower = more checkpoint files), and throughput (lower = more I/O overhead).
Usage
Apply this heuristic when tuning pipeline performance or configuring recovery guarantees. If you observe high checkpoint overhead (visible in metrics), increase the interval. If you need faster recovery from failures, decrease the interval. For exactly-once sink delivery via two-phase commit, the checkpoint interval also determines the commit frequency.
The Insight (Rule of Thumb)
- Action: Set `default-checkpoint-interval` in configuration or use `ARROYO__DEFAULT_CHECKPOINT_INTERVAL` environment variable.
- Value: Default is `10s`. Range: 1s (aggressive) to 5m (relaxed). Typical production: 10s-60s.
- Trade-off: Lower interval = faster recovery + more I/O overhead + more storage. Higher interval = less overhead + slower recovery + potential data replay on failure.
- Compaction: Enable `pipeline.compaction.enabled = true` with `checkpoints-to-compact = 4` to merge small checkpoint files and reduce storage fragmentation.
Reasoning
Each checkpoint triggers:
- Barrier injection at sources
- Barrier alignment at multi-input operators
- State serialization (Parquet) at every stateful operator
- Upload to object storage
- Coordination acknowledgment at the controller
For pipelines with large state (e.g., windowed aggregations with many keys), the serialization and upload phases dominate. The 10-second default provides a good balance for typical workloads. The compaction feature (disabled by default) can merge multiple small checkpoint files to reduce the number of files in object storage.
The healthy-duration setting (default 2m) interacts with checkpointing: after 2 minutes of healthy operation, the restart counter resets. This means checkpoint intervals much longer than 2 minutes could result in significant state loss during recovery.
Code Evidence
Default configuration from `default.toml:1-2`:
checkpoint-url = "/tmp/arroyo/checkpoints"
default-checkpoint-interval = "10s"
Compaction configuration from `default.toml:17-19`:
[pipeline.compaction]
enabled = false
checkpoints-to-compact = 4
Checkpoint trigger from `job_controller/mod.rs:843-850`:
// Checkpoint is triggered on a timer interval
// configured via default-checkpoint-interval