Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Heuristic:ArroyoSystems Arroyo Checkpoint Interval Tuning

From Leeroopedia




Knowledge Sources
Domains Optimization, Stream_Processing
Last Updated 2026-02-08 08:00 GMT

Overview

Checkpoint interval tuning heuristic: the default 10-second interval balances state consistency against I/O overhead, with compaction available to manage accumulated checkpoint files.

Description

Arroyo uses periodic checkpointing (Chandy-Lamport style barrier propagation) to capture consistent distributed snapshots of pipeline state. The checkpoint interval determines how frequently these snapshots are taken. Each checkpoint involves barrier propagation through the dataflow graph, state serialization to Parquet format, and upload to object storage. The interval directly affects recovery time (lower = less work to replay), storage costs (lower = more checkpoint files), and throughput (lower = more I/O overhead).

Usage

Apply this heuristic when tuning pipeline performance or configuring recovery guarantees. If you observe high checkpoint overhead (visible in metrics), increase the interval. If you need faster recovery from failures, decrease the interval. For exactly-once sink delivery via two-phase commit, the checkpoint interval also determines the commit frequency.

The Insight (Rule of Thumb)

  • Action: Set `default-checkpoint-interval` in configuration or use `ARROYO__DEFAULT_CHECKPOINT_INTERVAL` environment variable.
  • Value: Default is `10s`. Range: 1s (aggressive) to 5m (relaxed). Typical production: 10s-60s.
  • Trade-off: Lower interval = faster recovery + more I/O overhead + more storage. Higher interval = less overhead + slower recovery + potential data replay on failure.
  • Compaction: Enable `pipeline.compaction.enabled = true` with `checkpoints-to-compact = 4` to merge small checkpoint files and reduce storage fragmentation.

Reasoning

Each checkpoint triggers:

  1. Barrier injection at sources
  2. Barrier alignment at multi-input operators
  3. State serialization (Parquet) at every stateful operator
  4. Upload to object storage
  5. Coordination acknowledgment at the controller

For pipelines with large state (e.g., windowed aggregations with many keys), the serialization and upload phases dominate. The 10-second default provides a good balance for typical workloads. The compaction feature (disabled by default) can merge multiple small checkpoint files to reduce the number of files in object storage.

The healthy-duration setting (default 2m) interacts with checkpointing: after 2 minutes of healthy operation, the restart counter resets. This means checkpoint intervals much longer than 2 minutes could result in significant state loss during recovery.

Code Evidence

Default configuration from `default.toml:1-2`:

checkpoint-url = "/tmp/arroyo/checkpoints"
default-checkpoint-interval = "10s"

Compaction configuration from `default.toml:17-19`:

[pipeline.compaction]
enabled = false
checkpoints-to-compact = 4

Checkpoint trigger from `job_controller/mod.rs:843-850`:

// Checkpoint is triggered on a timer interval
// configured via default-checkpoint-interval

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment