Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Heuristic:ArroyoSystems Arroyo Stateful Operator TTL

From Leeroopedia




Knowledge Sources
Domains Optimization, Stream_Processing
Last Updated 2026-02-08 08:00 GMT

Overview

Stateful operator TTL heuristic: always explicitly set TTL for updating aggregates and expiring joins to avoid the 24-hour default, which causes unbounded state growth for high-cardinality keys.

Description

Arroyo's stateful operators (incremental aggregates, join-with-expiration) maintain state keyed by group-by columns. Each state entry has a Time-To-Live (TTL) that determines when old entries are evicted. If TTL is not explicitly configured, the system logs a warning and falls back to a 24-hour default. For high-cardinality group keys (e.g., user IDs, IP addresses), the 24-hour default can lead to significant state growth, increasing checkpoint size and recovery time.

Usage

Apply this heuristic when using updating/incremental aggregates or joins with expiration in SQL pipelines. Always set the TTL explicitly based on your data retention requirements. Monitor checkpoint sizes to detect state growth problems.

The Insight (Rule of Thumb)

  • Action: Always set TTL explicitly for updating aggregates and expiring joins. Watch for the warning: "ttl was not set for updating aggregate".
  • Value: Default fallback: 24 hours (86,400,000,000 microseconds). Set based on business logic: typical range 1 minute to 24 hours.
  • Trade-off: Shorter TTL = smaller state + faster checkpoints + risk of losing late-arriving data. Longer TTL = larger state + slower checkpoints + better late data handling.
  • Signal: If you see checkpoint sizes growing linearly over time, TTL may be too long for your key cardinality.

Reasoning

The 24-hour default was chosen as a safe fallback that accommodates most use cases without immediate data loss. However, for a pipeline grouping by user_id with 1 million unique users, each maintaining aggregate state, 24 hours of accumulated state can become problematic:

  • Checkpoint serialization time grows linearly with state size
  • Recovery time increases as more Parquet files must be deserialized
  • Memory usage on workers grows until TTL-based eviction kicks in

For joins with expiration, the same principle applies: unmatched join sides accumulate in state until TTL expires them. The warning "TTL was not set for join with expiration" indicates the same 24-hour default is in effect.

Code Evidence

TTL fallback with warning from `incremental_aggregator.rs:1043-1048`:

let ttl = Duration::from_micros(if config.ttl_micros == 0 {
    warn!("ttl was not set for updating aggregate");
    24 * 60 * 60 * 1000 * 1000  // 24 hours in microseconds
} else {
    config.ttl_micros
});

Similar pattern for joins from `join_with_expiration.rs:246`:

warn!("TTL was not set for join with expiration");

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment