Heuristic:ArroyoSystems Arroyo Stateful Operator TTL
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Stream_Processing |
| Last Updated | 2026-02-08 08:00 GMT |
Overview
Stateful operator TTL heuristic: always explicitly set TTL for updating aggregates and expiring joins to avoid the 24-hour default, which causes unbounded state growth for high-cardinality keys.
Description
Arroyo's stateful operators (incremental aggregates, join-with-expiration) maintain state keyed by group-by columns. Each state entry has a Time-To-Live (TTL) that determines when old entries are evicted. If TTL is not explicitly configured, the system logs a warning and falls back to a 24-hour default. For high-cardinality group keys (e.g., user IDs, IP addresses), the 24-hour default can lead to significant state growth, increasing checkpoint size and recovery time.
Usage
Apply this heuristic when using updating/incremental aggregates or joins with expiration in SQL pipelines. Always set the TTL explicitly based on your data retention requirements. Monitor checkpoint sizes to detect state growth problems.
The Insight (Rule of Thumb)
- Action: Always set TTL explicitly for updating aggregates and expiring joins. Watch for the warning: "ttl was not set for updating aggregate".
- Value: Default fallback: 24 hours (86,400,000,000 microseconds). Set based on business logic: typical range 1 minute to 24 hours.
- Trade-off: Shorter TTL = smaller state + faster checkpoints + risk of losing late-arriving data. Longer TTL = larger state + slower checkpoints + better late data handling.
- Signal: If you see checkpoint sizes growing linearly over time, TTL may be too long for your key cardinality.
Reasoning
The 24-hour default was chosen as a safe fallback that accommodates most use cases without immediate data loss. However, for a pipeline grouping by user_id with 1 million unique users, each maintaining aggregate state, 24 hours of accumulated state can become problematic:
- Checkpoint serialization time grows linearly with state size
- Recovery time increases as more Parquet files must be deserialized
- Memory usage on workers grows until TTL-based eviction kicks in
For joins with expiration, the same principle applies: unmatched join sides accumulate in state until TTL expires them. The warning "TTL was not set for join with expiration" indicates the same 24-hour default is in effect.
Code Evidence
TTL fallback with warning from `incremental_aggregator.rs:1043-1048`:
let ttl = Duration::from_micros(if config.ttl_micros == 0 {
warn!("ttl was not set for updating aggregate");
24 * 60 * 60 * 1000 * 1000 // 24 hours in microseconds
} else {
config.ttl_micros
});
Similar pattern for joins from `join_with_expiration.rs:246`:
warn!("TTL was not set for join with expiration");