Heuristic:Risingwavelabs Risingwave Source Backoff Strategy
| Knowledge Sources | |
|---|---|
| Domains | Streaming, Reliability, Error_Handling |
| Last Updated | 2026-02-09 08:00 GMT |
Overview
Source executors use exponential backoff (1s base, 2x factor, 10s max) for transient failures and wait 5x the barrier interval before timing out on barrier delivery from the meta service.
Description
RisingWave source executors (Kafka, CDC, etc.) implement two distinct retry/wait strategies. For transient source failures (connection drops, timeout errors), an exponential backoff strategy starts at 1 second, doubles on each retry, and caps at 10 seconds. For barrier synchronization with the meta service, sources wait up to 5x the configured barrier interval before timing out. These values were chosen to handle common failure scenarios: brief network glitches (1-2s recovery), longer outages (capped at 10s to avoid excessive idle time), and meta service latency spikes (5x multiplier accounts for GC pauses, network congestion, etc.).
Usage
Apply this heuristic when debugging source connector reconnection behavior, tuning barrier intervals for high-latency environments, or diagnosing checkpoint timeouts. Understanding these timings helps explain why sources take 1-10 seconds to recover from transient failures and why checkpoint intervals affect source reliability.
The Insight (Rule of Thumb)
- Source Failure Backoff: Base delay: 1s, factor: 2x, max: 10s.
- Barrier Wait Timeout: 5x the barrier interval (
WAIT_BARRIER_MULTIPLE_TIMES = 5). - Storage Request Retry: 3 attempts, 1-10s backoff.
- Connection Pool: Meta connections: 1-10 pool, 30s idle timeout, 30s acquire timeout.
- Trade-off: Short base delay (1s) enables fast recovery from brief glitches. The 10s cap prevents excessive wait times during extended outages. The 5x barrier multiplier is generous to avoid false timeouts during normal operation.
Reasoning
Streaming sources must handle transient failures gracefully without overwhelming downstream systems with reconnection attempts. The 1s base delay is short enough that brief network blips (common in cloud environments) resolve quickly. The 2x exponential factor prevents thundering herd problems when multiple sources reconnect simultaneously. The 10s cap ensures that even during extended outages, the system does not wait excessively before retrying.
The barrier wait multiplier of 5x is intentionally generous. In a streaming system, barriers are critical for checkpoint consistency. A source that times out on barrier delivery will disrupt the entire checkpoint, potentially causing cascading recovery. By waiting 5x the interval, the system tolerates significant meta service latency without false timeouts, while still detecting genuine failures within a reasonable window.
Code evidence from src/stream/src/executor/source/mod.rs lines 213-215:
const BASE_DELAY: Duration = Duration::from_secs(1);
const BACKOFF_FACTOR: u64 = 2;
const MAX_DELAY: Duration = Duration::from_secs(10);
Code evidence from src/stream/src/executor/source/source_executor.rs line 56:
pub const WAIT_BARRIER_MULTIPLE_TIMES: u128 = 5;
Code evidence from src/common/src/config/storage.rs lines 1228-1230:
const DEFAULT_REQ_BACKOFF_INTERVAL_MS: u64 = 1000;
const DEFAULT_REQ_BACKOFF_MAX_DELAY_MS: u64 = 10 * 1000;
const DEFAULT_REQ_MAX_RETRY_ATTEMPTS: usize = 3;