Principle:Apache Dolphinscheduler RPC Retry And Error Recovery
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Systems, Fault_Tolerance |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
A declarative retry and error recovery pattern for RPC calls that uses annotation-based configuration and automatic channel reconnection to handle transient network failures.
Description
The RPC Retry and Error Recovery principle addresses the inherent unreliability of network communication in distributed systems. DolphinScheduler provides two layers of resilience: (1) annotation-based retry via @RpcMethodRetryStrategy on individual RPC methods, and (2) channel-level recovery in NettyRemotingClient, which automatically recreates a Netty channel when it becomes inactive. Call durations and failures are recorded via ClientSyncDurationMetrics and ClientSyncExceptionMetrics for observability.
When a sendSync() call fails due to a RemoteException (connection failure) or RemoteTimeoutException (response timeout), the retry strategy determines whether and how to retry the call.
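The channel-level recovery layer can be illustrated with a minimal sketch. It does not use Netty and is not the NettyRemotingClient implementation: FakeChannel, ChannelRecoverySketch, and connectCount are hypothetical names, and only getOrCreateChannel and onChannelInactive echo names used on this page. The idea shown is a per-host channel cache that replaces any entry that is missing or no longer active.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class ChannelRecoverySketch {

    // Hypothetical stand-in for a Netty channel; it only tracks liveness.
    static final class FakeChannel {
        final String host;
        private volatile boolean active = true;

        FakeChannel(String host) {
            this.host = host;
        }

        boolean isActive() {
            return active;
        }

        void close() {
            active = false;
        }
    }

    private final Map<String, FakeChannel> channels = new ConcurrentHashMap<>();

    // Counts how many times a new channel had to be created (illustration only).
    final AtomicInteger connectCount = new AtomicInteger();

    // Returns the cached channel for the host, or creates a fresh one
    // when there is no cached channel or the cached one is inactive.
    FakeChannel getOrCreateChannel(String host) {
        return channels.compute(host, (h, existing) -> {
            if (existing != null && existing.isActive()) {
                return existing;
            }
            connectCount.incrementAndGet();
            return new FakeChannel(h);
        });
    }

    // Drops the cached channel so the next call reconnects.
    void onChannelInactive(String host) {
        FakeChannel dead = channels.remove(host);
        if (dead != null) {
            dead.close();
        }
    }

    public static void main(String[] args) {
        ChannelRecoverySketch client = new ChannelRecoverySketch();
        FakeChannel first = client.getOrCreateChannel("worker-1:1234");
        client.onChannelInactive("worker-1:1234");
        FakeChannel second = client.getOrCreateChannel("worker-1:1234");
        System.out.println("recreated: " + (first != second));
        System.out.println("connects: " + client.connectCount.get());
    }
}
```

Doing the liveness check inside compute keeps the check and the replacement atomic per host, so two concurrent callers do not both open a new connection in this sketch.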
Usage
Configure retry behavior on individual RPC methods using @RpcMethod(retry = @RpcMethodRetryStrategy). For methods that should not be retried because they are not idempotent, keep the default settings. For methods that are safe to retry (read operations, idempotent writes), configure an appropriate retry count.
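The per-method configuration style can be sketched with stand-in annotations. SketchRpcMethod and SketchRetryStrategy below are hypothetical and only mimic the shape of @RpcMethod and @RpcMethodRetryStrategy. Their attribute names (maxRetryTimes, retryIntervalMillis) and the no-retry default are assumptions made for this example that follow the text above, so check the real annotations for their actual attributes and defaults. TaskLogService is likewise an invented interface.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

class RetryAnnotationSketch {

    // Hypothetical stand-in for @RpcMethodRetryStrategy; attributes are assumed.
    @Retention(RetentionPolicy.RUNTIME)
    @interface SketchRetryStrategy {
        int maxRetryTimes() default 0;          // assumed default: no retry
        long retryIntervalMillis() default 0L;  // assumed attribute
    }

    // Hypothetical stand-in for @RpcMethod.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface SketchRpcMethod {
        SketchRetryStrategy retry() default @SketchRetryStrategy;
    }

    // Invented service interface used only to carry the annotations.
    interface TaskLogService {
        // Read operation: safe to retry.
        @SketchRpcMethod(retry = @SketchRetryStrategy(maxRetryTimes = 3, retryIntervalMillis = 100))
        String fetchLog(int taskId);

        // Not idempotent: keep the default strategy.
        @SketchRpcMethod
        void killTask(int taskId);
    }

    // Reads the retry count that a client proxy would apply to the method.
    static int maxRetriesFor(Class<?> service, String methodName, Class<?>... paramTypes) {
        try {
            Method method = service.getMethod(methodName, paramTypes);
            SketchRpcMethod rpc = method.getAnnotation(SketchRpcMethod.class);
            return rpc == null ? 0 : rpc.retry().maxRetryTimes();
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("Unknown RPC method: " + methodName, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(maxRetriesFor(TaskLogService.class, "fetchLog", int.class)); // 3
        System.out.println(maxRetriesFor(TaskLogService.class, "killTask", int.class)); // 0
    }
}
```

Keeping the strategy on the method declaration means the retry decision is made where the idempotency of the operation is known, rather than at each call site.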
Theoretical Basis
The retry pattern follows standard distributed systems resilience principles:
- Retry with Backoff: Failed calls are retried according to a configurable per-method strategy
- Circuit Breaking: Channel inactivity triggers cleanup and reconnection
- Observability: Duration and exception metrics enable monitoring
- Idempotency Awareness: Retry configuration is per-method to respect idempotency requirements
// Retry flow in sendSync
for (attempt = 0; attempt <= maxRetries; attempt++):
    try:
        channel = getOrCreateChannel(host)
        response = doSendSync(host, transporter)
        recordDuration(elapsed)
        return response
    catch RemoteException:
        onChannelInactive(host) // clean up dead channel
        recordException(e)
        if (attempt == maxRetries) throw e
        // retry with fresh channel on next iteration
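
The retry flow above can be turned into a small runnable Java sketch. It is an illustration under stated assumptions, not the DolphinScheduler code: the nested RemoteException is a local stand-in for the exception named on this page, Call is a hypothetical functional interface representing one send attempt, the AtomicInteger stands in for the exception metric, and the fixed retry interval is an assumed parameter.

```java
import java.util.concurrent.atomic.AtomicInteger;

class SendSyncRetrySketch {

    // Local stand-in for the RemoteException named in the text.
    static class RemoteException extends RuntimeException {
        RemoteException(String message) {
            super(message);
        }
    }

    // Hypothetical abstraction of a single send attempt.
    interface Call<T> {
        T send(int attempt);
    }

    // Tries the call once, then retries up to maxRetries more times.
    // Each failure is counted; the last failure is rethrown to the caller.
    static <T> T sendSyncWithRetry(Call<T> call, int maxRetries,
                                   long retryIntervalMillis, AtomicInteger exceptionCounter) {
        for (int attempt = 0; ; attempt++) {
            try {
                return call.send(attempt);
            } catch (RemoteException e) {
                exceptionCounter.incrementAndGet(); // stand-in for the exception metric
                if (attempt >= maxRetries) {
                    throw e;
                }
                pause(retryIntervalMillis); // assumed fixed delay between attempts
            }
        }
    }

    private static void pause(long millis) {
        if (millis <= 0) {
            return;
        }
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new RemoteException("interrupted while waiting to retry");
        }
    }

    public static void main(String[] args) {
        AtomicInteger errors = new AtomicInteger();
        // Fails on the first two attempts, succeeds on the third.
        String response = sendSyncWithRetry(attempt -> {
            if (attempt < 2) {
                throw new RemoteException("connection refused");
            }
            return "pong";
        }, 3, 10L, errors);
        System.out.println(response + " after " + errors.get() + " failures");
    }
}
```

With maxRetries set to 0 the loop degenerates to a single attempt, which matches the no-retry case for methods that are not idempotent.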