Principle:Apache Dolphinscheduler RPC Retry And Error Recovery

From Leeroopedia


Knowledge Sources
Domains Distributed_Systems, Fault_Tolerance
Last Updated 2026-02-10 00:00 GMT

Overview

A declarative retry and error recovery pattern for RPC calls that uses annotation-based configuration and automatic channel reconnection to handle transient network failures.

Description

The RPC Retry and Error Recovery principle addresses the inherent unreliability of network communication in distributed systems. DolphinScheduler provides two layers of resilience: (1) annotation-based retry, configured with @RpcMethodRetryStrategy on individual RPC methods, and (2) channel-level recovery in NettyRemotingClient, which automatically recreates a Netty channel when it becomes inactive. For observability, call durations and failures are recorded via ClientSyncDurationMetrics and ClientSyncExceptionMetrics.
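The channel-level recovery layer can be pictured with a small cache that hands out an active channel per host and replaces any channel that has gone inactive. This is a minimal sketch, not NettyRemotingClient's code: SketchChannel, SketchFakeChannel, and SketchChannelCache are hypothetical names, and the connector function stands in for whatever actually opens a connection.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Minimal stand-in for a network channel (hypothetical, not Netty's Channel).
interface SketchChannel {
    boolean isActive();
    void close();
}

// In-memory channel used only to exercise the cache in this sketch.
final class SketchFakeChannel implements SketchChannel {
    private volatile boolean active = true;

    @Override
    public boolean isActive() {
        return active;
    }

    @Override
    public void close() {
        active = false;
    }
}

// Hypothetical per-host channel cache that recreates inactive channels.
final class SketchChannelCache {
    private final ConcurrentMap<String, SketchChannel> channels = new ConcurrentHashMap<>();
    private final Function<String, SketchChannel> connector;

    SketchChannelCache(Function<String, SketchChannel> connector) {
        this.connector = connector;
    }

    // Returns the cached channel if it is still active; otherwise closes it
    // and connects again. compute() keeps the check-and-replace atomic per host.
    SketchChannel getOrCreate(String host) {
        return channels.compute(host, (h, existing) -> {
            if (existing != null && existing.isActive()) {
                return existing;
            }
            if (existing != null) {
                existing.close();
            }
            return connector.apply(h);
        });
    }

    // Drops a channel that is known to be dead so the next call reconnects.
    void evict(String host) {
        SketchChannel stale = channels.remove(host);
        if (stale != null) {
            stale.close();
        }
    }
}
```

Connecting inside compute() keeps the sketch short but blocks other callers for the same host while the connection is opened; a production client would likely handle that differently.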

When a sendSync() call fails due to a RemoteException (connection failure) or RemoteTimeoutException (response timeout), the retry strategy determines whether and how to retry the call.

Usage

Configure retry behavior on individual RPC methods using @RpcMethod(retry = @RpcMethodRetryStrategy). For methods that should not be retried because of idempotency concerns, keep the default settings. For methods that are safe to retry (read operations, idempotent writes), configure an appropriate retry count.
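To illustrate how per-method retry settings can be declared and read back at runtime, here is a minimal sketch using plain Java annotations and reflection. Everything in it is hypothetical: SketchRetry, its maxRetries attribute and default of 0, SketchLogService, and SketchRetryResolver are illustration names and do not reproduce the attributes of DolphinScheduler's @RpcMethod or @RpcMethodRetryStrategy.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Hypothetical annotation for illustration only.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface SketchRetry {
    int maxRetries() default 0; // assumed default for this sketch: no retry
}

// Hypothetical RPC interface with per-method retry settings.
interface SketchLogService {
    @SketchRetry(maxRetries = 3) // read-only call, safe to repeat
    String queryLog(long taskId);

    @SketchRetry // not idempotent: keep the no-retry default
    void triggerWorkflow(long workflowId);

    void ping(); // no annotation at all: treated as no retry
}

final class SketchRetryResolver {
    private SketchRetryResolver() {
    }

    // Looks up the retry count declared on a method, or 0 if none is declared.
    static int maxRetriesFor(Class<?> iface, String methodName, Class<?>... paramTypes) {
        try {
            Method method = iface.getMethod(methodName, paramTypes);
            SketchRetry retry = method.getAnnotation(SketchRetry.class);
            return retry == null ? 0 : retry.maxRetries();
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("Unknown method: " + methodName, e);
        }
    }
}
```

A client proxy could call SketchRetryResolver.maxRetriesFor(SketchLogService.class, "queryLog", long.class) once per method and pass the result to its retry loop, which keeps the retry decision next to the method declaration instead of scattered across call sites.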

Theoretical Basis

The retry pattern follows standard distributed systems resilience principles:

  • Retry with Backoff: Failed calls can be retried according to a configurable per-method strategy
  • Circuit Breaking (channel-level): An inactive channel triggers cleanup, and a fresh channel is created for the next call
  • Observability: Duration and exception metrics make failures and slow calls visible to monitoring
  • Idempotency Awareness: Retry is configured per method, so non-idempotent calls are not repeated blindly
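
// Retry flow in sendSync (simplified pseudocode)
for (attempt = 0; attempt <= maxRetries; attempt++):
    start = now()
    try:
        getOrCreateChannel(host)            // ensure an active channel exists for this host
        response = doSendSync(host, transporter)
        recordDuration(now() - start)       // duration metric for the successful call
        return response
    catch RemoteException as e:
        onChannelInactive(host)             // clean up the dead channel
        recordException(e)                  // exception metric
        if (attempt == maxRetries) throw e  // retries exhausted: surface the failure
        // otherwise retry with a fresh channel on the next iteration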
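
The retry flow above can also be written as a small runnable Java sketch. The version below adds exponential backoff between attempts, which is an assumption of this example and not something the pseudocode specifies. SketchRemoteException and SketchRetryExecutor are hypothetical names; the exception is a stand-in for a transient RPC failure, not DolphinScheduler's RemoteException.

```java
import java.util.function.Supplier;

// Stand-in for a transient RPC failure (hypothetical type).
class SketchRemoteException extends RuntimeException {
    SketchRemoteException(String message) {
        super(message);
    }
}

final class SketchRetryExecutor {
    private static final long MAX_DELAY_MILLIS = 5_000L; // assumed backoff cap

    private SketchRetryExecutor() {
    }

    // Runs the call at most maxRetries + 1 times. After each transient failure
    // it sleeps, doubling the delay up to the cap, then tries again. The last
    // failure is rethrown once the retries are exhausted.
    static <T> T callWithRetry(Supplier<T> call, int maxRetries, long baseDelayMillis) {
        if (maxRetries < 0 || baseDelayMillis < 0) {
            throw new IllegalArgumentException("maxRetries and baseDelayMillis must be >= 0");
        }
        long delay = baseDelayMillis;
        for (int attempt = 0; ; attempt++) {
            try {
                return call.get();
            } catch (SketchRemoteException e) {
                if (attempt >= maxRetries) {
                    throw e;
                }
                sleep(delay);
                delay = Math.min(delay * 2, MAX_DELAY_MILLIS);
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            // Restore the interrupt flag and stop retrying.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting to retry", ie);
        }
    }
}
```

Only the stand-in exception type is retried; any other exception propagates immediately, which mirrors the idea that retries should be limited to failures known to be transient.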

Related Pages

Implemented By
