Heuristic:Langchain ai Langgraph Retry Policy Configuration
| Knowledge Sources | |
|---|---|
| Domains | Reliability, Error_Handling, Agents |
| Last Updated | 2026-02-11 14:00 GMT |
Overview
Per-node retry configuration using exponential backoff with jitter to handle transient failures in LLM calls and external API requests.
Description
LangGraph provides a `RetryPolicy` that can be attached to individual nodes to automatically retry failed operations. The retry mechanism uses exponential backoff with optional random jitter to avoid thundering herd problems. The default policy retries transient failures (connection errors, HTTP 5xx) but explicitly does not retry deterministic errors (`ValueError`, `TypeError`, `SyntaxError`, etc.). Retries operate at the node level, not the graph level: only the failing node is re-run, and its pending writes are cleared before each new attempt.
Usage
Use this heuristic when building agents or workflows that call external APIs or LLM providers, where transient network failures, rate limits, or server errors are expected. It is especially important for production deployments where reliability is critical.
The Insight (Rule of Thumb)
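- Action: Attach a `RetryPolicy` to nodes that make external calls using `add_node("name", func, retry=RetryPolicy(...))`. Recent LangGraph releases appear to name this keyword `retry_policy` and deprecate `retry`, so check the installed version.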
- Default Values: `initial_interval=0.5s`, `backoff_factor=2.0`, `max_interval=128s`, `max_attempts=3`, `jitter=True`.
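- Trade-off: Retries add latency on failure. Because `max_attempts=3` includes the first call, the defaults allow at most two retries, with base waits of 0.5s and 1.0s (about 1.5 seconds in total, plus any jitter) before a final failure.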
- Critical Detail: Node writes are cleared between retry attempts. Only the successful attempt's writes are preserved.
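- Non-retryable Errors: `ValueError`, `TypeError`, `ArithmeticError`, `ImportError`, `LookupError`, `NameError`, `SyntaxError`, `RuntimeError`, `ReferenceError`, `StopIteration`, `StopAsyncIteration`, `OSError` (and their subclasses) are never retried by default. The one exception is `ConnectionError`: it is a subclass of `OSError` but is checked first and therefore retried.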
- Retryable Errors: `ConnectionError`, HTTP 5xx status codes (via `httpx.HTTPStatusError` or `requests.HTTPError`), and any other exception not in the non-retryable list.
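The default values above can be turned into a concrete wait schedule. The following is a minimal sketch, assuming the interval grows geometrically from `initial_interval` by `backoff_factor` and is capped at `max_interval`, as the field descriptions suggest. The helper name `backoff_schedule` is hypothetical and not part of LangGraph, and jitter is left out so the numbers stay deterministic.

```python
def backoff_schedule(
    initial_interval: float = 0.5,
    backoff_factor: float = 2.0,
    max_interval: float = 128.0,
    max_attempts: int = 3,
) -> list[float]:
    """Return the base wait in seconds (before jitter) preceding each retry.

    Assumption: interval = min(max_interval, initial * factor ** n).
    This mirrors the documented fields, not the library source.
    """
    # max_attempts counts the first call, so there are max_attempts - 1 retries.
    retries = max(max_attempts - 1, 0)
    return [
        min(max_interval, initial_interval * backoff_factor**n)
        for n in range(retries)
    ]


# Defaults: two retries, waiting 0.5s and then 1.0s.
defaults = backoff_schedule()
print(defaults, sum(defaults))

# With many attempts the wait stops growing once it reaches max_interval.
long_run = backoff_schedule(max_attempts=12)
print(long_run[-1])
```

Raising `max_attempts` is what lengthens the worst case. `max_interval` only matters once the schedule has enough retries to reach the cap.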
Reasoning
LLM API calls and external service requests frequently experience transient failures: rate limits (HTTP 429), server errors (HTTP 500-503), and network timeouts. Without retry logic, a single transient failure would fail the entire graph execution. The exponential backoff strategy prevents overwhelming the failing service, and jitter prevents multiple concurrent retries from synchronizing (thundering herd). The default 3 attempts, separated by base intervals of 0.5s and 1.0s, strike a balance between resilience and responsiveness. Note that the default filter only matches 5xx codes for `httpx.HTTPStatusError`, so a 429 raised that way is not retried unless a custom `retry_on` is supplied.
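Since rate limits are named as a motivating case, it may help to see what a custom retry predicate could look like. This is a sketch under stated assumptions: `retry_on_transient`, `status_code_of`, `EXTRA_RETRYABLE_STATUS`, `FakeResponse`, and `FakeHTTPError` are all hypothetical names introduced here. The status lookup is duck-typed, so the block runs without `httpx` or `requests` installed. Unlike the default filter, this predicate does not retry unknown exception types.

```python
from typing import Optional

# Hypothetical set of non-5xx status codes to treat as transient.
EXTRA_RETRYABLE_STATUS = {429}


def status_code_of(exc: Exception) -> Optional[int]:
    """Best-effort status lookup for exceptions that carry a `.response`."""
    response = getattr(exc, "response", None)
    code = getattr(response, "status_code", None)
    return code if isinstance(code, int) else None


def retry_on_transient(exc: Exception) -> bool:
    """Retry on 429, any 5xx, and connection errors; nothing else."""
    code = status_code_of(exc)
    if code is not None:
        return code in EXTRA_RETRYABLE_STATUS or 500 <= code < 600
    return isinstance(exc, ConnectionError)


class FakeResponse:
    """Stand-in for an HTTP response object (illustration only)."""

    def __init__(self, status_code: int) -> None:
        self.status_code = status_code


class FakeHTTPError(Exception):
    """Stand-in for an HTTP client error that exposes `.response`."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.response = FakeResponse(status_code)


for code in (429, 503, 404):
    print(code, retry_on_transient(FakeHTTPError(code)))
```

Assuming `retry_on` accepts a callable (the default, `default_retry_on`, is one), the predicate would be passed as `RetryPolicy(retry_on=retry_on_transient)`.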
The decision to clear writes between retries ensures that partial state mutations from a failed attempt do not corrupt the graph state. This is a deliberate design choice documented in the retry implementation.
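The effect of clearing writes can be illustrated with a simplified retry loop. This is a sketch, not LangGraph's implementation: `Task`, `run_with_retry`, and `flaky_node` are hypothetical stand-ins, only `ConnectionError` is retried, and the `sleep` argument defaults to a no-op so the example runs instantly.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Task:
    """Hypothetical stand-in for a node task; not LangGraph's real type."""

    proc: Callable[["Task"], Any]
    writes: list = field(default_factory=list)


def run_with_retry(
    task: Task,
    max_attempts: int = 3,
    sleep: Callable[[float], None] = lambda seconds: None,
) -> Any:
    """Run task.proc, discarding writes from failed attempts."""
    for attempt in range(1, max_attempts + 1):
        # Drop anything a previous, failed attempt recorded.
        task.writes.clear()
        try:
            return task.proc(task)
        except ConnectionError:
            if attempt == max_attempts:
                raise
            # Assumed backoff: 0.5s, 1.0s, ... (jitter omitted).
            sleep(0.5 * 2 ** (attempt - 1))


calls = {"n": 0}


def flaky_node(task: Task) -> str:
    """Records a write, then fails on the first two attempts."""
    calls["n"] += 1
    task.writes.append(("messages", f"attempt {calls['n']}"))
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


task = Task(proc=flaky_node)
result = run_with_retry(task)
print(result, task.writes)
```

Only the write from the third, successful attempt remains. Side effects outside the task's writes (for example, a request already sent to an external service) are not undone by this mechanism, so retried nodes should ideally be idempotent.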
Code Evidence
RetryPolicy defaults from `libs/langgraph/langgraph/types.py:119-138`:
    class RetryPolicy(NamedTuple):
        """Configuration for retrying nodes."""
        initial_interval: float = 0.5
        """Amount of time that must elapse before the first retry occurs. In seconds."""
        backoff_factor: float = 2.0
        """Multiplier by which the interval increases after each retry."""
        max_interval: float = 128.0
        """Maximum amount of time that may elapse between retries. In seconds."""
        max_attempts: int = 3
        """Maximum number of attempts to make before giving up, including the first."""
        jitter: bool = True
        """Whether to add random jitter to the interval between retries."""
        retry_on: (...) = default_retry_on
Non-retryable exception filter from `libs/langgraph/langgraph/_internal/_retry.py:1-29`:
    def default_retry_on(exc: Exception) -> bool:
        if isinstance(exc, ConnectionError):
            return True
        if isinstance(exc, httpx.HTTPStatusError):
            return 500 <= exc.response.status_code < 600
        if isinstance(exc, requests.HTTPError):
            return 500 <= exc.response.status_code < 600 if exc.response else True
        if isinstance(exc, (ValueError, TypeError, ArithmeticError, ImportError,
                            LookupError, NameError, SyntaxError, RuntimeError,
                            ReferenceError, StopIteration, StopAsyncIteration, OSError)):
            return False
        return True
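The order of the `isinstance` checks in this filter matters, because `ConnectionError` is itself a subclass of `OSError`. The sketch below reproduces only the built-in exception branches (the `httpx` and `requests` branches are omitted so it runs on the standard library alone) to show how a few common exceptions would be classified. `classify`, `NON_RETRYABLE`, and `ProviderRateLimitError` are hypothetical names used for illustration.

```python
NON_RETRYABLE = (
    ValueError, TypeError, ArithmeticError, ImportError, LookupError,
    NameError, SyntaxError, RuntimeError, ReferenceError,
    StopIteration, StopAsyncIteration, OSError,
)


def classify(exc: Exception) -> bool:
    """Simplified copy of the filter without the httpx/requests branches."""
    # Checked first, so ConnectionError escapes the OSError exclusion below.
    if isinstance(exc, ConnectionError):
        return True
    if isinstance(exc, NON_RETRYABLE):
        return False
    # Anything unrecognized is treated as retryable.
    return True


class ProviderRateLimitError(Exception):
    """Hypothetical SDK-specific exception, used only for illustration."""


results = {
    "ConnectionResetError": classify(ConnectionResetError()),
    "FileNotFoundError": classify(FileNotFoundError()),
    "TimeoutError": classify(TimeoutError()),
    "KeyError": classify(KeyError("k")),
    "ProviderRateLimitError": classify(ProviderRateLimitError()),
}
for name, retryable in results.items():
    print(f"{name}: {'retry' if retryable else 'fail fast'}")
```

Two consequences are easy to miss. The built-in `TimeoutError` is an `OSError` subclass and so falls into the non-retryable group. Custom exception types that inherit directly from `Exception` are retried by default.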
Write clearing on retry from `libs/langgraph/langgraph/pregel/_retry.py:39-42`:
    # clear any writes from previous attempts
    task.writes.clear()
    # run the task
    return task.proc.invoke(task.input, config)