Heuristic:Dagster io Dagster Retry Strategy Configuration
| Knowledge Sources | |
|---|---|
| Domains | Execution, Reliability |
| Last Updated | 2026-02-10 12:00 GMT |
Overview
Dagster's retry system has three modes (ENABLED, DISABLED, DEFERRED) and supports automatic re-execution and tag-based retry control.
Description
Dagster implements a retry system with three distinct modes rather than a simple on/off toggle. The ENABLED mode directly re-enqueues failed steps. The DISABLED mode provides no retries. The DEFERRED mode is used internally by orchestrator engines (multiprocess, step-delegating) where retries are managed by the engine itself rather than the step executor. Understanding this three-mode system is critical for correctly configuring retry behavior in production.
Usage
Use this heuristic when configuring retry behavior for production pipelines, especially when encountering unexpected retry behavior or when retries are not working as expected with multiprocess execution. It is also critical when using the dagster/max_retries and dagster/retry_on_asset_or_op_failure tags.
The Insight (Rule of Thumb)
- Action: Understand the three retry modes when configuring retries. ENABLED is the default; DEFERRED is automatically set for inner plan execution.
- Value: Set the dagster/max_retries tag on runs to control the retry count. The total retry count includes all runs in the group (including manual re-executions).
- Trade-off: Setting retry_on_asset_or_op_failure=false prevents retries for step-level (asset or op) failures even when max_retries > 0. The override does not raise an error; it only logs a warning.
- Key insight: In multiprocess/step-delegating executors, ENABLED mode is automatically converted to DEFERRED for inner plan execution. This means retries are handled by the engine, not the step.
Reasoning
The three-mode retry system exists because Dagster supports multiple executor types with different retry semantics:
- In-process executor: Steps retry immediately in the same process (ENABLED).
- Multiprocess executor: Steps must be re-enqueued by the engine, not the step itself. Using ENABLED would cause double-retry behavior, so it is converted to DEFERRED.
- Step-delegating executor (K8s, Docker): Similar to multiprocess, retries must be managed at the engine level to correctly allocate new pods/containers.
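The double-retry concern above can be illustrated with a toy model. This is only a sketch under stated assumptions: `retries_scheduled` is a hypothetical helper, not part of Dagster, and it simply counts how many re-executions would be scheduled after one step failure if both the inner plan and the engine acted on it.

```python
from enum import Enum


class RetryMode(Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"
    DEFERRED = "deferred"


def retries_scheduled(inner_mode: RetryMode, engine_retries: bool) -> int:
    """Hypothetical toy model: count re-executions scheduled after one step failure.

    inner_mode     -- retry mode seen by the inner plan (the step executor)
    engine_retries -- True when an orchestrating engine also re-enqueues failed steps
    """
    scheduled = 0
    if inner_mode is RetryMode.ENABLED:
        # The inner plan re-enqueues the failed step itself.
        scheduled += 1
    if engine_retries and inner_mode is not RetryMode.DISABLED:
        # The engine schedules its own retry (for example, a new process or pod).
        scheduled += 1
    return scheduled


in_process = retries_scheduled(RetryMode.ENABLED, engine_retries=False)
unconverted = retries_scheduled(RetryMode.ENABLED, engine_retries=True)
converted = retries_scheduled(RetryMode.DEFERRED, engine_retries=True)
```

In this model the in-process case and the converted (DEFERRED) case each schedule a single retry, while leaving ENABLED in place under an orchestrating engine schedules two.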
The dagster/max_retries tag counts all runs in the run group (not just automatic retries). This means if a user manually re-executes a failed run, that counts toward the max retries limit.
The retry_on_asset_or_op_failure flag provides a second layer of control: even if max_retries is set, setting this to false will suppress retries for step-level failures (only system-level failures will trigger retries).
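The interaction of the two controls can be sketched as a small decision function. This is an illustrative model only, not Dagster's implementation: `should_auto_retry` is a hypothetical name, the tag values are assumed to be strings, a missing dagster/max_retries tag is treated as 0 for simplicity, and the number of retries already used is assumed to be the run group size minus the original run.

```python
def should_auto_retry(tags: dict, runs_in_group: int, step_failure: bool) -> bool:
    """Hypothetical sketch of the retry decision described above.

    tags          -- run tags, e.g. {"dagster/max_retries": "3"}
    runs_in_group -- all runs in the run group, including manual re-executions
    step_failure  -- True for an asset/op failure, False for a system-level failure
    """
    # Assumption: an absent tag means no automatic retries.
    max_retries = int(tags.get("dagster/max_retries", 0))
    retry_on_step_failure = (
        str(tags.get("dagster/retry_on_asset_or_op_failure", "true")).lower() != "false"
    )

    # Every run after the original counts against the limit, manual or automatic.
    retries_used = runs_in_group - 1
    if retries_used >= max_retries:
        return False

    # Second layer of control: step-level failures are suppressed when the flag is false.
    if step_failure and not retry_on_step_failure:
        return False
    return True


tags = {"dagster/max_retries": "2"}
guarded = {"dagster/max_retries": "2", "dagster/retry_on_asset_or_op_failure": "false"}
```

With `tags`, a group of one or two runs is still eligible for a retry and a group of three is not. With `guarded`, a step failure is never retried, but a system-level failure still is.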
Code Evidence
Three retry modes from retries.py:31-54:
class RetryMode(Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"
    # Designed for use of inner plan execution within "orchestrator"
    # engine such as multiprocess, up_for_retry steps are not directly
    # re-enqueued, deferring that to the engine.
    DEFERRED = "deferred"
Auto-conversion for inner plan execution from retries.py:56-60:
def for_inner_plan(self) -> "RetryMode":
    if self.disabled or self.deferred:
        return self
    elif self.enabled:
        return RetryMode.DEFERRED
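To see the conversion end to end, the two excerpts can be combined into a standalone sketch. The `enabled`, `disabled`, and `deferred` properties are assumed here to be simple identity checks, since the excerpts use them without showing their definitions; the actual source may differ.

```python
from enum import Enum


class RetryMode(Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"
    DEFERRED = "deferred"

    # Assumed helper properties; the excerpts above reference them but do not define them.
    @property
    def enabled(self) -> bool:
        return self is RetryMode.ENABLED

    @property
    def disabled(self) -> bool:
        return self is RetryMode.DISABLED

    @property
    def deferred(self) -> bool:
        return self is RetryMode.DEFERRED

    def for_inner_plan(self) -> "RetryMode":
        # DISABLED and DEFERRED pass through unchanged; ENABLED is handed to the engine.
        if self.disabled or self.deferred:
            return self
        return RetryMode.DEFERRED


# Map every mode to the mode its inner plan would run with.
inner_modes = {mode.name: mode.for_inner_plan().name for mode in RetryMode}
```

Under these assumptions only ENABLED changes: it maps to DEFERRED, while DISABLED and DEFERRED map to themselves.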