Heuristic:Dagster io Dagster Retry Strategy Configuration

From Leeroopedia



Knowledge Sources
Domains: Execution, Reliability
Last Updated: 2026-02-10 12:00 GMT

Overview

Dagster's retry system has three modes (ENABLED, DISABLED, DEFERRED) and supports automatic re-execution of failed runs, controlled through run tags.

Description

Dagster implements a retry system with three distinct modes rather than a simple on/off toggle. The ENABLED mode directly re-enqueues failed steps. The DISABLED mode provides no retries. The DEFERRED mode is used internally by orchestrator engines (multiprocess, step-delegating) where retries are managed by the engine itself rather than the step executor. Understanding this three-mode system is critical for correctly configuring retry behavior in production.

Usage

Use this heuristic when configuring retry behavior for production pipelines, especially when retries do not fire as expected under multiprocess execution. It also applies when using the dagster/max_retries and dagster/retry_on_asset_or_op_failure tags.

The Insight (Rule of Thumb)

  • Action: Understand the three retry modes when configuring retries. ENABLED is the default; DEFERRED is automatically set for inner plan execution.
  • Value: Set the dagster/max_retries tag on runs to control the retry count. The count covers all runs in the run group, including manual re-executions.
  • Trade-off: Setting dagster/retry_on_asset_or_op_failure to false prevents retries on step failures even when max_retries > 0. The override does not raise an error; it only logs a warning, so it is easy to miss.
  • Key insight: In multiprocess/step-delegating executors, ENABLED mode is automatically converted to DEFERRED for inner plan execution. This means retries are handled by the engine, not the step.

Reasoning

The three-mode retry system exists because Dagster supports multiple executor types with different retry semantics:

  • In-process executor: Steps retry immediately in the same process (ENABLED).
  • Multiprocess executor: Steps must be re-enqueued by the engine, not the step itself. Using ENABLED would cause double-retry behavior, so it is converted to DEFERRED.
  • Step-delegating executor (K8s, Docker): Similar to multiprocess, retries must be managed at the engine level to correctly allocate new pods/containers.

The dagster/max_retries tag counts all runs in the run group (not just automatic retries). This means if a user manually re-executes a failed run, that counts toward the max retries limit.

The dagster/retry_on_asset_or_op_failure tag provides a second layer of control: even if max_retries is set, setting this tag to false suppresses retries for step-level failures, so only system-level failures trigger retries.
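The interaction between the two tags can be sketched as a small decision function. This is a simplified model, not Dagster's implementation: the function name should_auto_retry, the FailureReason enum, and the runs_in_group parameter are hypothetical, and only the two tag names come from this page. It also assumes that the original run does not count as a retry, so the number of retries used is the group size minus one.

```python
from enum import Enum


class FailureReason(Enum):
    """Hypothetical failure categories, for illustration only."""
    STEP_FAILURE = "step_failure"      # an asset or op raised an error
    SYSTEM_FAILURE = "system_failure"  # e.g. the run worker crashed


def should_auto_retry(tags: dict, runs_in_group: int, reason: FailureReason) -> bool:
    """Simplified model of the tag-based retry decision (not Dagster's code)."""
    max_retries = int(tags.get("dagster/max_retries", 0))

    # Assumption: the first run in the group is the original, so retries
    # used so far is the group size minus one. Manual re-executions are
    # part of the group and therefore consume the same budget.
    retries_used = runs_in_group - 1
    if retries_used >= max_retries:
        return False

    # The second tag defaults to "retry on step failure" unless set to false.
    raw_flag = str(tags.get("dagster/retry_on_asset_or_op_failure", "true"))
    retry_on_step_failure = raw_flag.lower() != "false"
    if reason is FailureReason.STEP_FAILURE and not retry_on_step_failure:
        return False

    return True
```

Under this model, a run group that already contains one manual re-execution has used one retry of its budget, and a step failure is never retried once the second tag is set to false, regardless of the remaining budget.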

Code Evidence

Three retry modes from retries.py:31-54:

class RetryMode(Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"
    # Designed for use of inner plan execution within "orchestrator"
    # engine such as multiprocess, up_for_retry steps are not directly
    # re-enqueued, deferring that to the engine.
    DEFERRED = "deferred"

Auto-conversion for inner plan execution from retries.py:56-60:

def for_inner_plan(self) -> "RetryMode":
    if self.disabled or self.deferred:
        return self
    elif self.enabled:
        return RetryMode.DEFERRED
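
The excerpts above do not show the helper properties that for_inner_plan relies on. The following is a self-contained sketch that assumes enabled, disabled, and deferred are simple identity checks on the enum member (an assumption, since the excerpts omit them) and uses a plain ValueError as a stand-in for the unreachable fall-through case.

```python
from enum import Enum


class RetryMode(Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"
    DEFERRED = "deferred"

    # Assumed helper properties; not shown in the excerpts above.
    @property
    def enabled(self) -> bool:
        return self is RetryMode.ENABLED

    @property
    def disabled(self) -> bool:
        return self is RetryMode.DISABLED

    @property
    def deferred(self) -> bool:
        return self is RetryMode.DEFERRED

    def for_inner_plan(self) -> "RetryMode":
        # DISABLED and DEFERRED pass through unchanged.
        if self.disabled or self.deferred:
            return self
        # ENABLED becomes DEFERRED so the engine, not the step, re-enqueues.
        if self.enabled:
            return RetryMode.DEFERRED
        # Stand-in for the fall-through case; unreachable with three members.
        raise ValueError(f"Unexpected retry mode: {self!r}")
```

Calling for_inner_plan() on ENABLED yields DEFERRED, while DISABLED and DEFERRED are returned as-is, which matches the conversion described in the Key insight above.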
