Heuristic:MaterializeInc Materialize CI Retry Strategies

From Leeroopedia




Knowledge Sources
Domains: CI_CD, Reliability
Last Updated: 2026-02-08 21:00 GMT

Overview

Automatic retry configuration for Buildkite CI steps that handles transient infrastructure failures: lost agent connections, agents stopped by the OS, temporary GitHub connection issues, and Rust compiler crashes.

Description

The Materialize CI pipeline automatically retries steps that fail because of infrastructure issues rather than code bugs. The function `set_retry_on_agent_lost()` in `ci/mkpipeline.py` injects retry rules into every pipeline step except `trigger`, `wait`, `group`, and `block` steps. The bootstrap step in `ci/mkpipeline.sh` has its own, smaller retry configuration. Together these retries target four specific failure modes that are known to be transient.

Usage

Apply this heuristic when configuring or debugging CI pipeline reliability. Understanding these retry strategies is essential when investigating why a failed build was automatically retried, or when adding new step types that need similar resilience. It applies to all Buildkite pipeline steps generated by Mkpipeline_Main and Trim_Tests_Pipeline.

The Insight (Rule of Thumb)

  • Action: Configure automatic retries for four specific transient failure modes on every Buildkite step other than `trigger`, `wait`, `group`, and `block` steps.
  • Value: Each failure mode gets up to 2 automatic retries:
    • `exit_status: -1` with `signal_reason: none` — Agent connection lost
    • `signal_reason: agent_stop` — Agent stopped by OS (e.g., preemption)
    • `exit_status: 128` — Temporary GitHub connection issue during git clone/fetch
    • `exit_status: 199` — Rust Internal Compiler Error (ICE)
  • Trade-off: Retries add latency when the failure is not transient, but the 2-retry limit bounds this cost.
  • Additional: All steps also get `permit_on_passed: true` in their manual retry settings, which allows a step that passed to be rerun manually (useful for debugging flaky tests).
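
The way these rules select failures can be sketched in a few lines of Python. This is a simplified model, not Buildkite's implementation: it assumes that a key missing from a rule matches any value and that each rule's `limit` caps the number of retries. The names `matching_rule` and `should_retry` are hypothetical and exist only for this illustration.

```python
from typing import Any, Optional

# The four automatic retry rules listed above, as plain dictionaries.
RETRY_RULES: list[dict[str, Any]] = [
    {"exit_status": -1, "signal_reason": "none", "limit": 2},
    {"signal_reason": "agent_stop", "limit": 2},
    {"exit_status": 128, "limit": 2},
    {"exit_status": 199, "limit": 2},  # Rust ICE
]


def matching_rule(exit_status: int, signal_reason: str) -> Optional[dict[str, Any]]:
    """Return the first rule that matches the failure, or None.

    Assumption: a key that a rule does not specify matches any value.
    """
    for rule in RETRY_RULES:
        if rule.get("exit_status", exit_status) != exit_status:
            continue
        if rule.get("signal_reason", signal_reason) != signal_reason:
            continue
        return rule
    return None


def should_retry(exit_status: int, signal_reason: str, retries_so_far: int) -> bool:
    """Hypothetical helper: retry only while the matching rule's limit allows it."""
    rule = matching_rule(exit_status, signal_reason)
    return rule is not None and retries_so_far < rule["limit"]
```

Under this model an ordinary test failure (for example exit status 1) matches no rule and is never retried automatically, while a Git failure with exit status 128 is retried until two retries have been used.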

Reasoning

These four failure modes are well-characterized transient issues:

  1. Agent connection lost (-1): Buildkite agents on cloud VMs occasionally lose connectivity because of network hiccups, VM migrations, or spot instance interruptions. Retrying is safe as long as the build step is idempotent, which these steps are expected to be.
  2. Agent stop: The host OS may stop the agent process during maintenance, scaling events, or resource pressure. The step did not run to completion and should be retried.
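  3. Exit status 128: Git operations (clone, fetch, checkout) fail with exit code 128 when GitHub has intermittent connectivity issues. These failures are typically transient. Exit code 128 is also Git's generic fatal-error status, so a non-transient Git failure would be retried as well, but the 2-retry limit bounds that cost.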
  4. Exit status 199 (Rust ICE): A known Rust compiler bug (rust-lang/rust#148581) causes nondeterministic panics in `rustc_metadata`. The `run_and_detect_rust_ice()` function in `mzbuild.py` detects the specific panic message and raises `RustICE`, and the build then exits with code 199 so that the retry rule can target exactly this failure.

The bootstrap step in `mkpipeline.sh` also configures retries for `-1` and `agent_stop` because the pipeline generation itself is a critical first step.

Code Evidence

Retry injection from `mkpipeline.py:529-553`:

def set_retry_on_agent_lost(pipeline: Any) -> None:
    for step in steps(pipeline):
        if "trigger" in step or "wait" in step or "group" in step or "block" in step:
            continue
        step.setdefault("retry", {}).setdefault("automatic", []).extend(
            [
                {"exit_status": -1, "signal_reason": "none", "limit": 2},
                {"signal_reason": "agent_stop", "limit": 2},
                {"exit_status": 128, "limit": 2},
                {"exit_status": 199, "limit": 2},  # Rust ICE
            ]
        )
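
To see what the injection does to a concrete pipeline, the excerpt can be run against a small hand-written pipeline dictionary. The sketch below is illustrative only: the `steps` helper is a hypothetical stand-in for the repository's own helper (which is not shown above), and the sample pipeline is invented for the example.

```python
from typing import Any, Iterator


def steps(pipeline: Any) -> Iterator[dict[str, Any]]:
    """Hypothetical stand-in for the repository's helper: yield every step,
    including the steps nested inside a group."""
    for step in pipeline["steps"]:
        yield step
        if "group" in step:
            yield from step.get("steps", [])


def set_retry_on_agent_lost(pipeline: Any) -> None:
    for step in steps(pipeline):
        # Skip step types that are not handled by the retry injection.
        if "trigger" in step or "wait" in step or "group" in step or "block" in step:
            continue
        step.setdefault("retry", {}).setdefault("automatic", []).extend(
            [
                {"exit_status": -1, "signal_reason": "none", "limit": 2},
                {"signal_reason": "agent_stop", "limit": 2},
                {"exit_status": 128, "limit": 2},
                {"exit_status": 199, "limit": 2},  # Rust ICE
            ]
        )


# Invented sample pipeline: one step with an existing rule, a wait step,
# and a group containing one nested step.
pipeline = {
    "steps": [
        {"command": "cargo test", "retry": {"automatic": [{"exit_status": 1, "limit": 1}]}},
        {"wait": None},
        {"group": "tests", "steps": [{"command": "pytest"}]},
    ]
}

set_retry_on_agent_lost(pipeline)
for step in steps(pipeline):
    print(step)
```

Because the function uses `setdefault` followed by `extend`, a step's existing automatic retry rules are kept and the four new rules are appended after them. The same property means that calling the function twice on one pipeline would append the rules twice.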

Rust ICE detection from `mzbuild.py:60-132`:

def run_and_detect_rust_ice(cmd, cwd):
    # ... monitors stdout/stderr for specific panic message
    panic_msg = "panicked at compiler/rustc_metadata/src/rmeta/def_path_hash_map.rs"
    if panic_msg in stdout_contents or panic_msg in stderr_contents:
        raise RustICE()
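
The excerpt does not show how a detected ICE becomes exit status 199. One way this could work is sketched below under stated assumptions: the runner captures output after the command finishes (the real function monitors the streams), and the names `run_and_check`, `main`, and `RUST_ICE_EXIT_CODE` are hypothetical. Only the panic message and the `RustICE` exception name come from the excerpt.

```python
import subprocess
import sys

# Panic message quoted in the excerpt above.
PANIC_MSG = "panicked at compiler/rustc_metadata/src/rmeta/def_path_hash_map.rs"
RUST_ICE_EXIT_CODE = 199  # exit status matched by the automatic retry rule


class RustICE(Exception):
    """Raised when the known compiler panic appears in the command output."""


def run_and_check(cmd: list[str]) -> None:
    """Hypothetical simplified runner: capture the output, echo it, inspect it."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    sys.stdout.write(result.stdout)
    sys.stderr.write(result.stderr)
    if PANIC_MSG in result.stdout or PANIC_MSG in result.stderr:
        raise RustICE()
    result.check_returncode()  # raises CalledProcessError for other failures


def main(cmd: list[str]) -> int:
    """Return the exit status the CI step should report."""
    try:
        run_and_check(cmd)
    except RustICE:
        return RUST_ICE_EXIT_CODE  # distinct code, so only ICEs are retried
    except subprocess.CalledProcessError as e:
        return e.returncode  # ordinary failures keep their own status
    return 0
```

Mapping the ICE to a dedicated exit status keeps the retry narrowly targeted: a genuine compile error keeps its original status and is not retried by the `exit_status: 199` rule.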

Bootstrap retries from `mkpipeline.sh:67-73`:

    retry:
      automatic:
        - exit_status: -1
          signal_reason: none
          limit: 2
        - signal_reason: agent_stop
          limit: 2
