Heuristic:MaterializeInc Materialize CI Agent Prioritization

Knowledge Sources	MaterializeInc/materialize
Domains	CI_CD, Optimization
Last Updated	2026-02-08 21:00 GMT

Overview

Priority-based scheduling and agent queue escalation heuristics for Buildkite CI, including automatic fallback from Hetzner to AWS agents when queues are stuck.

Description

The Materialize CI pipeline uses a multi-tiered priority system and automatic agent queue switching to optimize build throughput and reliability. The `prioritize_pipeline()` function assigns priorities based on build context (release, PR, main, Dependabot), while `switch_jobs_to_aws()` monitors Hetzner agent queues for stuck jobs and automatically reroutes them to AWS or x86_64 equivalents. Additionally, `increase_agents_timeouts()` escalates agent sizes and multiplies timeouts for sanitizer and coverage builds.

Usage

Apply this heuristic when debugging why a CI job ran on an unexpected agent queue, understanding build priority ordering, or configuring new pipeline steps. It is essential for understanding the Mkpipeline_Main implementation's agent assignment logic.

The Insight (Rule of Thumb)

Priority Rules:

Action: Assign numeric priorities to pipeline steps based on context.
Value:
- Release tags (`v*`): +10 priority (time-sensitive)
- Main branch: -50 priority (less urgent than PRs)
- Dependabot PRs: -40 priority (less urgent than manual PRs)
- Larger Hetzner agents: +1 to +2 bonus (preferential treatment on shared queues)
Trade-off: PRs from developers get faster feedback at the cost of slower main-branch and Dependabot builds.

Agent Queue Escalation (sanitizer/coverage builds):

Action: Bump each agent to the next larger size for sanitizer and random-parameter builds.
Value: Timeouts are multiplied by 10 for sanitizer builds and by 3 for coverage builds. Each agent moves up one size tier (e.g., `linux-aarch64-small` → `linux-aarch64` → `linux-aarch64-medium`).
Trade-off: Higher infrastructure cost for more memory and CPU.
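The escalation rule can be illustrated with a minimal sketch. It is not the body of `increase_agents_timeouts()`: the helper name `escalate_step`, the `build_kind` argument, and the three-entry ladder (taken only from the example above) are assumptions made for illustration. The `agents.queue` and `timeout_in_minutes` keys follow the usual Buildkite step layout.

```python
from copy import deepcopy

# Size ladder copied from the example above; the real pipeline likely has more tiers.
AARCH64_LADDER = ["linux-aarch64-small", "linux-aarch64", "linux-aarch64-medium"]

# Timeout multipliers as stated in the rule of thumb.
TIMEOUT_FACTOR = {"sanitizer": 10, "coverage": 3}


def escalate_step(step: dict, build_kind: str) -> dict:
    """Return a copy of `step` with a one-tier-larger agent and a longer timeout.

    Hypothetical helper: the name and signature are illustrative only.
    """
    result = deepcopy(step)
    queue = result.get("agents", {}).get("queue")
    if queue in AARCH64_LADDER:
        # Move one tier up; stay on the largest tier if already there.
        index = min(AARCH64_LADDER.index(queue) + 1, len(AARCH64_LADDER) - 1)
        result["agents"]["queue"] = AARCH64_LADDER[index]
    if "timeout_in_minutes" in result:
        # Unknown build kinds keep their original timeout.
        result["timeout_in_minutes"] *= TIMEOUT_FACTOR.get(build_kind, 1)
    return result


step = {"agents": {"queue": "linux-aarch64-small"}, "timeout_in_minutes": 30}
sanitizer_step = escalate_step(step, "sanitizer")
coverage_step = escalate_step(step, "coverage")
```

With a 30-minute base timeout, the sanitizer variant gets 300 minutes and the coverage variant 90 minutes, and both move from `linux-aarch64-small` to `linux-aarch64`.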

Hetzner Failover:

Action: Detect stuck Hetzner queues (a job has been runnable for 20 minutes or more without starting) and switch their jobs to AWS or another architecture.
Value: Known aarch64 availability issues on Hetzner are automatically mitigated.
Trade-off: May change build architecture (aarch64 → x86_64), which changes the `depends_on` chain.
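The rerouting step might look roughly like the sketch below. It is not the actual `switch_jobs_to_aws()` code: the helper `reroute_stuck_steps`, the `fallback` table, and the pairing of `hetzner-aarch64-4cpu-8gb` with `linux-aarch64-small` are hypothetical, and the sketch does not attempt to fix up `depends_on` when the architecture changes.

```python
def reroute_stuck_steps(steps: list, stuck: set, fallback: dict) -> list:
    """Return new step dicts with stuck queues replaced by their fallback.

    Hypothetical helper: `fallback` maps a stuck queue name to a replacement
    queue name. Steps on healthy or unmapped queues are returned unchanged.
    """
    rerouted = []
    for step in steps:
        queue = step.get("agents", {}).get("queue")
        if queue in stuck and queue in fallback:
            # Copy the step so the caller's input is not mutated.
            step = {**step, "agents": {**step["agents"], "queue": fallback[queue]}}
        rerouted.append(step)
    return rerouted


# Illustrative data only: the queue pairing below is an assumption.
steps = [
    {"id": "build", "agents": {"queue": "hetzner-aarch64-4cpu-8gb"}},
    {"id": "lint", "agents": {"queue": "linux-aarch64"}},
]
stuck = {"hetzner-aarch64-4cpu-8gb"}
fallback = {"hetzner-aarch64-4cpu-8gb": "linux-aarch64-small"}
rerouted = reroute_stuck_steps(steps, stuck, fallback)
```

When debugging an unexpected agent assignment, comparing a step's queue before and after such a rewrite is usually the quickest way to confirm that failover, rather than the step definition, chose the agent.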

Reasoning

The priority system reflects the team's operational needs: release builds must complete quickly because downstream processes depend on them, while Dependabot and main-branch builds can tolerate delays. The Hetzner failover mechanism was introduced because aarch64 availability on Hetzner has been unreliable (noted with `TODO(def-): Remove me when Hetzner fixes its aarch64 availability`). The 20-minute stuck threshold is a pragmatic balance between avoiding premature failovers and not waiting too long for unavailable agents.

Code Evidence

Priority assignment from `mkpipeline.py:229-263`:

def prioritize_pipeline(pipeline, priority):
    tag = os.environ["BUILDKITE_TAG"]
    branch = os.getenv("BUILDKITE_BRANCH")
    build_author = os.getenv("BUILDKITE_BUILD_AUTHOR")
    priority += pipeline.get("priority", 0)
    if tag.startswith("v"):
        priority += 10  # Release results are time sensitive
    if branch == "main":
        priority -= 50  # main branch is less time sensitive
    if build_author == "Dependabot":
        priority -= 40  # Dependabot is less urgent
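As a worked example of the arithmetic in the excerpt above, the following pure-function sketch applies the same three adjustments without reading environment variables. The function name `effective_priority` and the sample tag, branch, and author values are hypothetical; only the offsets come from the excerpt.

```python
def effective_priority(base: int, tag: str, branch: str, build_author: str) -> int:
    """Apply the release, main-branch, and Dependabot offsets to a base priority."""
    priority = base
    if tag.startswith("v"):
        priority += 10  # release results are time sensitive
    if branch == "main":
        priority -= 50  # main branch is less time sensitive
    if build_author == "Dependabot":
        priority -= 40  # Dependabot is less urgent
    return priority


# Sample contexts (values are illustrative only).
release = effective_priority(0, "v0.100.0", "v0.100.0", "alice")
developer_pr = effective_priority(0, "", "feature-x", "alice")
dependabot_pr = effective_priority(0, "", "dependabot/cargo/foo", "Dependabot")
main_build = effective_priority(0, "", "main", "alice")
```

With a base of 0 these evaluate to 10, 0, -40, and -50 respectively. Because Buildkite runs higher-priority jobs first, the resulting order is release, developer PR, Dependabot PR, then main.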

Stuck queue detection from `mkpipeline.py:439-447`:

if datetime.now(timezone.utc) - datetime.fromisoformat(
    runnable
) < timedelta(minutes=20):
    continue
print(f"Job {job.get('id')} ... is runnable since {runnable} on {queue}, "
      f"considering {queue} stuck")
stuck.add(queue)
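The excerpt above is a fragment of a larger loop. A self-contained sketch of the same 20-minute check is shown below, assuming each job is a dict with `queue` and `runnable_at` keys (an assumption about the job payload) and passing `now` in explicitly so the result is reproducible. The helper name `find_stuck_queues` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

STUCK_AFTER = timedelta(minutes=20)


def find_stuck_queues(jobs: list, now: datetime) -> set:
    """Return the queues holding a job that has been runnable for 20 minutes or more."""
    stuck = set()
    for job in jobs:
        runnable = job.get("runnable_at")
        if not runnable:
            continue  # job is not runnable yet, so it cannot be waiting on an agent
        if now - datetime.fromisoformat(runnable) < STUCK_AFTER:
            continue  # still within the grace period
        stuck.add(job["queue"])
    return stuck


now = datetime(2026, 2, 8, 21, 0, tzinfo=timezone.utc)
jobs = [
    # Runnable for 30 minutes: its queue counts as stuck.
    {"id": "a", "queue": "hetzner-aarch64-4cpu-8gb",
     "runnable_at": "2026-02-08T20:30:00+00:00"},
    # Runnable for 5 minutes: still within the threshold.
    {"id": "b", "queue": "hetzner-aarch64-8cpu-16gb",
     "runnable_at": "2026-02-08T20:55:00+00:00"},
]
stuck_queues = find_stuck_queues(jobs, now)
```

Timestamps use an explicit `+00:00` offset because `datetime.fromisoformat` only accepts a trailing `Z` from Python 3.11 onward.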

Hardcoded aarch64 stuck queues from `mkpipeline.py:368-376`:

# TODO(def-): Remove me when Hetzner fixes its aarch64 availability
stuck.update([
    "hetzner-aarch64-16cpu-32gb",
    "hetzner-aarch64-8cpu-16gb",
    "hetzner-aarch64-4cpu-8gb",
    "hetzner-aarch64-2cpu-4gb",
])
