Heuristic:MaterializeInc Materialize CI Agent Prioritization
| Knowledge Sources | |
|---|---|
| Domains | CI_CD, Optimization |
| Last Updated | 2026-02-08 21:00 GMT |
Overview
Priority-based scheduling and agent queue escalation heuristics for Buildkite CI, including automatic fallback from Hetzner to AWS agents when queues are stuck.
Description
The Materialize CI pipeline uses a multi-tiered priority system and automatic agent queue switching to optimize build throughput and reliability. The `prioritize_pipeline()` function assigns priorities based on build context (release, PR, main, Dependabot), while `switch_jobs_to_aws()` monitors Hetzner agent queues for stuck jobs and automatically reroutes them to AWS or x86_64 equivalents. Additionally, `increase_agents_timeouts()` escalates agent sizes and multiplies timeouts for sanitizer and coverage builds.
Usage
Apply this heuristic when debugging why a CI job ran on an unexpected agent queue, understanding build priority ordering, or configuring new pipeline steps. It is essential for understanding the Mkpipeline_Main implementation's agent assignment logic.
The Insight (Rule of Thumb)
Priority Rules:
- Action: Assign numeric priorities to pipeline steps based on context.
- Value:
  - Release tags (`v*`): +10 priority (time-sensitive)
  - Main branch: -50 priority (less urgent than PRs)
  - Dependabot PRs: -40 priority (less urgent than manual PRs)
  - Larger Hetzner agents: +1 to +2 bonus (preferential treatment on shared queues)
- Trade-off: PRs from developers get faster feedback at the cost of slower main-branch and Dependabot builds.
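The priority rules above can be sketched as a small pure function. This is a minimal illustration, not the actual `prioritize_pipeline()` code: the function name `compute_priority` and the `agent_bonus` parameter are assumptions made for the example, and it assumes the adjustments are additive (so a Dependabot build on main would receive both penalties). In Buildkite, jobs with a higher priority value are dispatched first.

```python
def compute_priority(base, tag="", branch=None, build_author=None, agent_bonus=0):
    """Return an adjusted priority for one build context (illustrative only).

    `agent_bonus` is a hypothetical stand-in for the +1 to +2 bonus that
    larger Hetzner agents receive; the exact mapping is not shown here.
    """
    priority = base
    if tag.startswith("v"):
        priority += 10  # release results are time-sensitive
    if branch == "main":
        priority -= 50  # main is less urgent than PRs
    if build_author == "Dependabot":
        priority -= 40  # Dependabot is less urgent than manual PRs
    return priority + agent_bonus


# Compare a few build contexts, all starting from a base priority of 0.
contexts = {
    "release": compute_priority(0, tag="v0.1.0"),
    "developer PR": compute_priority(0, branch="my-feature"),
    "dependabot PR": compute_priority(0, branch="deps", build_author="Dependabot"),
    "main": compute_priority(0, branch="main"),
}
for name, value in sorted(contexts.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value}")
```

Sorting by descending priority gives the ordering the trade-off describes: release builds first, then developer PRs, then Dependabot PRs, then main.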
Agent Queue Escalation (sanitizer/coverage builds):
- Action: Bump each agent to the next larger size for sanitizer and random-parameter builds.
- Value: Timeouts are multiplied by 10 for sanitizer builds and by 3 for coverage builds. Agent sizes are escalated one tier up (e.g., `linux-aarch64-small` → `linux-aarch64` → `linux-aarch64-medium`).
- Trade-off: Higher infrastructure cost for more memory and CPU.
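The escalation rule can be sketched as follows. This is a hedged illustration rather than the real `increase_agents_timeouts()`: the helper name `escalate_step`, the `SIZE_LADDER` list (built only from the three queue names in the example above), and the flag names are assumptions. It also assumes that the sanitizer multiplier wins when both sanitizer and coverage are set, and that a coverage-only build keeps its queue. The step keys `agents.queue` and `timeout_in_minutes` are standard Buildkite step attributes.

```python
# Size ladder taken from the example in the text; the real list is longer.
SIZE_LADDER = ["linux-aarch64-small", "linux-aarch64", "linux-aarch64-medium"]


def escalate_step(step, sanitizer=False, coverage=False, random_parameters=False):
    """Return a copy of `step` with a larger agent and a longer timeout."""
    new_step = dict(step)

    # Bump the agent one tier for sanitizer and random-parameter builds.
    queue = (step.get("agents") or {}).get("queue")
    if (sanitizer or random_parameters) and queue in SIZE_LADDER:
        # Stay on the largest size if the step is already at the top.
        next_index = min(SIZE_LADDER.index(queue) + 1, len(SIZE_LADDER) - 1)
        new_step["agents"] = {**step["agents"], "queue": SIZE_LADDER[next_index]}

    # Multiply the timeout: 10 for sanitizer, 3 for coverage (assumed precedence).
    factor = 10 if sanitizer else 3 if coverage else 1
    if "timeout_in_minutes" in step:
        new_step["timeout_in_minutes"] = step["timeout_in_minutes"] * factor

    return new_step


step = {"agents": {"queue": "linux-aarch64-small"}, "timeout_in_minutes": 30}
print(escalate_step(step, sanitizer=True))
```

Returning a copy instead of mutating the step keeps the sketch easy to test; the production code may well modify the pipeline in place.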
Hetzner Failover:
- Action: Detect stuck Hetzner queues (a job has been runnable for 20 minutes or more without being picked up) and switch the affected jobs to AWS or to another architecture.
- Value: Known aarch64 availability issues on Hetzner are automatically mitigated.
- Trade-off: May change build architecture (aarch64 → x86_64), which changes the `depends_on` chain.
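The rerouting step, including the `depends_on` side effect, might look roughly like the sketch below. It is not the actual `switch_jobs_to_aws()` implementation. The `FALLBACK_QUEUES` pairings, the `hetzner-x86-64-8cpu-16gb` queue name, and the `build-aarch64` / `build-x86_64` step keys are all hypothetical; only the `hetzner-aarch64-*` names appear in the source.

```python
# Hypothetical fallback table: which queue takes over when one is stuck.
FALLBACK_QUEUES = {
    "hetzner-aarch64-4cpu-8gb": "linux-aarch64",              # assumed AWS equivalent
    "hetzner-aarch64-8cpu-16gb": "hetzner-x86-64-8cpu-16gb",  # assumed x86_64 equivalent
}


def reroute_stuck_steps(steps, stuck):
    """Return copies of `steps` with stuck queues replaced by their fallback."""
    rerouted = []
    for step in steps:
        new_step = dict(step)
        agents = dict(step.get("agents") or {})
        queue = agents.get("queue")
        if queue in stuck and queue in FALLBACK_QUEUES:
            fallback = FALLBACK_QUEUES[queue]
            agents["queue"] = fallback
            new_step["agents"] = agents
            # Changing architecture means depending on the other build step
            # (step keys here are hypothetical).
            changed_arch = "aarch64" in queue and "x86-64" in fallback
            if changed_arch and step.get("depends_on") == "build-aarch64":
                new_step["depends_on"] = "build-x86_64"
        rerouted.append(new_step)
    return rerouted


steps = [
    {"agents": {"queue": "hetzner-aarch64-8cpu-16gb"}, "depends_on": "build-aarch64"},
    {"agents": {"queue": "hetzner-aarch64-4cpu-8gb"}, "depends_on": "build-aarch64"},
    {"label": "no agent requirements"},
]
stuck = {"hetzner-aarch64-8cpu-16gb", "hetzner-aarch64-4cpu-8gb"}
for step in reroute_stuck_steps(steps, stuck):
    print(step)
```

In this sketch only the first step switches architecture, so only its `depends_on` is rewritten; the second step stays on aarch64 and keeps its original dependency.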
Reasoning
The priority system reflects the team's operational needs: release builds must complete quickly because downstream processes depend on them, while Dependabot and main-branch builds can tolerate delays. The Hetzner failover mechanism was introduced because aarch64 availability on Hetzner has been unreliable (noted with `TODO(def-): Remove me when Hetzner fixes its aarch64 availability`). The 20-minute stuck threshold is a pragmatic balance between avoiding premature failovers and not waiting too long for unavailable agents.
Code Evidence
Priority assignment from `mkpipeline.py:229-263`:

    def prioritize_pipeline(pipeline, priority):
        tag = os.environ["BUILDKITE_TAG"]
        branch = os.getenv("BUILDKITE_BRANCH")
        build_author = os.getenv("BUILDKITE_BUILD_AUTHOR")
        priority += pipeline.get("priority", 0)
        if tag.startswith("v"):
            priority += 10  # Release results are time sensitive
        if branch == "main":
            priority -= 50  # main branch is less time sensitive
        if build_author == "Dependabot":
            priority -= 40  # Dependabot is less urgent
Stuck queue detection from `mkpipeline.py:439-447`:

    if datetime.now(timezone.utc) - datetime.fromisoformat(
        runnable
    ) < timedelta(minutes=20):
        continue
    print(f"Job {job.get('id')} ... is runnable since {runnable} on {queue}, "
          f"considering {queue} stuck")
    stuck.add(queue)
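To see the threshold check in isolation, here is a self-contained sketch built around the same comparison. The function name `find_stuck_queues`, the simplified job dictionaries, and their `runnable_at` and `queue` keys are assumptions for the example; the real code derives these values from Buildkite job data. The `now` parameter exists only to make the function deterministic in tests.

```python
from datetime import datetime, timedelta, timezone

STUCK_AFTER = timedelta(minutes=20)


def find_stuck_queues(jobs, now=None):
    """Return the queues that have a job runnable for at least STUCK_AFTER."""
    now = now or datetime.now(timezone.utc)
    stuck = set()
    for job in jobs:
        runnable = job.get("runnable_at")
        if not runnable:
            continue  # job is not waiting for an agent
        # Timestamps use an explicit +00:00 offset so that fromisoformat()
        # yields timezone-aware values on every Python 3 version.
        if now - datetime.fromisoformat(runnable) < STUCK_AFTER:
            continue  # still within the grace period
        stuck.add(job["queue"])
    return stuck


now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
jobs = [
    {"queue": "hetzner-aarch64-4cpu-8gb", "runnable_at": "2026-01-01T11:30:00+00:00"},
    {"queue": "hetzner-aarch64-8cpu-16gb", "runnable_at": "2026-01-01T11:50:00+00:00"},
    {"queue": "hetzner-aarch64-2cpu-4gb", "runnable_at": None},
]
print(find_stuck_queues(jobs, now=now))
```

Because the comparison is a strict `<`, a job that has waited exactly 20 minutes already counts as stuck.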
Hardcoded aarch64 stuck queues from `mkpipeline.py:368-376`:

    # TODO(def-): Remove me when Hetzner fixes its aarch64 availability
    stuck.update([
        "hetzner-aarch64-16cpu-32gb",
        "hetzner-aarch64-8cpu-16gb",
        "hetzner-aarch64-4cpu-8gb",
        "hetzner-aarch64-2cpu-4gb",
    ])