

Environment:Allenai Open instruct Beaker Cluster



Knowledge Sources
Domains: Infrastructure, Distributed_Training
Last Updated: 2026-02-07 00:00 GMT

Overview

Beaker experiment execution environment with cluster-specific NCCL networking, HuggingFace caching, and secret management.

Description

Open Instruct uses AI2's Beaker platform for managed GPU experiment execution. The `mason.py` launcher handles cluster-specific configuration: NCCL settings for WEKA clusters (InfiniBand) and GCP clusters (FasTrak networking), HuggingFace cache paths, vLLM configuration, and secret injection. Three cluster types are supported: WEKA (on-premise with InfiniBand), GCP (cloud with FasTrak networking), and generic clusters.

Usage

Use this environment for all production training experiments including SFT, DPO, GRPO, and Reward Modeling. The mason.py launcher automatically selects the correct cluster configuration. Local debugging does not require Beaker.

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| Platform | Beaker (AI2 managed) | Requires Beaker workspace access |
| GPU Cluster | WEKA, GCP, or generic | Each has different NCCL configuration |
| Networking | InfiniBand (WEKA) or FasTrak (GCP) | For multi-node NCCL communication |

Dependencies

Beaker Packages

  • `beaker-py` >= 2.5.3 (Python SDK)
  • Beaker CLI v1.5.235 (in Docker container)

Default Environment Variables (set by mason.py)

  • `RAY_CGRAPH_get_timeout` = 300
  • `VLLM_DISABLE_COMPILE_CACHE` = 1
  • `VLLM_USE_V1` = 1
  • `VLLM_ALLOW_INSECURE_SERIALIZATION` = 1
  • `VLLM_ATTENTION_BACKEND` = FLASH_ATTN
  • `NCCL_DEBUG` = ERROR
  • `VLLM_LOGGING_LEVEL` = WARNING
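
To illustrate how defaults like these usually combine with experiment-specific settings, the sketch below layers caller-supplied values on top of the defaults so that explicit settings win. The helper `merge_env_vars` is hypothetical and is not taken from `mason.py`; whether the launcher allows overrides in exactly this way is an assumption. Only the variable names and values come from the list above.

```python
# Defaults copied from the list above.
DEFAULT_ENV_VARS = {
    "RAY_CGRAPH_get_timeout": "300",
    "VLLM_DISABLE_COMPILE_CACHE": "1",
    "VLLM_USE_V1": "1",
    "VLLM_ALLOW_INSECURE_SERIALIZATION": "1",
    "VLLM_ATTENTION_BACKEND": "FLASH_ATTN",
    "NCCL_DEBUG": "ERROR",
    "VLLM_LOGGING_LEVEL": "WARNING",
}


def merge_env_vars(overrides: dict[str, str]) -> dict[str, str]:
    """Hypothetical helper: return the defaults with overrides applied on top."""
    merged = dict(DEFAULT_ENV_VARS)  # copy so the defaults stay untouched
    merged.update(overrides)
    return merged


# Example: raise NCCL verbosity while debugging a multi-node hang.
env = merge_env_vars({"NCCL_DEBUG": "INFO"})
print(env["NCCL_DEBUG"])
```

Copying the defaults before updating keeps the module-level dictionary unchanged across launches.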

Credentials

The following secrets must be configured in the Beaker workspace:

  • `HF_TOKEN`: HuggingFace API token for model/dataset downloads
  • `WANDB_API_KEY`: Weights & Biases API key for experiment logging
  • `BEAKER_TOKEN`: Beaker authentication token
  • `OPENAI_API_KEY`: For LLM judge reward verification (GRPO)
  • `AZURE_API_KEY`: For Azure-based LLM judge
  • `AZURE_API_BASE`: Azure API endpoint URL
  • `ANTHROPIC_API_KEY`: For Anthropic-based LLM judge
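
Before launching, it can help to check which of these secrets a workspace actually has. The sketch below assumes the set of available secret names has already been fetched by some other means (the Beaker lookup itself is deliberately left out); `split_secrets` is a hypothetical helper, not part of `mason.py`.

```python
# Secret names copied from the list above.
USEFUL_SECRETS = [
    "HF_TOKEN",
    "WANDB_API_KEY",
    "BEAKER_TOKEN",
    "OPENAI_API_KEY",
    "AZURE_API_KEY",
    "AZURE_API_BASE",
    "ANTHROPIC_API_KEY",
]


def split_secrets(available: set[str]) -> tuple[list[str], list[str]]:
    """Hypothetical helper: partition the expected secrets into (present, missing)."""
    present = [name for name in USEFUL_SECRETS if name in available]
    missing = [name for name in USEFUL_SECRETS if name not in available]
    return present, missing


# Example: a workspace that only has the two most common secrets configured.
present, missing = split_secrets({"HF_TOKEN", "WANDB_API_KEY"})
print("present:", present)
print("missing:", missing)
```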

Quick Install

# Install beaker-py for experiment management
pip install "beaker-py>=2.5.3"

# Launch an experiment via mason.py
python mason.py --cluster ai2/pluto-cirrascale --budget ai2/oe-adapt \
    --gpus 8 --task_name my_experiment -- python open_instruct/grpo_fast.py ...

Code Evidence

Secrets injection from `mason.py:286-300`:

useful_secrets = [
    "HF_TOKEN",
    "WANDB_API_KEY",
    "BEAKER_TOKEN",
    "OPENAI_API_KEY",
    "AZURE_API_KEY",
    "AZURE_API_BASE",
    "ANTHROPIC_API_KEY",
]

WEKA cluster NCCL config from `mason.py:324-325`:

beaker.BeakerEnvVar(name="NCCL_SOCKET_IFNAME", value="ib"),
beaker.BeakerEnvVar(name="NCCL_IB_HCA", value="^=mlx5_bond_0"),
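
A minimal sketch of per-cluster selection, combining the WEKA values above with the cache paths and HF transfer setting listed under Compatibility Notes. The `ClusterConfig` type, the `"weka"`/`"gcp"`/`"generic"` keys, and `config_for` are hypothetical names for illustration; the real launcher emits `beaker.BeakerEnvVar` objects and applies many more NCCL settings on GCP, which are omitted here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ClusterConfig:
    """Hypothetical container for the per-cluster settings described on this page."""
    env: dict[str, str] = field(default_factory=dict)
    hf_cache: Optional[str] = None  # None means "use the standard HF cache path"


CLUSTER_CONFIGS = {
    "weka": ClusterConfig(
        env={"NCCL_SOCKET_IFNAME": "ib", "NCCL_IB_HCA": "^=mlx5_bond_0"},
        hf_cache="/weka/oe-adapt-default/allennlp/.cache/",
    ),
    "gcp": ClusterConfig(
        # The GCP NCCL tuning variables are intentionally not reproduced here.
        env={"HF_HUB_ENABLE_HF_TRANSFER": "0"},
        hf_cache="/filestore/.cache/",
    ),
    "generic": ClusterConfig(),
}


def config_for(cluster_type: str) -> ClusterConfig:
    """Look up the settings for a cluster type, failing loudly on unknown types."""
    try:
        return CLUSTER_CONFIGS[cluster_type]
    except KeyError:
        raise ValueError(f"unknown cluster type: {cluster_type!r}") from None


print(config_for("weka").env["NCCL_SOCKET_IFNAME"])
```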

Beaker detection from `olmo_core_callbacks.py:28,49-54`:

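# Module-level constant (line 28)
BEAKER_WORKLOAD_ID_ENV_VAR = "BEAKER_WORKLOAD_ID"

# Inside a callback method (lines 49-54): auto-enable when running in a Beaker workload
if self.enabled is None and BEAKER_WORKLOAD_ID_ENV_VAR in os.environ:
    self.enabled = True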
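
The same tri-state pattern can be written as a standalone function: an explicit `True` or `False` is respected, and `None` means "auto-detect from the environment". This is only a sketch; `resolve_enabled` is a hypothetical name, and resolving `None` to `False` outside Beaker is an assumption (the excerpt above only shows the enabling branch).

```python
import os
from typing import Mapping, Optional

BEAKER_WORKLOAD_ID_ENV_VAR = "BEAKER_WORKLOAD_ID"


def resolve_enabled(
    enabled: Optional[bool],
    environ: Mapping[str, str] = os.environ,
) -> bool:
    """Hypothetical helper: explicit setting wins; None falls back to detection."""
    if enabled is None:
        # Treat the presence of a workload ID as "running inside Beaker".
        return BEAKER_WORKLOAD_ID_ENV_VAR in environ
    return enabled


# Passing the environment explicitly makes the behaviour easy to test.
print(resolve_enabled(None, {"BEAKER_WORKLOAD_ID": "example-id"}))   # True
print(resolve_enabled(False, {"BEAKER_WORKLOAD_ID": "example-id"}))  # False
```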

Default Ray and vLLM environment variables from `mason.py:98-104`:

DEFAULT_ENV_VARS = {
    "RAY_CGRAPH_get_timeout": "300",
    "VLLM_DISABLE_COMPILE_CACHE": "1",
    "VLLM_USE_V1": "1",
    "VLLM_ALLOW_INSECURE_SERIALIZATION": "1",
    "VLLM_ATTENTION_BACKEND": "FLASH_ATTN",
}

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `Secret not found` | Missing Beaker workspace secret | Configure the required secret in your Beaker workspace |
| NCCL timeout on multi-node | Incorrect network interface | Verify cluster type is correctly detected by mason.py |
| `HF_HUB_ENABLE_HF_TRANSFER=0` on GCP | Known upload issues on GCP clusters | Already handled; mason.py disables HF transfer on GCP |

Compatibility Notes

  • WEKA clusters: Use InfiniBand networking with `NCCL_SOCKET_IFNAME=ib`. HF cache at `/weka/oe-adapt-default/allennlp/.cache/`.
  • GCP clusters: Use FasTrak networking with extensive NCCL tuning (20+ env vars). HF cache at `/filestore/.cache/`. HF transfer disabled (`HF_HUB_ENABLE_HF_TRANSFER=0`).
  • Generic clusters: No special NCCL configuration. Standard HF cache paths.
  • W&B resumption: When a Beaker experiment is preempted and restarted, the W&B run is resumed via `WANDB_RUN_ID` and `WANDB_RESUME=allow`.
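
The W&B resumption note can be sketched as a small helper that builds the two environment variables. `wandb_resume_env` is a hypothetical name, and how the stable run ID is chosen is left to the caller; the only requirement assumed here is that the same ID is reused after a preemption. With `WANDB_RESUME=allow`, W&B continues an existing run with that ID or starts a new one if none exists.

```python
def wandb_resume_env(run_id: str) -> dict[str, str]:
    """Hypothetical helper: env vars that let a restarted job rejoin its W&B run."""
    if not run_id:
        raise ValueError("run_id must be a non-empty, stable identifier")
    return {
        "WANDB_RUN_ID": run_id,   # must be identical across restarts
        "WANDB_RESUME": "allow",  # resume if the run exists, otherwise create it
    }


# Example with a placeholder ID.
print(wandb_resume_env("my-experiment-001"))
```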
