Environment:Allenai Open instruct Beaker Cluster
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Distributed_Training |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Beaker experiment execution environment with cluster-specific NCCL networking, HuggingFace caching, and secret management.
Description
Open Instruct uses AI2's Beaker platform for managed GPU experiment execution. The `mason.py` launcher handles cluster-specific configuration: NCCL settings, HuggingFace cache paths, vLLM configuration, and secret injection. Three cluster types are supported: WEKA (on-premise with InfiniBand), GCP (cloud with FastRack networking), and generic clusters.
Usage
Use this environment for all production training experiments including SFT, DPO, GRPO, and Reward Modeling. The mason.py launcher automatically selects the correct cluster configuration. Local debugging does not require Beaker.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Platform | Beaker (AI2 managed) | Requires Beaker workspace access |
| GPU Cluster | WEKA, GCP, or generic | Each has different NCCL configuration |
| Networking | InfiniBand (WEKA) or FastRack (GCP) | For multi-node NCCL communication |
Dependencies
Beaker Packages
- `beaker-py` >= 2.5.3 (Python SDK)
- Beaker CLI v1.5.235 (in Docker container)
Default Environment Variables (set by mason.py)
- `RAY_CGRAPH_get_timeout` = 300
- `VLLM_DISABLE_COMPILE_CACHE` = 1
- `VLLM_USE_V1` = 1
- `VLLM_ALLOW_INSECURE_SERIALIZATION` = 1
- `VLLM_ATTENTION_BACKEND` = FLASH_ATTN
- `NCCL_DEBUG` = ERROR
- `VLLM_LOGGING_LEVEL` = WARNING
Credentials
The following secrets must be configured in the Beaker workspace:
- `HF_TOKEN`: HuggingFace API token for model/dataset downloads
- `WANDB_API_KEY`: Weights & Biases API key for experiment logging
- `BEAKER_TOKEN`: Beaker authentication token
- `OPENAI_API_KEY`: For LLM judge reward verification (GRPO)
- `AZURE_API_KEY`: For Azure-based LLM judge
- `AZURE_API_BASE`: Azure API endpoint URL
- `ANTHROPIC_API_KEY`: For Anthropic-based LLM judge
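A pre-flight check can catch a missing secret before an experiment is submitted. The sketch below is illustrative only: `missing_secrets` is a hypothetical helper, and the workspace secret names are passed in as a plain list because the Beaker SDK call that would list them is not shown in this page.

```python
# Secret names taken from the list above.
REQUIRED_SECRETS = [
    "HF_TOKEN",
    "WANDB_API_KEY",
    "BEAKER_TOKEN",
    "OPENAI_API_KEY",
    "AZURE_API_KEY",
    "AZURE_API_BASE",
    "ANTHROPIC_API_KEY",
]


def missing_secrets(workspace_secret_names, required=REQUIRED_SECRETS):
    """Return the required secret names absent from the workspace, in list order."""
    present = set(workspace_secret_names)
    return [name for name in required if name not in present]


if __name__ == "__main__":
    # Hypothetical workspace with only two of the secrets configured.
    for name in missing_secrets(["HF_TOKEN", "WANDB_API_KEY"]):
        print(f"missing Beaker secret: {name}")
```

Whether every secret is strictly required depends on the job; the LLM-judge keys, for example, are described above as relevant to GRPO reward verification.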
Quick Install
# Install beaker-py for experiment management
# Quote the requirement so the shell does not treat ">" as an output redirect
pip install "beaker-py>=2.5.3"
# Launch an experiment via mason.py
python mason.py --cluster ai2/pluto-cirrascale --budget ai2/oe-adapt \
    --gpus 8 --task_name my_experiment -- python open_instruct/grpo_fast.py ...
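When launches are scripted, the same command line can be assembled programmatically. This is a minimal sketch, not part of `mason.py`: `build_mason_argv` is a hypothetical helper, and it uses only the flags that appear in the command above. The training arguments elided with `...` in the original are left out here as well.

```python
import shlex


def build_mason_argv(cluster, budget, gpus, task_name, command):
    """Build the argv list for a mason.py launch; `command` runs after `--`."""
    if not command:
        raise ValueError("command must not be empty")
    return [
        "python", "mason.py",
        "--cluster", cluster,
        "--budget", budget,
        "--gpus", str(gpus),
        "--task_name", task_name,
        "--", *command,
    ]


if __name__ == "__main__":
    argv = build_mason_argv(
        cluster="ai2/pluto-cirrascale",
        budget="ai2/oe-adapt",
        gpus=8,
        task_name="my_experiment",
        command=["python", "open_instruct/grpo_fast.py"],
    )
    # Print a shell-safe rendering; pass argv to subprocess.run to execute it.
    print(shlex.join(argv))
```

Building a list rather than a single string avoids shell-quoting mistakes when task names or arguments contain spaces.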
Code Evidence
Secrets injection from `mason.py:286-300`:
useful_secrets = [
    "HF_TOKEN",
    "WANDB_API_KEY",
    "BEAKER_TOKEN",
    "OPENAI_API_KEY",
    "AZURE_API_KEY",
    "AZURE_API_BASE",
    "ANTHROPIC_API_KEY",
]
WEKA cluster NCCL config from `mason.py:324-325`:
beaker.BeakerEnvVar(name="NCCL_SOCKET_IFNAME", value="ib"),
beaker.BeakerEnvVar(name="NCCL_IB_HCA", value="^=mlx5_bond_0"),
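The per-cluster branching can be pictured as a lookup from cluster type to extra environment variables. The sketch below is an assumption-laden illustration rather than the `mason.py` logic: `cluster_env` and the `"weka"` / `"gcp"` / `"generic"` labels are hypothetical, the WEKA values are the two shown above, and the GCP entry includes only the HF transfer setting documented on this page (the 20+ GCP NCCL variables are not reproduced).

```python
def cluster_env(cluster_type: str) -> dict:
    """Return extra environment variables for a cluster type (illustrative only)."""
    if cluster_type == "weka":
        # Values taken from the WEKA excerpt above.
        return {
            "NCCL_SOCKET_IFNAME": "ib",
            "NCCL_IB_HCA": "^=mlx5_bond_0",
        }
    if cluster_type == "gcp":
        # Only the documented HF transfer setting; GCP NCCL tuning is omitted.
        return {"HF_HUB_ENABLE_HF_TRANSFER": "0"}
    if cluster_type == "generic":
        # No special NCCL configuration.
        return {}
    raise ValueError(f"unknown cluster type: {cluster_type!r}")


if __name__ == "__main__":
    for kind in ("weka", "gcp", "generic"):
        print(kind, cluster_env(kind))
```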
Beaker detection from `olmo_core_callbacks.py:28,49-54`:
BEAKER_WORKLOAD_ID_ENV_VAR = "BEAKER_WORKLOAD_ID"
if self.enabled is None and BEAKER_WORKLOAD_ID_ENV_VAR in os.environ:
    self.enabled = True
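The detection pattern in this excerpt is a tri-state flag: `None` means "decide from the environment". A standalone sketch of that idea follows. `resolve_enabled` is a hypothetical function, and unlike the excerpt it returns `False` when the variable is absent (the excerpt does not show what happens in that case, so this is an assumption).

```python
import os

BEAKER_WORKLOAD_ID_ENV_VAR = "BEAKER_WORKLOAD_ID"


def resolve_enabled(enabled, environ=None):
    """Resolve a tri-state flag; None means auto-detect a Beaker job."""
    env = os.environ if environ is None else environ
    if enabled is None:
        # Treat the presence of the workload ID as "running on Beaker".
        return BEAKER_WORKLOAD_ID_ENV_VAR in env
    # An explicit True/False from the caller always wins.
    return enabled


if __name__ == "__main__":
    print("Beaker features enabled:", resolve_enabled(None))
```

Accepting the environment as a parameter keeps the check testable without modifying `os.environ`.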
Default vLLM environment from `mason.py:98-104`:
DEFAULT_ENV_VARS = {
    "RAY_CGRAPH_get_timeout": "300",
    "VLLM_DISABLE_COMPILE_CACHE": "1",
    "VLLM_USE_V1": "1",
    "VLLM_ALLOW_INSECURE_SERIALIZATION": "1",
    "VLLM_ATTENTION_BACKEND": "FLASH_ATTN",
}
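One way to think about a defaults dictionary like this is as the base layer of the job environment, with per-experiment values layered on top. The sketch below assumes that overrides take precedence over defaults; that precedence and the `merged_env` helper are assumptions for illustration, not behavior confirmed by the excerpt.

```python
# Copy of the excerpt above, repeated so the sketch is self-contained.
DEFAULT_ENV_VARS = {
    "RAY_CGRAPH_get_timeout": "300",
    "VLLM_DISABLE_COMPILE_CACHE": "1",
    "VLLM_USE_V1": "1",
    "VLLM_ALLOW_INSECURE_SERIALIZATION": "1",
    "VLLM_ATTENTION_BACKEND": "FLASH_ATTN",
}


def merged_env(overrides=None):
    """Return defaults with overrides applied; the defaults dict is not mutated."""
    env = dict(DEFAULT_ENV_VARS)
    env.update(overrides or {})
    return env


if __name__ == "__main__":
    # Hypothetical override of a single default.
    for name, value in merged_env({"VLLM_ATTENTION_BACKEND": "FLASHINFER"}).items():
        print(f"{name}={value}")
```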
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `Secret not found` | Missing Beaker workspace secret | Configure the required secret in your Beaker workspace |
| NCCL timeout on multi-node | Incorrect network interface | Verify cluster type is correctly detected by mason.py |
| HF upload issues on GCP | Known upload issues with HF transfer on GCP clusters | Already handled: mason.py sets `HF_HUB_ENABLE_HF_TRANSFER=0` on GCP |
Compatibility Notes
- WEKA clusters: Use InfiniBand networking with `NCCL_SOCKET_IFNAME=ib`. HF cache at `/weka/oe-adapt-default/allennlp/.cache/`.
- GCP clusters: Use FastRack networking with extensive NCCL tuning (20+ env vars). HF cache at `/filestore/.cache/`. HF transfer disabled.
- Generic clusters: No special NCCL configuration. Standard HF cache paths.
- W&B resumption: Beaker experiments can be preempted and resumed; the W&B run is continued via `WANDB_RUN_ID` and `WANDB_RESUME=allow`.
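The W&B resumption note relies on two environment variables that the `wandb` client reads at startup. A minimal sketch of setting them is shown below. `wandb_resume_env` is a hypothetical helper, and how the stable run ID is chosen (it must be identical before and after preemption) is left to the caller as an assumption.

```python
import os


def wandb_resume_env(run_id):
    """Return the environment variables that let a restarted job rejoin its W&B run."""
    if not run_id:
        raise ValueError("run_id must be a non-empty, stable identifier")
    return {
        "WANDB_RUN_ID": run_id,
        # "allow" resumes the run if it exists and starts a new one otherwise.
        "WANDB_RESUME": "allow",
    }


if __name__ == "__main__":
    # Hypothetical run ID; any value that survives a restart works.
    os.environ.update(wandb_resume_env("my-experiment-0001"))
    print(os.environ["WANDB_RUN_ID"], os.environ["WANDB_RESUME"])
```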