# Implementation: Open Instruct Mason `main()`
| Knowledge Sources | |
|---|---|
| Domains | MLOps, Distributed Training, Experiment Management, CLI Tooling |
| Last Updated | 2026-02-07 00:00 GMT |
## Overview

The CLI entrypoint of Open Instruct's Mason launcher: a concrete tool for submitting containerized training experiments to Beaker compute clusters.
## Description

The `main()` function in `mason.py` is the CLI entrypoint for the Mason experiment launcher. It orchestrates the complete lifecycle of submitting an Open Instruct training job to Beaker: parsing command-line arguments, detecting whether the user is internal (AI2) or external, optionally caching datasets locally, building internal commands with proper environment variables and accelerate configurations, constructing a Beaker experiment specification, and submitting the experiment to the Beaker platform with exponential backoff retry logic.
The function handles several complex workflows:

- User detection: Checks for Beaker credentials (`~/.beaker/config.yml` or `BEAKER_TOKEN`) to distinguish AI2-internal users from external users. External users receive the rendered command for manual execution.
- Command construction: Calls `make_internal_command()` for each command, which handles dataset caching, output directory rewriting, GCP model uploads, multi-node accelerate configuration, and proper shell escaping.
- Task specification: Calls `make_task_spec()` to build a `BeakerTaskSpec` with GPU resources, cluster constraints, environment variables (including secrets for `HF_TOKEN`, `WANDB_API_KEY`, etc.), data mounts (Weka or GCS), and multi-node settings.
- Experiment submission: Creates a `BeakerExperimentSpec` with budget, retry policy, and all task specs, then submits it with retry logic (up to 5 attempts with exponential backoff) to handle transient timeout errors.
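The submission step can be sketched as follows. The real code wraps the Beaker SDK call with the `backoff` package; this hand-rolled sketch shows the equivalent control flow, with `submit_fn` and `spec` as hypothetical stand-ins for the SDK call and experiment spec, and `TimeoutError` standing in for the requests timeout it catches:

```python
import time

def submit_with_retry(submit_fn, spec, max_tries=5, base_delay=1.0, sleep=time.sleep):
    # Retry submit_fn(spec) on transient timeouts with exponential backoff
    # (1s, 2s, 4s, ...), mirroring the "up to 5 attempts" policy above.
    for attempt in range(max_tries):
        try:
            return submit_fn(spec)
        except TimeoutError:
            if attempt == max_tries - 1:
                raise  # all attempts exhausted; surface the error
            sleep(base_delay * (2 ** attempt))
```

Injecting `sleep` keeps the sketch testable without real delays; the production path simply blocks between attempts.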
## Usage
Use this tool when you need to:
- Launch any Open Instruct training job (SFT, DPO, GRPO, reward modeling) on Beaker clusters.
- Submit multi-node distributed training experiments with automatic accelerate configuration.
- Run experiments that require dataset pre-caching, automatic output directory management, or GCP model staging.
- Execute custom Python commands on Beaker GPU infrastructure outside of Open Instruct training.
## Code Reference

### Source Location
- Repository: Open Instruct
- File: `mason.py`, lines 833-895
- Supporting functions: `get_args()` (L108-222), `make_internal_command()` (L426-737), `make_task_spec()` (L740-799)
### Signature
```python
def main():
    args, commands = get_args()
    # Detects AI2 vs external user based on Beaker credentials
    # For external users: prints rendered commands and returns
    # For internal users:
    #   1. Builds internal commands (caching, output dir, GCP staging)
    #   2. Constructs BeakerExperimentSpec with tasks, budget, retry
    #   3. Submits experiment with exponential backoff retry
```
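The internal/external branch hinges on the credential check described above. A minimal sketch, assuming only what the description states (the function name here is hypothetical; `mason.py` performs this check inline in `main()`):

```python
import os
from pathlib import Path

def is_ai2_internal_user() -> bool:
    # A user counts as AI2-internal if Beaker credentials are present,
    # either as a config file or as a BEAKER_TOKEN environment variable.
    config = Path.home() / ".beaker" / "config.yml"
    return config.exists() or bool(os.environ.get("BEAKER_TOKEN"))
```

External users (where this returns false) get the rendered command printed for manual execution rather than a Beaker submission.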
The `get_args()` function defines the full argument parser:

```python
def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--cluster", type=str, nargs="+", required=True,
                        help="Beaker clusters on which the job could be run.")
    parser.add_argument("--budget", type=str, required=True, help="Budget to use.")
    parser.add_argument("--gpus", type=int, default=0, help="Number of gpus")
    parser.add_argument("--num_nodes", type=int, default=1, help="Number of nodes")
    parser.add_argument("--image", type=str,
                        default="ai2/cuda11.8-cudnn8-dev-ubuntu20.04",
                        help="Beaker base image.")
    parser.add_argument("--workspace", type=str, default=None,
                        help="The Beaker workspace to use.")
    parser.add_argument("--priority", type=str, default="normal",
                        help="Beaker job priority.")
    parser.add_argument("--preemptible", action="store_true",
                        help="Run as preemptible")
    parser.add_argument("--max_retries", type=int, default=0,
                        help="Number of retries")
    parser.add_argument("--shared_memory", type=str, default="10.24gb")
    parser.add_argument("--description", type=str,
                        default="Beaker-Mason job.")
    parser.add_argument("--task_name", type=str, default="beaker_mason")
    parser.add_argument("--beaker_datasets", nargs="*",
                        type=parse_beaker_dataset, default=[])
    parser.add_argument("--pure_docker_mode", action="store_true")
    parser.add_argument("--non_resumable", action="store_true")
    parser.add_argument("--no_auto_dataset_cache", action="store_true")
    parser.add_argument("--auto_output_dir_path", type=str,
                        default="/weka/oe-adapt-default/allennlp/deletable_checkpoint")
    parser.add_argument("--env", type=parse_env_var, action="append", default=[])
    parser.add_argument("--secret", type=parse_env_var, action="append", default=[])
    parser.add_argument("--no-host-networking", action="store_true")
    parser.add_argument("--timeout", type=str, default=None)
    # ...
    mason_args, command_args = parser.parse_known_args()
    commands = parse_commands(command_args)
    return mason_args, commands
```
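The parser references two converter helpers. Their names match the calls in `get_args()` and their formats come from the I/O contract below (`name=value` for `--env`/`--secret`, `mount_path:beaker_id` for `--beaker_datasets`), but these implementations are sketches, not the actual mason.py code:

```python
def parse_env_var(s: str) -> dict:
    # Parses the "name=value" format documented for --env/--secret.
    name, sep, value = s.partition("=")
    if not sep or not name:
        raise ValueError(f"expected name=value, got {s!r}")
    return {"name": name, "value": value}

def parse_beaker_dataset(s: str) -> dict:
    # Parses the "mount_path:beaker_id" format documented for
    # --beaker_datasets; split on the last colon so mount paths are safe.
    mount_path, sep, beaker_id = s.rpartition(":")
    if not sep or not mount_path:
        raise ValueError(f"expected mount_path:beaker_id, got {s!r}")
    return {"mount_path": mount_path, "beaker": beaker_id}
```

Because these run as `argparse` `type=` callbacks, a malformed value fails at parse time with a clear error rather than deep inside job construction.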
### Import

```shell
# Run as CLI tool (primary usage)
python mason.py --cluster ai2/jupiter --budget ai2/oe-adapt --gpus 8 \
    -- python open_instruct/grpo_fast.py [training_args...]

# Or via the wrapper script, which builds a Docker image first
./scripts/train/build_image_and_launch.sh scripts/train/debug/single_gpu_on_beaker.sh
```
## I/O Contract

### Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| `--cluster` | str (nargs=+) | Yes | Target Beaker cluster(s) for scheduling (e.g., `ai2/jupiter`, `ai2/saturn`). |
| `--budget` | str | Yes | Organizational budget account for billing (e.g., `ai2/oe-adapt`). |
| `--gpus` | int | No (default: 0) | Number of GPUs per node to allocate. |
| `--num_nodes` | int | No (default: 1) | Number of nodes for distributed training. Values > 1 enable multi-node mode with leader election. |
| `--image` | str | No (default: `ai2/cuda11.8-cudnn8-dev-ubuntu20.04`) | Beaker base Docker image to use for the experiment. |
| `--workspace` | str | No (default: user default) | Beaker workspace for experiment organization (e.g., `ai2/tulu-3-dev`). |
| `--priority` | str | No (default: normal) | Job priority: low, normal, high, or urgent. |
| `--preemptible` | flag | No (default: false) | Whether the job can be preempted by higher-priority jobs. |
| `--max_retries` | int | No (default: 0) | Maximum number of automatic retries for failed tasks. |
| `--shared_memory` | str | No (default: 10.24gb) | Shared memory allocation for the container. |
| `--description` | str | No (default: "Beaker-Mason job.") | Human-readable description for the experiment. |
| `--beaker_datasets` | str (nargs=*) | No | Additional Beaker datasets to mount, formatted as `mount_path:beaker_id`. |
| `--env` | str (repeatable) | No | Additional environment variables in `name=value` format. |
| `--secret` | str (repeatable) | No | Additional secret environment variables in `name=value` format. |
| `--non_resumable` | flag | No | Disable automatic resumable mode for GRPO jobs. |
| `--no_auto_dataset_cache` | flag | No | Skip local dataset caching before submission. |
| `--pure_docker_mode` | flag | No | Run in pure Docker mode without mounting shared filesystems. |
| `--timeout` | str | No | Task timeout as a duration string (e.g., 15m, 1h, 2h30m). |
| `-- [command...]` | str | Yes | The training command, separated by the `--` delimiter (e.g., `-- python open_instruct/grpo_fast.py ...`). |
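The `-- [command...]` contract is worth making concrete: everything `argparse` leaves over from `parse_known_args()` is split on `--` delimiters into one or more commands. A minimal sketch of that splitting, assuming only the delimiter behavior documented here (the real `parse_commands` in mason.py may validate further):

```python
def parse_commands(command_args: list[str]) -> list[list[str]]:
    # Split leftover argv into commands on the "--" delimiter; each
    # delimiter starts a new command, so two "--" sections yield two jobs.
    if not command_args or command_args[0] != "--":
        raise ValueError("commands must start with the `--` delimiter")
    commands, current = [], []
    for tok in command_args:
        if tok == "--":
            if current:
                commands.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        commands.append(current)
    return commands
```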
### Outputs

| Name | Type | Description |
|---|---|---|
| Beaker Experiment | Beaker experiment object | A submitted experiment on the Beaker platform with a unique ID and tracking URL. |
| Console URL | str | Logged URL in the format `https://beaker.org/ex/{experiment_id}` for monitoring. |
| Cached Dataset | directory (side effect) | Preprocessed dataset stored on shared storage when auto-caching is enabled. |
| Docker Image | container image (side effect) | Built and pushed Docker image when using `build_image_and_launch.sh`. |
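The console URL follows directly from the format in the table above; a one-line helper (the function name is hypothetical, mason.py just logs the formatted string):

```python
def beaker_console_url(experiment_id: str) -> str:
    # Tracking URL logged after submission, per the Outputs table.
    return f"https://beaker.org/ex/{experiment_id}"
```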
## Usage Examples

### Basic Usage
```shell
# Single-node GRPO training on the Jupiter cluster (8 GPUs)
python mason.py \
    --cluster ai2/jupiter \
    --budget ai2/oe-adapt \
    --gpus 8 \
    --workspace ai2/tulu-3-dev \
    -- python open_instruct/grpo_fast.py \
        --model_name_or_path allenai/Llama-3.1-Tulu-3-8B \
        --dataset_mixer '{"allenai/tulu-3-sft-mixture": 1.0}' \
        --max_seq_length 2048 \
        --output_dir /output/
```
### Tulu 3 8B Example
```shell
# Multi-node GRPO training for Tulu 3 8B (2 nodes, 8 GPUs each)
python mason.py \
    --cluster ai2/jupiter ai2/saturn \
    --budget ai2/oe-adapt \
    --gpus 8 \
    --num_nodes 2 \
    --workspace ai2/tulu-3-dev \
    --priority normal \
    --max_retries 3 \
    -- accelerate launch --num_processes 8 \
        open_instruct/grpo_fast.py \
        --model_name_or_path allenai/Llama-3.1-Tulu-3-8B \
        --dataset_mixer '{"allenai/tulu-3-sft-mixture": 1.0}' \
        --max_seq_length 4096 \
        --per_device_train_batch_size 1 \
        --gradient_accumulation_steps 16 \
        --learning_rate 5e-7 \
        --with_tracking \
        --output_dir /output/tulu3-8b-grpo
```
### DPO Training Example
```shell
# Single-GPU DPO debug run via the wrapper script
./scripts/train/build_image_and_launch.sh scripts/train/debug/dpo/single_gpu.sh

# Or directly with mason (single node, 8 GPUs)
python mason.py \
    --cluster ai2/jupiter \
    --budget ai2/oe-adapt \
    --gpus 8 \
    --workspace ai2/tulu-3-dev \
    -- python open_instruct/dpo.py \
        --model_name_or_path allenai/Llama-3.1-Tulu-3-8B-SFT \
        --dataset_mixer '{"allenai/tulu-3-pref-mixture": 1.0}' \
        --max_seq_length 2048 \
        --output_dir /output/tulu3-8b-dpo
```
### External User Example
```shell
# External users (without Beaker credentials) will see the rendered command
# printed to the console for manual execution on their own infrastructure.
python mason.py \
    --cluster my-cluster \
    --budget my-budget \
    --gpus 4 \
    -- python open_instruct/finetune.py \
        --model_name_or_path meta-llama/Llama-3.1-8B \
        --output_dir /output/my-sft-run
```
## Dependencies
| Package | Purpose |
|---|---|
| `beaker` | Python SDK for the Beaker platform (experiment creation, secret management, dataset mounts) |
| `backoff` | Exponential backoff retry logic for experiment submission |
| `requests` | HTTP requests (used by backoff for timeout exception handling) |
| `rich` | Console output formatting with colored rules and styled text |
| `open_instruct.launch_utils` | Cluster constants (`WEKA_CLUSTERS`, `GCP_CLUSTERS`, `INTERCONNECT_CLUSTERS`), GCS operations, HF download utilities |
| `argparse` | CLI argument parsing |
| `subprocess` | Local dataset caching command execution |
| `docker` | Docker image building (via the `build_image_and_launch.sh` wrapper) |