# Implementation: Open Instruct Mason `main()`
| Knowledge Sources | |
|---|---|
| Domains | MLOps, Distributed Training, Experiment Management, CLI Tooling |
| Last Updated | 2026-02-07 00:00 GMT |
## Overview

The CLI entrypoint of Open Instruct's Mason launcher: a concrete tool for submitting containerized training experiments to Beaker compute clusters.
## Description

The `main()` function in `mason.py` is the CLI entrypoint for the Mason experiment launcher. It orchestrates the complete lifecycle of submitting an Open Instruct training job to Beaker: parsing command-line arguments, detecting whether the user is internal (AI2) or external, optionally caching datasets locally, building internal commands with proper environment variables and accelerate configurations, constructing a Beaker experiment specification, and submitting the experiment to the Beaker platform with exponential backoff retry logic.
The function handles several complex workflows:

- User detection: Checks for Beaker credentials (`~/.beaker/config.yml` or `BEAKER_TOKEN`) to distinguish AI2-internal users from external users. External users receive the rendered command for manual execution.
- Command construction: Calls `make_internal_command()` for each command, which handles dataset caching, output directory rewriting, GCP model uploads, multi-node accelerate configuration, and proper shell escaping.
- Task specification: Calls `make_task_spec()` to build a `BeakerTaskSpec` with GPU resources, cluster constraints, environment variables (including secrets for `HF_TOKEN`, `WANDB_API_KEY`, etc.), data mounts (Weka or GCS), and multi-node settings.
- Experiment submission: Creates a `BeakerExperimentSpec` with budget, retry policy, and all task specs, then submits it with retry logic (up to 5 attempts with exponential backoff) to handle transient timeout errors.
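The submission step can be sketched as follows. The real code wraps the Beaker SDK call with the `backoff` package; this hand-rolled sketch shows the equivalent control flow, with `submit_fn` and `spec` as hypothetical stand-ins for the SDK call and experiment spec, and `TimeoutError` standing in for the requests timeout it catches:

```python
import time

def submit_with_retry(submit_fn, spec, max_tries=5, base_delay=1.0, sleep=time.sleep):
    # Retry submit_fn(spec) on transient timeouts with exponential backoff
    # (1s, 2s, 4s, ...), mirroring the "up to 5 attempts" policy above.
    for attempt in range(max_tries):
        try:
            return submit_fn(spec)
        except TimeoutError:
            if attempt == max_tries - 1:
                raise  # all attempts exhausted; surface the error
            sleep(base_delay * (2 ** attempt))
```

Injecting `sleep` keeps the sketch testable without real delays; the production path simply blocks between attempts.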
## Usage
Use this tool when you need to:
- Launch any Open Instruct training job (SFT, DPO, GRPO, reward modeling) on Beaker clusters.
- Submit multi-node distributed training experiments with automatic accelerate configuration.
- Run experiments that require dataset pre-caching, automatic output directory management, or GCP model staging.
- Execute custom Python commands on Beaker GPU infrastructure outside of Open Instruct training.
## Code Reference

### Source Location
- Repository: Open Instruct
- File: `mason.py`, lines 833-895
- Supporting functions: `get_args()` (L108-222), `make_internal_command()` (L426-737), `make_task_spec()` (L740-799)
### Signature
```python
def main():
    args, commands = get_args()
    # Detects AI2 vs external user based on Beaker credentials
    # For external users: prints rendered commands and returns
    # For internal users:
    #   1. Builds internal commands (caching, output dir, GCP staging)
    #   2. Constructs BeakerExperimentSpec with tasks, budget, retry
    #   3. Submits experiment with exponential backoff retry
```
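The internal/external branch hinges on the credential check described above. A minimal sketch, assuming only what the description states (the function name here is hypothetical; `mason.py` performs this check inline in `main()`):

```python
import os
from pathlib import Path

def is_ai2_internal_user() -> bool:
    # A user counts as AI2-internal if Beaker credentials are present,
    # either as a config file or as a BEAKER_TOKEN environment variable.
    config = Path.home() / ".beaker" / "config.yml"
    return config.exists() or bool(os.environ.get("BEAKER_TOKEN"))
```

External users (where this returns false) get the rendered command printed for manual execution rather than a Beaker submission.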
The `get_args()` function defines the full argument parser:

```python
def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--cluster", type=str, nargs="+", required=True,
                        help="Beaker clusters on which the job could be run.")
    parser.add_argument("--budget", type=str, required=True, help="Budget to use.")
    parser.add_argument("--gpus", type=int, default=0, help="Number of gpus")
    parser.add_argument("--num_nodes", type=int, default=1, help="Number of nodes")
    parser.add_argument("--image", type=str,
                        default="ai2/cuda11.8-cudnn8-dev-ubuntu20.04",
                        help="Beaker base image.")
    parser.add_argument("--workspace", type=str, default=None,
                        help="The Beaker workspace to use.")
    parser.add_argument("--priority", type=str, default="normal",
                        help="Beaker job priority.")
    parser.add_argument("--preemptible", action="store_true",
                        help="Run as preemptible")
    parser.add_argument("--max_retries", type=int, default=0,
                        help="Number of retries")
    parser.add_argument("--shared_memory", type=str, default="10.24gb")
    parser.add_argument("--description", type=str,
                        default="Beaker-Mason job.")
    parser.add_argument("--task_name", type=str, default="beaker_mason")
    parser.add_argument("--beaker_datasets", nargs="*",
                        type=parse_beaker_dataset, default=[])
    parser.add_argument("--pure_docker_mode", action="store_true")
    parser.add_argument("--non_resumable", action="store_true")
    parser.add_argument("--no_auto_dataset_cache", action="store_true")
    parser.add_argument("--auto_output_dir_path", type=str,
                        default="/weka/oe-adapt-default/allennlp/deletable_checkpoint")
    parser.add_argument("--env", type=parse_env_var, action="append", default=[])
    parser.add_argument("--secret", type=parse_env_var, action="append", default=[])
    parser.add_argument("--no-host-networking", action="store_true")
    parser.add_argument("--timeout", type=str, default=None)
    # ...
    mason_args, command_args = parser.parse_known_args()
    commands = parse_commands(command_args)
    return mason_args, commands
```
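The parser references two converter helpers. Their names match the calls in `get_args()` and their formats come from the I/O contract below (`name=value` for `--env`/`--secret`, `mount_path:beaker_id` for `--beaker_datasets`), but these implementations are sketches, not the actual mason.py code:

```python
def parse_env_var(s: str) -> dict:
    # Parses the "name=value" format documented for --env/--secret.
    name, sep, value = s.partition("=")
    if not sep or not name:
        raise ValueError(f"expected name=value, got {s!r}")
    return {"name": name, "value": value}

def parse_beaker_dataset(s: str) -> dict:
    # Parses the "mount_path:beaker_id" format documented for
    # --beaker_datasets; split on the last colon so mount paths are safe.
    mount_path, sep, beaker_id = s.rpartition(":")
    if not sep or not mount_path:
        raise ValueError(f"expected mount_path:beaker_id, got {s!r}")
    return {"mount_path": mount_path, "beaker": beaker_id}
```

Because these run as `argparse` `type=` callbacks, a malformed value fails at parse time with a clear error rather than deep inside job construction.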
### Import

```shell
# Run as CLI tool (primary usage)
python mason.py --cluster ai2/jupiter --budget ai2/oe-adapt --gpus 8 \
    -- python open_instruct/grpo_fast.py [training_args...]

# Or via the wrapper script, which builds a Docker image first
./scripts/train/build_image_and_launch.sh scripts/train/debug/single_gpu_on_beaker.sh
```
## I/O Contract

### Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| `--cluster` | str (nargs=+) | Yes | Target Beaker cluster(s) for scheduling (e.g., `ai2/jupiter`, `ai2/saturn`). |
| `--budget` | str | Yes | Organizational budget account for billing (e.g., `ai2/oe-adapt`). |
| `--gpus` | int | No (default: 0) | Number of GPUs per node to allocate. |
| `--num_nodes` | int | No (default: 1) | Number of nodes for distributed training. Values > 1 enable multi-node mode with leader election. |
| `--image` | str | No (default: `ai2/cuda11.8-cudnn8-dev-ubuntu20.04`) | Beaker base Docker image to use for the experiment. |
| `--workspace` | str | No (default: user default) | Beaker workspace for experiment organization (e.g., `ai2/tulu-3-dev`). |
| `--priority` | str | No (default: normal) | Job priority: low, normal, high, or urgent. |
| `--preemptible` | flag | No (default: false) | Whether the job can be preempted by higher-priority jobs. |
| `--max_retries` | int | No (default: 0) | Maximum number of automatic retries for failed tasks. |
| `--shared_memory` | str | No (default: 10.24gb) | Shared memory allocation for the container. |
| `--description` | str | No (default: "Beaker-Mason job.") | Human-readable description for the experiment. |
| `--beaker_datasets` | str (nargs=*) | No | Additional Beaker datasets to mount, formatted as `mount_path:beaker_id`. |
| `--env` | str (repeatable) | No | Additional environment variables in `name=value` format. |
| `--secret` | str (repeatable) | No | Additional secret environment variables in `name=value` format. |
| `--non_resumable` | flag | No | Disable automatic resumable mode for GRPO jobs. |
| `--no_auto_dataset_cache` | flag | No | Skip local dataset caching before submission. |
| `--pure_docker_mode` | flag | No | Run in pure Docker mode without mounting shared filesystems. |
| `--timeout` | str | No | Task timeout as a duration string (e.g., 15m, 1h, 2h30m). |
| `-- [command...]` | str | Yes | The training command, separated by the `--` delimiter (e.g., `-- python open_instruct/grpo_fast.py ...`). |
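The `-- [command...]` contract is worth making concrete: everything `argparse` leaves over from `parse_known_args()` is split on `--` delimiters into one or more commands. A minimal sketch of that splitting, assuming only the delimiter behavior documented here (the real `parse_commands` in mason.py may validate further):

```python
def parse_commands(command_args: list[str]) -> list[list[str]]:
    # Split leftover argv into commands on the "--" delimiter; each
    # delimiter starts a new command, so two "--" sections yield two jobs.
    if not command_args or command_args[0] != "--":
        raise ValueError("commands must start with the `--` delimiter")
    commands, current = [], []
    for tok in command_args:
        if tok == "--":
            if current:
                commands.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        commands.append(current)
    return commands
```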
### Outputs

| Name | Type | Description |
|---|---|---|
| Beaker Experiment | Beaker experiment object | A submitted experiment on the Beaker platform with a unique ID and tracking URL. |
| Console URL | str | Logged URL in the format `https://beaker.org/ex/{experiment_id}` for monitoring. |
| Cached Dataset | directory (side effect) | Preprocessed dataset stored on shared storage when auto-caching is enabled. |
| Docker Image | container image (side effect) | Built and pushed Docker image when using `build_image_and_launch.sh`. |
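The console URL follows directly from the format in the table above; a one-line helper (the function name is hypothetical, mason.py just logs the formatted string):

```python
def beaker_console_url(experiment_id: str) -> str:
    # Tracking URL logged after submission, per the Outputs table.
    return f"https://beaker.org/ex/{experiment_id}"
```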
## Usage Examples

### Basic Usage
```shell
# Single-node GRPO training on the Jupiter cluster (8 GPUs)
python mason.py \
    --cluster ai2/jupiter \
    --budget ai2/oe-adapt \
    --gpus 8 \
    --workspace ai2/tulu-3-dev \
    -- python open_instruct/grpo_fast.py \
        --model_name_or_path allenai/Llama-3.1-Tulu-3-8B \
        --dataset_mixer '{"allenai/tulu-3-sft-mixture": 1.0}' \
        --max_seq_length 2048 \
        --output_dir /output/
```
### Tulu 3 8B Example
```shell
# Multi-node GRPO training for Tulu 3 8B (2 nodes, 8 GPUs each)
python mason.py \
    --cluster ai2/jupiter ai2/saturn \
    --budget ai2/oe-adapt \
    --gpus 8 \
    --num_nodes 2 \
    --workspace ai2/tulu-3-dev \
    --priority normal \
    --max_retries 3 \
    -- accelerate launch --num_processes 8 \
        open_instruct/grpo_fast.py \
        --model_name_or_path allenai/Llama-3.1-Tulu-3-8B \
        --dataset_mixer '{"allenai/tulu-3-sft-mixture": 1.0}' \
        --max_seq_length 4096 \
        --per_device_train_batch_size 1 \
        --gradient_accumulation_steps 16 \
        --learning_rate 5e-7 \
        --with_tracking \
        --output_dir /output/tulu3-8b-grpo
```
### DPO Training Example
```shell
# Single-GPU DPO debug run via the wrapper script
./scripts/train/build_image_and_launch.sh scripts/train/debug/dpo/single_gpu.sh

# Or directly with mason (single node, 8 GPUs)
python mason.py \
    --cluster ai2/jupiter \
    --budget ai2/oe-adapt \
    --gpus 8 \
    --workspace ai2/tulu-3-dev \
    -- python open_instruct/dpo.py \
        --model_name_or_path allenai/Llama-3.1-Tulu-3-8B-SFT \
        --dataset_mixer '{"allenai/tulu-3-pref-mixture": 1.0}' \
        --max_seq_length 2048 \
        --output_dir /output/tulu3-8b-dpo
```
### External User Example
```shell
# External users (without Beaker credentials) will see the rendered command
# printed to the console for manual execution on their own infrastructure.
python mason.py \
    --cluster my-cluster \
    --budget my-budget \
    --gpus 4 \
    -- python open_instruct/finetune.py \
        --model_name_or_path meta-llama/Llama-3.1-8B \
        --output_dir /output/my-sft-run
```
## Dependencies
| Package | Purpose |
|---|---|
| `beaker` | Python SDK for the Beaker platform (experiment creation, secret management, dataset mounts) |
| `backoff` | Exponential backoff retry logic for experiment submission |
| `requests` | HTTP requests (used by backoff for timeout exception handling) |
| `rich` | Console output formatting with colored rules and styled text |
| `open_instruct.launch_utils` | Cluster constants (`WEKA_CLUSTERS`, `GCP_CLUSTERS`, `INTERCONNECT_CLUSTERS`), GCS operations, HF download utilities |
| `argparse` | CLI argument parsing |
| `subprocess` | Local dataset caching command execution |
| `docker` | Docker image building (via the `build_image_and_launch.sh` wrapper) |