
Implementation:Allenai Open instruct Mason Main

From Leeroopedia


Knowledge Sources
Domains MLOps, Distributed Training, Experiment Management, CLI Tooling
Last Updated 2026-02-07 00:00 GMT

Overview

A concrete tool from Open Instruct's Mason CLI for launching containerized training experiments on Beaker compute clusters.

Description

The main() function in mason.py is the CLI entrypoint for the Mason experiment launcher. It orchestrates the complete lifecycle of submitting an Open Instruct training job to Beaker: parsing command-line arguments, detecting whether the user is internal (AI2) or external, optionally caching datasets locally, building internal commands with proper environment variables and accelerate configurations, constructing a Beaker experiment specification, and submitting the experiment to the Beaker platform with exponential backoff retry logic.

The function handles several complex workflows:

  • User detection: Checks for Beaker credentials (~/.beaker/config.yml or BEAKER_TOKEN) to distinguish AI2 internal users from external users. External users receive the rendered command for manual execution.
  • Command construction: Calls make_internal_command() for each command, which handles dataset caching, output directory rewriting, GCP model uploads, multi-node accelerate configuration, and proper shell escaping.
  • Task specification: Calls make_task_spec() to build a BeakerTaskSpec with GPU resources, cluster constraints, environment variables (including secrets for HF_TOKEN, WANDB_API_KEY, etc.), data mounts (Weka or GCS), and multi-node settings.
  • Experiment submission: Creates a BeakerExperimentSpec with budget, retry policy, and all task specs, then submits it with retry logic (up to 5 attempts with exponential backoff) to handle transient timeout errors.
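The retry pattern in the last bullet can be sketched as follows. mason.py uses the backoff package for this; the hand-rolled, stdlib-only version below (function and argument names are illustrative, not taken from mason.py) shows the same exponential-backoff behavior:

```python
import time

def submit_with_retry(submit_fn, spec, max_tries=5, base_delay=1.0):
    """Retry submit_fn on transient timeouts with exponential backoff:
    wait base_delay, 2*base_delay, 4*base_delay, ... between attempts."""
    for attempt in range(max_tries):
        try:
            return submit_fn(spec)
        except TimeoutError:
            if attempt == max_tries - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))
```

In the real launcher, the submit_fn role is played by the Beaker client's experiment-creation call, and the retried exception is the HTTP timeout raised by requests.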

Usage

Use this tool when you need to:

  • Launch any Open Instruct training job (SFT, DPO, GRPO, reward modeling) on Beaker clusters.
  • Submit multi-node distributed training experiments with automatic accelerate configuration.
  • Run experiments that require dataset pre-caching, automatic output directory management, or GCP model staging.
  • Execute custom Python commands on Beaker GPU infrastructure outside of Open Instruct training.

Code Reference

Source Location

  • Repository: Open Instruct
  • File: mason.py, lines 833-895
  • Supporting functions: get_args() (L108-222), make_internal_command() (L426-737), make_task_spec() (L740-799)

Signature

def main():
    args, commands = get_args()
    # Detects AI2 vs external user based on Beaker credentials
    # For external users: prints rendered commands and returns
    # For internal users:
    #   1. Builds internal commands (caching, output dir, GCP staging)
    #   2. Constructs BeakerExperimentSpec with tasks, budget, retry
    #   3. Submits experiment with exponential backoff retry
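The AI2-vs-external detection mentioned in the comments above amounts to a credential check. A minimal sketch (illustrative, not mason.py verbatim):

```python
import os
from pathlib import Path

def is_internal_user() -> bool:
    """Treat the user as AI2-internal if Beaker credentials are present:
    either the config file ~/.beaker/config.yml or a BEAKER_TOKEN env var."""
    config = Path.home() / ".beaker" / "config.yml"
    return config.exists() or bool(os.environ.get("BEAKER_TOKEN"))
```

External users (for whom this returns False) get the rendered command printed for manual execution instead of a Beaker submission.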

The get_args() function defines the full argument parser:

def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--cluster", type=str, nargs="+", required=True,
                        help="Beaker clusters on which the job could be run.")
    parser.add_argument("--budget", type=str, required=True, help="Budget to use.")
    parser.add_argument("--gpus", type=int, default=0, help="Number of gpus")
    parser.add_argument("--num_nodes", type=int, default=1, help="Number of nodes")
    parser.add_argument("--image", type=str,
                        default="ai2/cuda11.8-cudnn8-dev-ubuntu20.04",
                        help="Beaker base image.")
    parser.add_argument("--workspace", type=str, default=None,
                        help="The Beaker workspace to use.")
    parser.add_argument("--priority", type=str, default="normal",
                        help="Beaker job priority.")
    parser.add_argument("--preemptible", action="store_true",
                        help="Run as preemptible")
    parser.add_argument("--max_retries", type=int, default=0,
                        help="Number of retries")
    parser.add_argument("--shared_memory", type=str, default="10.24gb")
    parser.add_argument("--description", type=str,
                        default="Beaker-Mason job.")
    parser.add_argument("--task_name", type=str, default="beaker_mason")
    parser.add_argument("--beaker_datasets", nargs="*",
                        type=parse_beaker_dataset, default=[])
    parser.add_argument("--pure_docker_mode", action="store_true")
    parser.add_argument("--non_resumable", action="store_true")
    parser.add_argument("--no_auto_dataset_cache", action="store_true")
    parser.add_argument("--auto_output_dir_path", type=str,
                        default="/weka/oe-adapt-default/allennlp/deletable_checkpoint")
    parser.add_argument("--env", type=parse_env_var, action="append", default=[])
    parser.add_argument("--secret", type=parse_env_var, action="append", default=[])
    parser.add_argument("--no-host-networking", action="store_true")
    parser.add_argument("--timeout", type=str, default=None)
    # ...
    mason_args, command_args = parser.parse_known_args()
    commands = parse_commands(command_args)
    return mason_args, commands
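The parse_commands() call above splits the leftover arguments on the -- delimiter, producing one command per -- group. A minimal sketch of that behavior (illustrative; the exact implementation in mason.py may differ):

```python
def parse_commands(command_args):
    """Split `--`-delimited leftover args into commands: each `--` starts
    a new command, and every token until the next `--` belongs to it."""
    if not command_args or command_args[0] != "--":
        raise ValueError("remaining args must start with `--`")
    commands, current = [], []
    for tok in command_args:
        if tok == "--":
            if current:
                commands.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        commands.append(current)
    return commands
```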

Import

# Run as CLI tool (primary usage)
python mason.py --cluster ai2/jupiter --budget ai2/oe-adapt --gpus 8 \
    -- python open_instruct/grpo_fast.py [training_args...]

# Or via the wrapper script which builds a Docker image first
./scripts/train/build_image_and_launch.sh scripts/train/debug/single_gpu_on_beaker.sh

I/O Contract

Inputs

  • --cluster (str, nargs=+; required): Target Beaker cluster(s) for scheduling (e.g., ai2/jupiter, ai2/saturn).
  • --budget (str; required): Organizational budget account for billing (e.g., ai2/oe-adapt).
  • --gpus (int; default: 0): Number of GPUs per node to allocate.
  • --num_nodes (int; default: 1): Number of nodes for distributed training. Values > 1 enable multi-node mode with leader election.
  • --image (str; default: ai2/cuda11.8-cudnn8-dev-ubuntu20.04): Beaker base Docker image to use for the experiment.
  • --workspace (str; default: user default): Beaker workspace for experiment organization (e.g., ai2/tulu-3-dev).
  • --priority (str; default: normal): Job priority: low, normal, high, or urgent.
  • --preemptible (flag; default: false): Whether the job can be preempted by higher-priority jobs.
  • --max_retries (int; default: 0): Maximum number of automatic retries for failed tasks.
  • --shared_memory (str; default: 10.24gb): Shared memory allocation for the container.
  • --description (str; default: "Beaker-Mason job."): Human-readable description for the experiment.
  • --beaker_datasets (str, nargs=*): Additional Beaker datasets to mount, formatted as mount_path:beaker_id.
  • --env (str, repeatable): Additional environment variables in name=value format.
  • --secret (str, repeatable): Additional secret environment variables in name=value format.
  • --non_resumable (flag): Disable automatic resumable mode for GRPO jobs.
  • --no_auto_dataset_cache (flag): Skip local dataset caching before submission.
  • --pure_docker_mode (flag): Run in pure Docker mode without mounting the shared filesystem.
  • --timeout (str): Task timeout as a duration string (e.g., 15m, 1h, 2h30m).
  • -- [command...] (str; required): The training command, separated from Mason's own flags by the -- delimiter (e.g., -- python open_instruct/grpo_fast.py ...).
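--timeout accepts Go-style duration strings such as 2h30m. The hypothetical helper below shows how such a string maps to seconds; mason.py itself passes the string through to Beaker, so this parser is purely illustrative:

```python
import re

def duration_to_seconds(s: str) -> int:
    """Convert a duration string like '15m', '1h', or '2h30m' to seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    parts = re.findall(r"(\d+)([hms])", s)  # e.g. '2h30m' -> [('2','h'), ('30','m')]
    if not parts:
        raise ValueError(f"unrecognized duration: {s!r}")
    return sum(int(value) * units[unit] for value, unit in parts)
```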

Outputs

  • Beaker Experiment (Beaker experiment object): A submitted experiment on the Beaker platform with a unique ID and tracking URL.
  • Console URL (str): Logged URL in the format https://beaker.org/ex/{experiment_id} for monitoring.
  • Cached Dataset (directory; side effect): Preprocessed dataset stored on shared storage if auto-caching is enabled.
  • Docker Image (container image; side effect): Built and pushed Docker image when using build_image_and_launch.sh.

Usage Examples

Basic Usage

# Single-node GRPO training (8 GPUs) on the Jupiter cluster
python mason.py \
    --cluster ai2/jupiter \
    --budget ai2/oe-adapt \
    --gpus 8 \
    --workspace ai2/tulu-3-dev \
    -- python open_instruct/grpo_fast.py \
        --model_name_or_path allenai/Llama-3.1-Tulu-3-8B \
        --dataset_mixer '{"allenai/tulu-3-sft-mixture": 1.0}' \
        --max_seq_length 2048 \
        --output_dir /output/

Tulu3 8B Example

# Multi-node GRPO training for Tulu 3 8B (2 nodes, 8 GPUs each)
python mason.py \
    --cluster ai2/jupiter ai2/saturn \
    --budget ai2/oe-adapt \
    --gpus 8 \
    --num_nodes 2 \
    --workspace ai2/tulu-3-dev \
    --priority normal \
    --max_retries 3 \
    -- accelerate launch --num_processes 8 \
        open_instruct/grpo_fast.py \
        --model_name_or_path allenai/Llama-3.1-Tulu-3-8B \
        --dataset_mixer '{"allenai/tulu-3-sft-mixture": 1.0}' \
        --max_seq_length 4096 \
        --per_device_train_batch_size 1 \
        --gradient_accumulation_steps 16 \
        --learning_rate 5e-7 \
        --with_tracking \
        --output_dir /output/tulu3-8b-grpo

DPO Training Example

# Single-GPU DPO training via the wrapper script
./scripts/train/build_image_and_launch.sh scripts/train/debug/dpo/single_gpu.sh

# Or directly with mason
python mason.py \
    --cluster ai2/jupiter \
    --budget ai2/oe-adapt \
    --gpus 8 \
    --workspace ai2/tulu-3-dev \
    -- python open_instruct/dpo.py \
        --model_name_or_path allenai/Llama-3.1-Tulu-3-8B-SFT \
        --dataset_mixer '{"allenai/tulu-3-pref-mixture": 1.0}' \
        --max_seq_length 2048 \
        --output_dir /output/tulu3-8b-dpo

External User Example

# External users (without Beaker credentials) will see the rendered command
# printed to console for manual execution on their own infrastructure.
python mason.py \
    --cluster my-cluster \
    --budget my-budget \
    --gpus 4 \
    -- python open_instruct/finetune.py \
        --model_name_or_path meta-llama/Llama-3.1-8B \
        --output_dir /output/my-sft-run

Dependencies

  • beaker: Python SDK for the Beaker platform (experiment creation, secret management, dataset mounts).
  • backoff: Exponential backoff retry logic for experiment submission.
  • requests: HTTP requests (backoff retries on its timeout exception).
  • rich: Console output formatting with colored rules and styled text.
  • open_instruct.launch_utils: Cluster constants (WEKA_CLUSTERS, GCP_CLUSTERS, INTERCONNECT_CLUSTERS), GCS operations, HF download utilities.
  • argparse: CLI argument parsing.
  • subprocess: Local dataset caching command execution.
  • docker: Docker image building (via the build_image_and_launch.sh wrapper).
