Implementation: Allenai Open Instruct Submit Eval Jobs
| Knowledge Sources | |
|---|---|
| Domains | Model Evaluation, Benchmarking, MLOps, Batch Job Submission |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool for submitting parallel benchmark evaluation jobs for trained language models to the Beaker compute platform, provided by Open Instruct.
Description
The scripts/submit_eval_jobs.py script is a script-level CLI tool that automates the submission of evaluation experiments for a trained model checkpoint across a comprehensive suite of benchmarks. It reads a YAML-based default evaluation configuration (configs/beaker_configs/default_eval.yaml), creates individual task specifications for each selected benchmark, assembles them into a single Beaker experiment, and submits the experiment via the Beaker CLI.
The script handles the following key workflows:
- Benchmark configuration: For each of the 20+ supported experiment groups (mmlu_0shot, mmlu_5shot, gsm_direct, gsm_cot, MATH_cot, bbh_direct, bbh_cot, tydiqa_goldp_1shot, tydiqa_no_context_1shot, codex_eval_temp_0.1, codex_eval_temp_0.8, codex_evalplus_temp_0.1, codex_evalplus_temp_0.8, mbpp_evalplus_temp_0.1, mbpp_evalplus_temp_0.8, ifeval, truthfulqa, toxigen, xstest, alpaca_eval, alpaca_eval_2), it constructs the appropriate evaluation command with correct data directories, few-shot settings, and model paths.
- Model source resolution: Models can be loaded from HuggingFace Hub (prefixed with hf-), from Beaker datasets (mounted at /model), or from local paths on shared storage (Weka).
- Automatic resource scaling: Batch sizes are reduced and GPU counts are increased heuristically based on model size (13B, 30B, 65B, 70B parameter classes).
- Chat template selection: The chat formatting function is automatically chosen based on the model family (Tulu, OLMo, Llama 2, Zephyr, XWin) or overridden with the HuggingFace tokenizer template.
- OE-Eval integration: Optionally invokes AI2's OE-Eval framework for additional standardized benchmark suites via the scripts/eval/oe-eval.sh wrapper script.
- HuggingFace Hub upload: Selected evaluation results can be uploaded to a HuggingFace dataset for public sharing.
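The assembly workflow described above can be sketched roughly as follows. This is a minimal sketch, not the script's literal code: the template contents, task-naming scheme, and argument layout are illustrative assumptions.

```python
import copy

# Stand-in for configs/beaker_configs/default_eval.yaml; field names follow
# the Beaker experiment spec shape, but the values here are illustrative.
DEFAULT_TASK = {
    "name": "eval",
    "image": {"beaker": "oe-eval-beaker/oe_eval_auto"},
    "arguments": [],
    "resources": {"gpuCount": 1},
}

def build_experiment(model_name: str, experiments: list[str]) -> dict:
    """Assemble one Beaker experiment spec with one task per benchmark."""
    tasks = []
    for exp in experiments:
        # Each benchmark becomes an independent task that runs in parallel.
        task = copy.deepcopy(DEFAULT_TASK)
        task["name"] = f"open_instruct_eval_{exp}"
        task["arguments"] = ["--experiment", exp, "--model", model_name]
        tasks.append(task)
    return {"version": "v2", "tasks": tasks}

spec = build_experiment("tulu-3-8b", ["mmlu_5shot", "gsm_cot", "bbh_cot"])
print(len(spec["tasks"]))  # → 3
```

The real script then serializes a spec like this to configs/beaker_configs/auto_created/ and submits it with the beaker experiment create CLI command.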
Usage
Use this tool when you need to:
- Evaluate a model checkpoint across the standard Tulu 3 benchmark suite after training.
- Run a subset of benchmarks for quick iteration during development.
- Submit evaluations to both Open Instruct and OE-Eval frameworks simultaneously.
- Upload evaluation results to HuggingFace Hub for public reporting.
- Evaluate models of different sizes with automatic resource scaling.
Code Reference
Source Location
- Repository: Open Instruct
- File: scripts/submit_eval_jobs.py, lines 1-734
Signature
# Script-level execution (no function wrapper; runs at module level)
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--workspace", type=str, default="ai2/tulu-3-results")
parser.add_argument("--model_name", type=str, default="hf-opt-7B")
parser.add_argument("--hf_revision", type=str, default=None)
parser.add_argument("--location", type=str, default=None)
parser.add_argument("--beaker_image", type=str, default="oe-eval-beaker/oe_eval_auto")
parser.add_argument("--beaker_subfolder", type=str, default=None)
parser.add_argument("--cluster", nargs="+",
default=["ai2/jupiter", "ai2/saturn", "ai2/ceres", "ai2/neptune"])
parser.add_argument("--is_tuned", action="store_true")
parser.add_argument("--use_hf_tokenizer_template", action="store_true")
parser.add_argument("--priority", type=str, default="low")
parser.add_argument("--preemptible", action="store_true", default=False)
parser.add_argument("--olmo", action="store_true")
parser.add_argument("--experiments", type=str, nargs="+", default=None)
parser.add_argument("--batch_size_reduction", type=int, default=None)
parser.add_argument("--gpu_multiplier", type=int, default=None)
parser.add_argument("--upload_to_hf", type=str, default=None)
parser.add_argument("--hf_upload_experiments", type=str, nargs="*", default=None)
parser.add_argument("--run_oe_eval_experiments", action="store_true")
parser.add_argument("--skip_oi_evals", action="store_true")
parser.add_argument("--oe_eval_task_suite", type=str, default="OLMO_3")
parser.add_argument("--oe_eval_max_length", type=int, default=4096)
parser.add_argument("--evaluate_on_weka", action="store_true")
parser.add_argument("--step", type=int, default=None)
parser.add_argument("--run_id", type=str, default=None)
parser.add_argument("--wandb_run_path", type=str, default=None)
parser.add_argument("--process_output", type=str, default="r1_style")
args = parser.parse_args()
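The automatic resource scaling behind --batch_size_reduction and --gpu_multiplier can be approximated as below. The exact factors and substring checks in the real script may differ; treat these values as assumptions.

```python
def resource_overrides(model_name, batch_size_reduction=None, gpu_multiplier=None):
    """Fill in scaling factors from the parameter-class substring
    (13B/30B/65B/70B) when the user did not pass them explicitly.
    The factors 4/2/1 here are illustrative, not the script's literal values."""
    name = model_name.lower()
    large = any(size in name for size in ("30b", "65b", "70b"))
    if batch_size_reduction is None:
        batch_size_reduction = 4 if large else (2 if "13b" in name else 1)
    if gpu_multiplier is None:
        gpu_multiplier = 2 if large else 1
    return batch_size_reduction, gpu_multiplier

print(resource_overrides("hf-allenai/Llama-3.1-Tulu-3-70B"))  # → (4, 2)
```

Explicit CLI values always win: the heuristic only fires when the corresponding flag was left at its None default.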
Import
# Run as CLI script
python scripts/submit_eval_jobs.py \
--model_name hf-allenai/Llama-3.1-Tulu-3-8B \
--location allenai/Llama-3.1-Tulu-3-8B \
--is_tuned \
--workspace ai2/tulu-3-results
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --model_name | str | No (default: hf-opt-7B) | Name of the model to evaluate. Prefix with hf- for HuggingFace models. |
| --location | str | No | Model location: HuggingFace path, Beaker dataset ID, or local filesystem path. |
| --workspace | str | No (default: ai2/tulu-3-results) | Fully-qualified Beaker workspace in org/workspace format. |
| --cluster | str (nargs=+) | No (default: ai2/jupiter, ai2/saturn, ai2/ceres, ai2/neptune) | Target Beaker cluster(s) for evaluation jobs. |
| --experiments | str (nargs=+) | No (default: all 20+ benchmarks) | Specific experiments to run (e.g., mmlu_5shot gsm_cot bbh_cot). |
| --is_tuned | flag | No | Set when evaluating a tuned/instruction-following model (adds chat formatting). |
| --use_hf_tokenizer_template | flag | No | Override chat template with the model's HuggingFace tokenizer template. |
| --olmo | flag | No | Force OLMo chat format even if "olmo" is not in the model name. |
| --hf_revision | str | No | Specific HuggingFace model revision/commit to evaluate. |
| --beaker_subfolder | str | No | Checkpoint subfolder within the Beaker dataset (e.g., checkpoint-1000). |
| --beaker_image | str | No (default: oe-eval-beaker/oe_eval_auto) | Beaker image to use for evaluation jobs. |
| --priority | str | No (default: low) | Beaker job priority: low, normal, high, or urgent. |
| --preemptible | flag | No | Run evaluation jobs as preemptible. |
| --batch_size_reduction | int | No | Factor to reduce evaluation batch size (auto-detected from model size if omitted). |
| --gpu_multiplier | int | No | Factor to multiply GPU count (auto-detected from model size if omitted). |
| --upload_to_hf | str | No | HuggingFace dataset path for result upload in format hf_dataset//hf_path. |
| --hf_upload_experiments | str (nargs=*) | No | Which experiments to upload to HF (default: none for OI evals). |
| --run_oe_eval_experiments | flag | No | Additionally submit evaluations to AI2's OE-Eval framework. |
| --skip_oi_evals | flag | No | Skip Open Instruct native evaluations (use only OE-Eval). |
| --oe_eval_task_suite | str | No (default: OLMO_3) | OE-Eval task suite: OLMO_3, OLMO_3_UNSEEN, TULU_3_DEV, TULU_3_UNSEEN, SAFETY_EVAL, SAFETY_EVAL_REASONING. |
| --oe_eval_max_length | int | No (default: 4096) | Maximum sequence length for OE-Eval generation. |
| --evaluate_on_weka | flag | No | Run OE-Eval evaluations on Weka-enabled clusters. |
| --step | int | No | Step number for PostgreSQL logging integration. |
| --run_id | str | No | Unique run ID for PostgreSQL logging integration. |
| --wandb_run_path | str | No | W&B run path for PostgreSQL logging integration. |
| --process_output | str | No (default: r1_style) | Output processing mode for OE-Eval. |
| --add_stop_sequence | str (nargs=+) | No | Additional stop sequence(s) to use during generation (e.g., <\|eot_id\|> for Llama 3). |
| --gsm_stop_at_double_newline | flag | No | Stop GSM generation at the first double newline. |
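The three model-source options in the table (hf- prefix, Beaker dataset, Weka path) suggest a dispatch like the following. This is a sketch of the rules as documented above; the dispatch order and the ("huggingface", "weka", "beaker_dataset") labels are assumptions, not the script's actual identifiers.

```python
def resolve_model_source(model_name: str, location: str) -> tuple[str, str]:
    """Classify where a checkpoint is loaded from, per the I/O contract rules."""
    if model_name.startswith("hf-"):
        # HuggingFace Hub: location is a repo path such as org/model-name.
        return ("huggingface", location)
    if location.startswith("/"):
        # Absolute path on shared storage (Weka).
        return ("weka", location)
    # Otherwise treat location as a Beaker dataset ID, mounted at /model.
    return ("beaker_dataset", "/model")

print(resolve_model_source("my-exp", "01HQXGAYGCS6D4ZK51K83CM49Y"))
# → ('beaker_dataset', '/model')
```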
Outputs
| Name | Type | Description |
|---|---|---|
| Beaker Evaluation Experiment | Beaker experiment | A single Beaker experiment containing one task per selected benchmark, running in parallel. |
| Auto-created YAML Config | file | Experiment specification saved to configs/beaker_configs/auto_created/ for reproducibility. |
| OE-Eval Experiments | Beaker experiments (optional) | Additional evaluation experiments submitted via the OE-Eval framework when --run_oe_eval_experiments is set. |
| HuggingFace Upload | HF dataset entries (optional) | Evaluation results uploaded to HuggingFace Hub when --upload_to_hf is specified. |
| Evaluation Results | JSON files | Per-benchmark results stored in each task's /output/ directory on Beaker. |
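Since each task writes its results under its own /output/ directory, downstream aggregation amounts to a directory walk. A minimal sketch, assuming one metrics JSON per benchmark directory; the metrics.json file name is an assumption and should be matched to what each benchmark actually writes.

```python
import json
import pathlib

def collect_results(results_root: str) -> dict:
    """Gather per-benchmark metrics from each task's output directory.

    Expects a layout like results_root/<experiment_group>/metrics.json,
    which is an assumed convention, not the script's documented one."""
    results = {}
    for path in pathlib.Path(results_root).glob("*/metrics.json"):
        # Key results by the benchmark directory name, e.g. "mmlu_5shot".
        results[path.parent.name] = json.loads(path.read_text())
    return results
```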
Supported Benchmarks
| Experiment Group | Benchmark | Configuration | GPU Count |
|---|---|---|---|
| mmlu_0shot | MMLU | 0-shot, batch size 4 | 1 |
| mmlu_5shot | MMLU | 5-shot, batch size 4 | 1 |
| gsm_direct | GSM8K | 8-shot, no CoT, 200 examples, vLLM | 1 |
| gsm_cot | GSM8K | 8-shot, CoT, 200 examples, vLLM | 1 |
| MATH_cot | MATH | 4-shot, CoT, 200 examples, vLLM | 1 |
| bbh_direct | BIG-Bench Hard | 40 examples per task, no CoT, vLLM | 1 |
| bbh_cot | BIG-Bench Hard | 40 examples per task, CoT, vLLM | 1 |
| tydiqa_goldp_1shot | TyDiQA | 1-shot with gold passage, 100 per lang, vLLM | 1 |
| tydiqa_no_context_1shot | TyDiQA | 1-shot without context, 100 per lang, vLLM | 1 |
| codex_eval_temp_0.1 | HumanEval | Temperature 0.1, 20 samples, pass@{1,5,10,20}, vLLM | 1* |
| codex_eval_temp_0.8 | HumanEval | Temperature 0.8, 20 samples, pass@{1,5,10,20}, vLLM | 1* |
| codex_evalplus_temp_0.1 | HumanEval+ | Temperature 0.1, 20 samples, chat format, vLLM | 1* |
| codex_evalplus_temp_0.8 | HumanEval+ | Temperature 0.8, 20 samples, chat format, vLLM | 1* |
| mbpp_evalplus_temp_0.1 | MBPP+ | Temperature 0.1, 20 samples, chat format, vLLM | 1* |
| mbpp_evalplus_temp_0.8 | MBPP+ | Temperature 0.8, 20 samples, chat format, vLLM | 1* |
| ifeval | IFEval | Chat format, vLLM | 1 |
| truthfulqa | TruthfulQA | Truth + info + MC metrics, batch size 20 | 1 |
| toxigen | ToxiGen | Batch size 32, vLLM | 1 |
| xstest | XSTest | Batch size 32, vLLM | 1 |
| alpaca_eval | AlpacaEval | v1, chat format, vLLM | 1 |
| alpaca_eval_2 | AlpacaEval 2 | Length-controlled, chat format, vLLM | 1 |
* Codex/MBPP tasks may receive doubled GPU counts for larger models via the gpu_multiplier heuristic.
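The starred footnote can be read as roughly the following rule: the GPU multiplier applies only to the code-generation groups, while other benchmarks keep their base count. The prefix-matching shown here is an assumption about how those groups are identified.

```python
def scaled_gpu_count(experiment: str, base_gpus: int, gpu_multiplier: int) -> int:
    """Double (or otherwise multiply) GPUs only for codex/mbpp tasks,
    per the * footnote in the benchmarks table above."""
    if experiment.startswith(("codex", "mbpp")):
        return base_gpus * gpu_multiplier
    return base_gpus

print(scaled_gpu_count("codex_eval_temp_0.1", 1, 2))  # → 2
```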
Usage Examples
Basic Usage
# Evaluate a HuggingFace model on all default benchmarks
python scripts/submit_eval_jobs.py \
--model_name hf-allenai/Llama-3.1-Tulu-3-8B \
--location allenai/Llama-3.1-Tulu-3-8B \
--is_tuned \
--workspace ai2/tulu-3-results
Tulu 3 8B Example
# Full evaluation of Tulu 3 8B with OE-Eval and HuggingFace upload
python scripts/submit_eval_jobs.py \
--model_name hf-allenai/Llama-3.1-Tulu-3-8B \
--location allenai/Llama-3.1-Tulu-3-8B \
--is_tuned \
--use_hf_tokenizer_template \
--workspace ai2/tulu-3-results \
--cluster ai2/jupiter ai2/saturn ai2/ceres ai2/neptune \
--run_oe_eval_experiments \
--oe_eval_task_suite OLMO_3 \
--evaluate_on_weka \
--upload_to_hf "allenai/tulu-3-evals//results/Llama-3.1-Tulu-3-8B" \
--hf_upload_experiments alpaca_eval alpaca_eval_2
Quick Iteration Example
# Evaluate on only core benchmarks for fast feedback
python scripts/submit_eval_jobs.py \
--model_name hf-allenai/Llama-3.1-Tulu-3-8B \
--location allenai/Llama-3.1-Tulu-3-8B \
--is_tuned \
--experiments mmlu_5shot gsm_cot bbh_cot ifeval \
--workspace ai2/tulu-3-results
Large Model Example (70B)
# Evaluate a 70B model with automatic batch/GPU scaling
python scripts/submit_eval_jobs.py \
--model_name hf-allenai/Llama-3.1-Tulu-3-70B \
--location allenai/Llama-3.1-Tulu-3-70B \
--is_tuned \
--workspace ai2/tulu-3-results \
--priority normal
# Batch size automatically reduced by 4x, GPU count automatically doubled for large models
Beaker Dataset Model Example
# Evaluate a model stored as a Beaker dataset with a specific checkpoint subfolder
python scripts/submit_eval_jobs.py \
--model_name my-tulu3-experiment \
--location 01HQXGAYGCS6D4ZK51K83CM49Y \
--beaker_subfolder checkpoint-5000 \
--is_tuned \
--workspace ai2/tulu-3-results
OE-Eval Only Example
# Run only OE-Eval experiments, skip Open Instruct native evals
python scripts/submit_eval_jobs.py \
--model_name hf-allenai/Llama-3.1-Tulu-3-8B \
--location allenai/Llama-3.1-Tulu-3-8B \
--is_tuned \
--skip_oi_evals \
--run_oe_eval_experiments \
--oe_eval_task_suite SAFETY_EVAL \
--workspace ai2/tulu-3-results
Dependencies
| Package | Purpose |
|---|---|
| beaker (CLI) | Beaker experiment creation via the beaker experiment create command |
| yaml | Parsing and writing Beaker experiment YAML configurations |
| subprocess | Launching Beaker CLI commands and the OE-Eval wrapper script |
| argparse | CLI argument parsing |
| open_instruct.launch_utils | Cluster constants (WEKA_CLUSTERS) for storage mount decisions |
| open_instruct.utils | General utility functions |
| configs/beaker_configs/default_eval.yaml | Base evaluation task template with default image, datasets, and resource settings |
| scripts/eval/oe-eval.sh | Wrapper script for submitting OE-Eval framework evaluations |