Implementation: Allenai Open Instruct Submit Eval Jobs
| Knowledge Sources | |
|---|---|
| Domains | Model Evaluation, Benchmarking, MLOps, Batch Job Submission |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool for submitting parallel benchmark evaluation jobs for trained language models to the Beaker compute platform, provided by Open Instruct.
Description
The scripts/submit_eval_jobs.py script is a script-level CLI tool that automates the submission of evaluation experiments for a trained model checkpoint across a comprehensive suite of benchmarks. It reads a YAML-based default evaluation configuration (configs/beaker_configs/default_eval.yaml), creates individual task specifications for each selected benchmark, assembles them into a single Beaker experiment, and submits the experiment via the Beaker CLI.
The script handles the following key workflows:
- Benchmark configuration: For each of the 20+ supported experiment groups (mmlu_0shot, mmlu_5shot, gsm_direct, gsm_cot, MATH_cot, bbh_direct, bbh_cot, tydiqa_goldp_1shot, tydiqa_no_context_1shot, codex_eval_temp_0.1, codex_eval_temp_0.8, codex_evalplus_temp_0.1, codex_evalplus_temp_0.8, mbpp_evalplus_temp_0.1, mbpp_evalplus_temp_0.8, ifeval, truthfulqa, toxigen, xstest, alpaca_eval, alpaca_eval_2), it constructs the appropriate evaluation command with correct data directories, few-shot settings, and model paths.
- Model source resolution: Models can be loaded from HuggingFace Hub (prefixed with hf-), from Beaker datasets (mounted at /model), or from local paths on shared storage (Weka).
- Automatic resource scaling: Batch sizes are reduced and GPU counts are increased heuristically based on model size (13B, 30B, 65B, 70B parameter classes).
- Chat template selection: The chat formatting function is automatically chosen based on the model family (Tulu, OLMo, Llama 2, Zephyr, XWin) or overridden with the HuggingFace tokenizer template.
- OE-Eval integration: Optionally invokes AI2's OE-Eval framework for additional standardized benchmark suites via the scripts/eval/oe-eval.sh wrapper script.
- HuggingFace Hub upload: Selected evaluation results can be uploaded to a HuggingFace dataset for public sharing.
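The assembly workflow described above can be sketched roughly as follows. This is a minimal sketch, not the script's literal code: the template contents, task-naming scheme, and argument layout are illustrative assumptions.

```python
import copy

# Stand-in for configs/beaker_configs/default_eval.yaml; field names follow
# the Beaker experiment spec shape, but the values here are illustrative.
DEFAULT_TASK = {
    "name": "eval",
    "image": {"beaker": "oe-eval-beaker/oe_eval_auto"},
    "arguments": [],
    "resources": {"gpuCount": 1},
}

def build_experiment(model_name: str, experiments: list[str]) -> dict:
    """Assemble one Beaker experiment spec with one task per benchmark."""
    tasks = []
    for exp in experiments:
        # Each benchmark becomes an independent task that runs in parallel.
        task = copy.deepcopy(DEFAULT_TASK)
        task["name"] = f"open_instruct_eval_{exp}"
        task["arguments"] = ["--experiment", exp, "--model", model_name]
        tasks.append(task)
    return {"version": "v2", "tasks": tasks}

spec = build_experiment("tulu-3-8b", ["mmlu_5shot", "gsm_cot", "bbh_cot"])
print(len(spec["tasks"]))  # → 3
```

The real script then serializes a spec like this to configs/beaker_configs/auto_created/ and submits it with the beaker experiment create CLI command.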
Usage
Use this tool when you need to:
- Evaluate a model checkpoint across the standard Tulu 3 benchmark suite after training.
- Run a subset of benchmarks for quick iteration during development.
- Submit evaluations to both Open Instruct and OE-Eval frameworks simultaneously.
- Upload evaluation results to HuggingFace Hub for public reporting.
- Evaluate models of different sizes with automatic resource scaling.
Code Reference
Source Location
- Repository: Open Instruct
- File: scripts/submit_eval_jobs.py, lines 1-734
Signature
# Script-level execution (no function wrapper; runs at module level)
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--workspace", type=str, default="ai2/tulu-3-results")
parser.add_argument("--model_name", type=str, default="hf-opt-7B")
parser.add_argument("--hf_revision", type=str, default=None)
parser.add_argument("--location", type=str, default=None)
parser.add_argument("--beaker_image", type=str, default="oe-eval-beaker/oe_eval_auto")
parser.add_argument("--beaker_subfolder", type=str, default=None)
parser.add_argument("--cluster", nargs="+",
default=["ai2/jupiter", "ai2/saturn", "ai2/ceres", "ai2/neptune"])
parser.add_argument("--is_tuned", action="store_true")
parser.add_argument("--use_hf_tokenizer_template", action="store_true")
parser.add_argument("--priority", type=str, default="low")
parser.add_argument("--preemptible", action="store_true", default=False)
parser.add_argument("--olmo", action="store_true")
parser.add_argument("--experiments", type=str, nargs="+", default=None)
parser.add_argument("--batch_size_reduction", type=int, default=None)
parser.add_argument("--gpu_multiplier", type=int, default=None)
parser.add_argument("--upload_to_hf", type=str, default=None)
parser.add_argument("--hf_upload_experiments", type=str, nargs="*", default=None)
parser.add_argument("--run_oe_eval_experiments", action="store_true")
parser.add_argument("--skip_oi_evals", action="store_true")
parser.add_argument("--oe_eval_task_suite", type=str, default="OLMO_3")
parser.add_argument("--oe_eval_max_length", type=int, default=4096)
parser.add_argument("--evaluate_on_weka", action="store_true")
parser.add_argument("--step", type=int, default=None)
parser.add_argument("--run_id", type=str, default=None)
parser.add_argument("--wandb_run_path", type=str, default=None)
parser.add_argument("--process_output", type=str, default="r1_style")
args = parser.parse_args()
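The automatic resource scaling behind --batch_size_reduction and --gpu_multiplier can be approximated as below. The exact factors and substring checks in the real script may differ; treat these values as assumptions.

```python
def resource_overrides(model_name, batch_size_reduction=None, gpu_multiplier=None):
    """Fill in scaling factors from the parameter-class substring
    (13B/30B/65B/70B) when the user did not pass them explicitly.
    The factors 4/2/1 here are illustrative, not the script's literal values."""
    name = model_name.lower()
    large = any(size in name for size in ("30b", "65b", "70b"))
    if batch_size_reduction is None:
        batch_size_reduction = 4 if large else (2 if "13b" in name else 1)
    if gpu_multiplier is None:
        gpu_multiplier = 2 if large else 1
    return batch_size_reduction, gpu_multiplier

print(resource_overrides("hf-allenai/Llama-3.1-Tulu-3-70B"))  # → (4, 2)
```

Explicit CLI values always win: the heuristic only fires when the corresponding flag was left at its None default.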
Import
# Run as CLI script
python scripts/submit_eval_jobs.py \
--model_name hf-allenai/Llama-3.1-Tulu-3-8B \
--location allenai/Llama-3.1-Tulu-3-8B \
--is_tuned \
--workspace ai2/tulu-3-results
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --model_name | str | No (default: hf-opt-7B) | Name of the model to evaluate. Prefix with hf- for HuggingFace models. |
| --location | str | No | Model location: HuggingFace path, Beaker dataset ID, or local filesystem path. |
| --workspace | str | No (default: ai2/tulu-3-results) | Fully-qualified Beaker workspace in org/workspace format. |
| --cluster | str (nargs=+) | No (default: ai2/jupiter, ai2/saturn, ai2/ceres, ai2/neptune) | Target Beaker cluster(s) for evaluation jobs. |
| --experiments | str (nargs=+) | No (default: all 20+ benchmarks) | Specific experiments to run (e.g., mmlu_5shot gsm_cot bbh_cot). |
| --is_tuned | flag | No | Set when evaluating a tuned/instruction-following model (adds chat formatting). |
| --use_hf_tokenizer_template | flag | No | Override chat template with the model's HuggingFace tokenizer template. |
| --olmo | flag | No | Force OLMo chat format even if "olmo" is not in the model name. |
| --hf_revision | str | No | Specific HuggingFace model revision/commit to evaluate. |
| --beaker_subfolder | str | No | Checkpoint subfolder within the Beaker dataset (e.g., checkpoint-1000). |
| --beaker_image | str | No (default: oe-eval-beaker/oe_eval_auto) | Beaker image to use for evaluation jobs. |
| --priority | str | No (default: low) | Beaker job priority: low, normal, high, or urgent. |
| --preemptible | flag | No | Run evaluation jobs as preemptible. |
| --batch_size_reduction | int | No | Factor to reduce evaluation batch size (auto-detected from model size if omitted). |
| --gpu_multiplier | int | No | Factor to multiply GPU count (auto-detected from model size if omitted). |
| --upload_to_hf | str | No | HuggingFace dataset path for result upload in format hf_dataset//hf_path. |
| --hf_upload_experiments | str (nargs=*) | No | Which experiments to upload to HF (default: none for OI evals). |
| --run_oe_eval_experiments | flag | No | Additionally submit evaluations to AI2's OE-Eval framework. |
| --skip_oi_evals | flag | No | Skip Open Instruct native evaluations (use only OE-Eval). |
| --oe_eval_task_suite | str | No (default: OLMO_3) | OE-Eval task suite: OLMO_3, OLMO_3_UNSEEN, TULU_3_DEV, TULU_3_UNSEEN, SAFETY_EVAL, SAFETY_EVAL_REASONING. |
| --oe_eval_max_length | int | No (default: 4096) | Maximum sequence length for OE-Eval generation. |
| --evaluate_on_weka | flag | No | Run OE-Eval evaluations on Weka-enabled clusters. |
| --step | int | No | Step number for PostgreSQL logging integration. |
| --run_id | str | No | Unique run ID for PostgreSQL logging integration. |
| --wandb_run_path | str | No | W&B run path for PostgreSQL logging integration. |
| --process_output | str | No (default: r1_style) | Output processing mode for OE-Eval. |
| --add_stop_sequence | str (nargs=+) | No | Additional stop sequence(s) to use during generation (e.g., <\|eot_id\|> for Llama 3). |
| --gsm_stop_at_double_newline | flag | No | Stop GSM generation at the first double newline. |
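The three model-source options in the table (hf- prefix, Beaker dataset, Weka path) suggest a dispatch like the following. This is a sketch of the rules as documented above; the dispatch order and the ("huggingface", "weka", "beaker_dataset") labels are assumptions, not the script's actual identifiers.

```python
def resolve_model_source(model_name: str, location: str) -> tuple[str, str]:
    """Classify where a checkpoint is loaded from, per the I/O contract rules."""
    if model_name.startswith("hf-"):
        # HuggingFace Hub: location is a repo path such as org/model-name.
        return ("huggingface", location)
    if location.startswith("/"):
        # Absolute path on shared storage (Weka).
        return ("weka", location)
    # Otherwise treat location as a Beaker dataset ID, mounted at /model.
    return ("beaker_dataset", "/model")

print(resolve_model_source("my-exp", "01HQXGAYGCS6D4ZK51K83CM49Y"))
# → ('beaker_dataset', '/model')
```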
Outputs
| Name | Type | Description |
|---|---|---|
| Beaker Evaluation Experiment | Beaker experiment | A single Beaker experiment containing one task per selected benchmark, running in parallel. |
| Auto-created YAML Config | file | Experiment specification saved to configs/beaker_configs/auto_created/ for reproducibility. |
| OE-Eval Experiments | Beaker experiments (optional) | Additional evaluation experiments submitted via the OE-Eval framework when --run_oe_eval_experiments is set. |
| HuggingFace Upload | HF dataset entries (optional) | Evaluation results uploaded to HuggingFace Hub when --upload_to_hf is specified. |
| Evaluation Results | JSON files | Per-benchmark results stored in each task's /output/ directory on Beaker. |
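Since each task writes its results under its own /output/ directory, downstream aggregation amounts to a directory walk. A minimal sketch, assuming one metrics JSON per benchmark directory; the metrics.json file name is an assumption and should be matched to what each benchmark actually writes.

```python
import json
import pathlib

def collect_results(results_root: str) -> dict:
    """Gather per-benchmark metrics from each task's output directory.

    Expects a layout like results_root/<experiment_group>/metrics.json,
    which is an assumed convention, not the script's documented one."""
    results = {}
    for path in pathlib.Path(results_root).glob("*/metrics.json"):
        # Key results by the benchmark directory name, e.g. "mmlu_5shot".
        results[path.parent.name] = json.loads(path.read_text())
    return results
```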
Supported Benchmarks
| Experiment Group | Benchmark | Configuration | GPU Count |
|---|---|---|---|
| mmlu_0shot | MMLU | 0-shot, batch size 4 | 1 |
| mmlu_5shot | MMLU | 5-shot, batch size 4 | 1 |
| gsm_direct | GSM8K | 8-shot, no CoT, 200 examples, vLLM | 1 |
| gsm_cot | GSM8K | 8-shot, CoT, 200 examples, vLLM | 1 |
| MATH_cot | MATH | 4-shot, CoT, 200 examples, vLLM | 1 |
| bbh_direct | BIG-Bench Hard | 40 examples per task, no CoT, vLLM | 1 |
| bbh_cot | BIG-Bench Hard | 40 examples per task, CoT, vLLM | 1 |
| tydiqa_goldp_1shot | TyDiQA | 1-shot with gold passage, 100 per lang, vLLM | 1 |
| tydiqa_no_context_1shot | TyDiQA | 1-shot without context, 100 per lang, vLLM | 1 |
| codex_eval_temp_0.1 | HumanEval | Temperature 0.1, 20 samples, pass@{1,5,10,20}, vLLM | 1* |
| codex_eval_temp_0.8 | HumanEval | Temperature 0.8, 20 samples, pass@{1,5,10,20}, vLLM | 1* |
| codex_evalplus_temp_0.1 | HumanEval+ | Temperature 0.1, 20 samples, chat format, vLLM | 1* |
| codex_evalplus_temp_0.8 | HumanEval+ | Temperature 0.8, 20 samples, chat format, vLLM | 1* |
| mbpp_evalplus_temp_0.1 | MBPP+ | Temperature 0.1, 20 samples, chat format, vLLM | 1* |
| mbpp_evalplus_temp_0.8 | MBPP+ | Temperature 0.8, 20 samples, chat format, vLLM | 1* |
| ifeval | IFEval | Chat format, vLLM | 1 |
| truthfulqa | TruthfulQA | Truth + info + MC metrics, batch size 20 | 1 |
| toxigen | ToxiGen | Batch size 32, vLLM | 1 |
| xstest | XSTest | Batch size 32, vLLM | 1 |
| alpaca_eval | AlpacaEval | v1, chat format, vLLM | 1 |
| alpaca_eval_2 | AlpacaEval 2 | Length-controlled, chat format, vLLM | 1 |
* Codex/MBPP tasks may receive doubled GPU counts for larger models via the gpu_multiplier heuristic.
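The starred footnote can be read as roughly the following rule: the GPU multiplier applies only to the code-generation groups, while other benchmarks keep their base count. The prefix-matching shown here is an assumption about how those groups are identified.

```python
def scaled_gpu_count(experiment: str, base_gpus: int, gpu_multiplier: int) -> int:
    """Double (or otherwise multiply) GPUs only for codex/mbpp tasks,
    per the * footnote in the benchmarks table above."""
    if experiment.startswith(("codex", "mbpp")):
        return base_gpus * gpu_multiplier
    return base_gpus

print(scaled_gpu_count("codex_eval_temp_0.1", 1, 2))  # → 2
```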
Usage Examples
Basic Usage
# Evaluate a HuggingFace model on all default benchmarks
python scripts/submit_eval_jobs.py \
--model_name hf-allenai/Llama-3.1-Tulu-3-8B \
--location allenai/Llama-3.1-Tulu-3-8B \
--is_tuned \
--workspace ai2/tulu-3-results
Tulu 3 8B Example
# Full evaluation of Tulu 3 8B with OE-Eval and HuggingFace upload
python scripts/submit_eval_jobs.py \
--model_name hf-allenai/Llama-3.1-Tulu-3-8B \
--location allenai/Llama-3.1-Tulu-3-8B \
--is_tuned \
--use_hf_tokenizer_template \
--workspace ai2/tulu-3-results \
--cluster ai2/jupiter ai2/saturn ai2/ceres ai2/neptune \
--run_oe_eval_experiments \
--oe_eval_task_suite OLMO_3 \
--evaluate_on_weka \
--upload_to_hf "allenai/tulu-3-evals//results/Llama-3.1-Tulu-3-8B" \
--hf_upload_experiments alpaca_eval alpaca_eval_2
Quick Iteration Example
# Evaluate on only core benchmarks for fast feedback
python scripts/submit_eval_jobs.py \
--model_name hf-allenai/Llama-3.1-Tulu-3-8B \
--location allenai/Llama-3.1-Tulu-3-8B \
--is_tuned \
--experiments mmlu_5shot gsm_cot bbh_cot ifeval \
--workspace ai2/tulu-3-results
Large Model Example (70B)
# Evaluate a 70B model with automatic batch/GPU scaling
python scripts/submit_eval_jobs.py \
--model_name hf-allenai/Llama-3.1-Tulu-3-70B \
--location allenai/Llama-3.1-Tulu-3-70B \
--is_tuned \
--workspace ai2/tulu-3-results \
--priority normal
# Batch size automatically reduced by 4x, GPU count automatically doubled for large models
Beaker Dataset Model Example
# Evaluate a model stored as a Beaker dataset with a specific checkpoint subfolder
python scripts/submit_eval_jobs.py \
--model_name my-tulu3-experiment \
--location 01HQXGAYGCS6D4ZK51K83CM49Y \
--beaker_subfolder checkpoint-5000 \
--is_tuned \
--workspace ai2/tulu-3-results
OE-Eval Only Example
# Run only OE-Eval experiments, skip Open Instruct native evals
python scripts/submit_eval_jobs.py \
--model_name hf-allenai/Llama-3.1-Tulu-3-8B \
--location allenai/Llama-3.1-Tulu-3-8B \
--is_tuned \
--skip_oi_evals \
--run_oe_eval_experiments \
--oe_eval_task_suite SAFETY_EVAL \
--workspace ai2/tulu-3-results
Dependencies
| Package | Purpose |
|---|---|
| beaker (CLI) | Beaker experiment creation via the beaker experiment create command |
| yaml | Parsing and writing Beaker experiment YAML configurations |
| subprocess | Launching Beaker CLI commands and the OE-Eval wrapper script |
| argparse | CLI argument parsing |
| open_instruct.launch_utils | Cluster constants (WEKA_CLUSTERS) for storage mount decisions |
| open_instruct.utils | General utility functions |
| configs/beaker_configs/default_eval.yaml | Base evaluation task template with default image, datasets, and resource settings |
| scripts/eval/oe-eval.sh | Wrapper script for submitting OE-Eval framework evaluations |