
Implementation:Allenai Open instruct Submit Eval Jobs

From Leeroopedia


Knowledge Sources

  • Domains: Model Evaluation, Benchmarking, MLOps, Batch Job Submission
  • Last Updated: 2026-02-07 00:00 GMT

Overview

A concrete tool from Open Instruct for submitting parallel benchmark evaluation jobs for trained language models to the Beaker compute platform.

Description

The scripts/submit_eval_jobs.py script is a command-line tool that automates the submission of evaluation experiments for a trained model checkpoint across a comprehensive suite of benchmarks. It reads a YAML-based default evaluation configuration (configs/beaker_configs/default_eval.yaml), creates an individual task specification for each selected benchmark, assembles the tasks into a single Beaker experiment, and submits that experiment via the Beaker CLI.

The script handles the following key workflows:

  • Benchmark configuration: For each of the 20+ supported experiment groups (mmlu_0shot, mmlu_5shot, gsm_direct, gsm_cot, MATH_cot, bbh_direct, bbh_cot, tydiqa_goldp_1shot, tydiqa_no_context_1shot, codex_eval_temp_0.1, codex_eval_temp_0.8, codex_evalplus_temp_0.1, codex_evalplus_temp_0.8, mbpp_evalplus_temp_0.1, mbpp_evalplus_temp_0.8, ifeval, truthfulqa, toxigen, xstest, alpaca_eval, alpaca_eval_2), it constructs the appropriate evaluation command with correct data directories, few-shot settings, and model paths.
  • Model source resolution: Models can be loaded from HuggingFace Hub (prefixed with hf-), from Beaker datasets (mounted at /model), or from local paths on shared storage (Weka).
  • Automatic resource scaling: Batch sizes are reduced and GPU counts are increased heuristically based on model size (13B, 30B, 65B, 70B parameter classes).
  • Chat template selection: The chat formatting function is automatically chosen based on the model family (Tulu, OLMo, Llama 2, Zephyr, XWin) or overridden with the HuggingFace tokenizer template.
  • OE-Eval integration: Optionally invokes AI2's OE-Eval framework for additional standardized benchmark suites via the scripts/eval/oe-eval.sh wrapper script.
  • HuggingFace Hub upload: Selected evaluation results can be uploaded to a HuggingFace dataset for public sharing.
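The assembly flow described above can be sketched in simplified form. Everything here (the function name, the task-dict shape, the argument layout) is illustrative, not the script's actual API:

```python
import copy

# Hypothetical sketch of the submission flow: each selected experiment group
# becomes one task in a single Beaker experiment spec, cloned from the
# default_eval.yaml template and specialized per benchmark.

def build_experiment(default_task: dict, experiments: list[str], model_path: str) -> dict:
    """Assemble one experiment dict with a task per benchmark."""
    tasks = []
    for group in experiments:
        task = copy.deepcopy(default_task)
        task["name"] = f"eval-{group}"
        # In the real script, the command is specialized per group
        # (few-shot settings, data dirs, chat format flags, etc.).
        task["arguments"] = ["--model_name_or_path", model_path, "--eval_group", group]
        tasks.append(task)
    return {"version": "v2", "tasks": tasks}

spec = build_experiment(
    {"image": {"beaker": "oe-eval-beaker/oe_eval_auto"}},
    ["mmlu_5shot", "gsm_cot"],
    "/model",
)
```

The resulting spec would then be serialized to YAML and handed to the Beaker CLI as one experiment whose tasks run in parallel.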

Usage

Use this tool when you need to:

  • Evaluate a model checkpoint across the standard Tulu 3 benchmark suite after training.
  • Run a subset of benchmarks for quick iteration during development.
  • Submit evaluations to both Open Instruct and OE-Eval frameworks simultaneously.
  • Upload evaluation results to HuggingFace Hub for public reporting.
  • Evaluate models of different sizes with automatic resource scaling.

Code Reference

Source Location

  • Repository: Open Instruct
  • File: scripts/submit_eval_jobs.py, lines 1-734

Signature

# Script-level execution (no function wrapper; runs at module level)
parser = argparse.ArgumentParser()
parser.add_argument("--workspace", type=str, default="ai2/tulu-3-results")
parser.add_argument("--model_name", type=str, default="hf-opt-7B")
parser.add_argument("--hf_revision", type=str, default=None)
parser.add_argument("--location", type=str, default=None)
parser.add_argument("--beaker_image", type=str, default="oe-eval-beaker/oe_eval_auto")
parser.add_argument("--beaker_subfolder", type=str, default=None)
parser.add_argument("--cluster", nargs="+",
                    default=["ai2/jupiter", "ai2/saturn", "ai2/ceres", "ai2/neptune"])
parser.add_argument("--is_tuned", action="store_true")
parser.add_argument("--use_hf_tokenizer_template", action="store_true")
parser.add_argument("--priority", type=str, default="low")
parser.add_argument("--preemptible", action="store_true", default=False)
parser.add_argument("--olmo", action="store_true")
parser.add_argument("--experiments", type=str, nargs="+", default=None)
parser.add_argument("--batch_size_reduction", type=int, default=None)
parser.add_argument("--gpu_multiplier", type=int, default=None)
parser.add_argument("--upload_to_hf", type=str, default=None)
parser.add_argument("--hf_upload_experiments", type=str, nargs="*", default=None)
parser.add_argument("--run_oe_eval_experiments", action="store_true")
parser.add_argument("--skip_oi_evals", action="store_true")
parser.add_argument("--oe_eval_task_suite", type=str, default="OLMO_3")
parser.add_argument("--oe_eval_max_length", type=int, default=4096)
parser.add_argument("--evaluate_on_weka", action="store_true")
parser.add_argument("--step", type=int, default=None)
parser.add_argument("--run_id", type=str, default=None)
parser.add_argument("--wandb_run_path", type=str, default=None)
parser.add_argument("--process_output", type=str, default="r1_style")
args = parser.parse_args()
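The automatic resource scaling driven by `--batch_size_reduction` and `--gpu_multiplier` defaults behaves roughly as follows. The thresholds and factors below are a hedged sketch inferred from the 13B/30B/65B/70B parameter classes mentioned in the overview; the exact values in the script may differ:

```python
import re

# Illustrative size-based scaling heuristic: larger parameter classes get a
# smaller batch size and more GPUs. Not the script's exact logic.

def scale_resources(model_name: str, base_batch_size: int = 4, base_gpus: int = 1):
    match = re.search(r"(\d+)B", model_name, re.IGNORECASE)
    size = int(match.group(1)) if match else 0
    if size >= 65:   # 65B/70B class: heavy reduction, more GPUs
        return max(base_batch_size // 4, 1), base_gpus * 2
    if size >= 13:   # 13B/30B class: moderate reduction, more GPUs
        return max(base_batch_size // 2, 1), base_gpus * 2
    return base_batch_size, base_gpus
```

Passing `--batch_size_reduction` or `--gpu_multiplier` explicitly overrides this kind of auto-detection.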

Import

# This tool is run as a CLI script; there is no importable entry point.
python scripts/submit_eval_jobs.py \
    --model_name hf-allenai/Llama-3.1-Tulu-3-8B \
    --location allenai/Llama-3.1-Tulu-3-8B \
    --is_tuned \
    --workspace ai2/tulu-3-results

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| --model_name | str | No (default: hf-opt-7B) | Name of the model to evaluate. Prefix with hf- for HuggingFace models. |
| --location | str | No | Model location: HuggingFace path, Beaker dataset ID, or local filesystem path. |
| --workspace | str | No (default: ai2/tulu-3-results) | Fully-qualified Beaker workspace in org/workspace format. |
| --cluster | str (nargs=+) | No (default: ai2/jupiter, ai2/saturn, ai2/ceres, ai2/neptune) | Target Beaker cluster(s) for evaluation jobs. |
| --experiments | str (nargs=+) | No (default: all 20+ benchmarks) | Specific experiments to run (e.g., mmlu_5shot gsm_cot bbh_cot). |
| --is_tuned | flag | No | Set when evaluating a tuned/instruction-following model (adds chat formatting). |
| --use_hf_tokenizer_template | flag | No | Override chat template with the model's HuggingFace tokenizer template. |
| --olmo | flag | No | Force OLMo chat format even if "olmo" is not in the model name. |
| --hf_revision | str | No | Specific HuggingFace model revision/commit to evaluate. |
| --beaker_subfolder | str | No | Checkpoint subfolder within the Beaker dataset (e.g., checkpoint-1000). |
| --beaker_image | str | No (default: oe-eval-beaker/oe_eval_auto) | Beaker image to use for evaluation jobs. |
| --priority | str | No (default: low) | Beaker job priority: low, normal, high, or urgent. |
| --preemptible | flag | No | Run evaluation jobs as preemptible. |
| --batch_size_reduction | int | No | Factor to reduce evaluation batch size (auto-detected from model size if omitted). |
| --gpu_multiplier | int | No | Factor to multiply GPU count (auto-detected from model size if omitted). |
| --upload_to_hf | str | No | HuggingFace dataset path for result upload in hf_dataset//hf_path format. |
| --hf_upload_experiments | str (nargs=*) | No | Which experiments to upload to HF (default: none for OI evals). |
| --run_oe_eval_experiments | flag | No | Additionally submit evaluations to AI2's OE-Eval framework. |
| --skip_oi_evals | flag | No | Skip Open Instruct native evaluations (use only OE-Eval). |
| --oe_eval_task_suite | str | No (default: OLMO_3) | OE-Eval task suite: OLMO_3, OLMO_3_UNSEEN, TULU_3_DEV, TULU_3_UNSEEN, SAFETY_EVAL, SAFETY_EVAL_REASONING. |
| --oe_eval_max_length | int | No (default: 4096) | Maximum sequence length for OE-Eval generation. |
| --evaluate_on_weka | flag | No | Run OE-Eval evaluations on Weka-enabled clusters. |
| --step | int | No | Step number for PostgreSQL logging integration. |
| --run_id | str | No | Unique run ID for PostgreSQL logging integration. |
| --wandb_run_path | str | No | W&B run path for PostgreSQL logging integration. |
| --process_output | str | No (default: r1_style) | Output processing mode for OE-Eval. |
| --add_stop_sequence | str (nargs=+) | No | Additional stop sequence(s) for generation (default: <\|eot_id\|> for Llama 3). |
| --gsm_stop_at_double_newline | flag | No | Stop GSM generation at the first double newline. |
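The three model-source cases behind `--model_name` and `--location` can be illustrated as a small dispatcher. The helper name, the return shape, and the Beaker-dataset-ID check are assumptions for illustration, not the script's actual code:

```python
# Hypothetical sketch of model source resolution:
#   hf- prefix      -> load from HuggingFace Hub
#   Beaker dataset  -> mounted into the task at /model
#   anything else   -> local path on shared storage (Weka)

def resolve_model_source(model_name: str, location: str) -> dict:
    if model_name.startswith("hf-"):
        # HuggingFace Hub: the location (or the de-prefixed name) is the repo id.
        return {"source": "huggingface", "path": location or model_name[len("hf-"):]}
    if len(location) == 26 and location.isalnum() and location.isupper():
        # Beaker dataset IDs look like 01HQXGAYGCS6D4ZK51K83CM49Y (assumed check).
        return {"source": "beaker", "path": "/model", "dataset": location}
    # Otherwise treat the location as a filesystem path on Weka.
    return {"source": "weka", "path": location}
```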

Outputs

| Name | Type | Description |
|------|------|-------------|
| Beaker Evaluation Experiment | Beaker experiment | A single Beaker experiment containing one task per selected benchmark, running in parallel. |
| Auto-created YAML | Config file | Experiment specification saved to configs/beaker_configs/auto_created/ for reproducibility. |
| OE-Eval Experiments | Beaker experiments (optional) | Additional evaluation experiments submitted via the OE-Eval framework when --run_oe_eval_experiments is set. |
| HuggingFace Upload | HF dataset entries (optional) | Evaluation results uploaded to HuggingFace Hub when --upload_to_hf is specified. |
| Evaluation Results | JSON files | Per-benchmark results stored in each task's /output/ directory on Beaker. |
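Once tasks finish, per-benchmark results can be gathered from the task output directories. This sketch assumes a metrics.json filename and a one-directory-per-benchmark layout, which are assumptions about the output structure rather than a documented contract:

```python
import json
from pathlib import Path

# Illustrative collector: walk an output tree and index each benchmark's
# metrics by its directory name. The metrics.json name is assumed.

def collect_metrics(output_dir: str) -> dict:
    results = {}
    for path in Path(output_dir).glob("**/metrics.json"):
        results[path.parent.name] = json.loads(path.read_text())
    return results
```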

Supported Benchmarks

| Experiment Group | Benchmark | Configuration | GPU Count |
|------------------|-----------|---------------|-----------|
| mmlu_0shot | MMLU | 0-shot, batch size 4 | 1 |
| mmlu_5shot | MMLU | 5-shot, batch size 4 | 1 |
| gsm_direct | GSM8K | 8-shot, no CoT, 200 examples, vLLM | 1 |
| gsm_cot | GSM8K | 8-shot, CoT, 200 examples, vLLM | 1 |
| MATH_cot | MATH | 4-shot, CoT, 200 examples, vLLM | 1 |
| bbh_direct | BIG-Bench Hard | 40 examples per task, no CoT, vLLM | 1 |
| bbh_cot | BIG-Bench Hard | 40 examples per task, CoT, vLLM | 1 |
| tydiqa_goldp_1shot | TyDiQA | 1-shot with gold passage, 100 per lang, vLLM | 1 |
| tydiqa_no_context_1shot | TyDiQA | 1-shot without context, 100 per lang, vLLM | 1 |
| codex_eval_temp_0.1 | HumanEval | Temperature 0.1, 20 samples, pass@{1,5,10,20}, vLLM | 1* |
| codex_eval_temp_0.8 | HumanEval | Temperature 0.8, 20 samples, pass@{1,5,10,20}, vLLM | 1* |
| codex_evalplus_temp_0.1 | HumanEval+ | Temperature 0.1, 20 samples, chat format, vLLM | 1* |
| codex_evalplus_temp_0.8 | HumanEval+ | Temperature 0.8, 20 samples, chat format, vLLM | 1* |
| mbpp_evalplus_temp_0.1 | MBPP+ | Temperature 0.1, 20 samples, chat format, vLLM | 1* |
| mbpp_evalplus_temp_0.8 | MBPP+ | Temperature 0.8, 20 samples, chat format, vLLM | 1* |
| ifeval | IFEval | Chat format, vLLM | 1 |
| truthfulqa | TruthfulQA | Truth + info + MC metrics, batch size 20 | 1 |
| toxigen | ToxiGen | Batch size 32, vLLM | 1 |
| xstest | XSTest | Batch size 32, vLLM | 1 |
| alpaca_eval | AlpacaEval | v1, chat format, vLLM | 1 |
| alpaca_eval_2 | AlpacaEval 2 | Length-controlled, chat format, vLLM | 1 |

* Codex/MBPP tasks may receive doubled GPU counts for larger models via the gpu_multiplier heuristic.
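The pass@{1,5,10,20} metrics reported by the codex_eval tasks are conventionally computed with the unbiased estimator from the HumanEval paper: for n samples of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). A direct implementation (not taken from this repository):

```python
from math import comb

# Unbiased pass@k estimator: probability that at least one of k draws
# (without replacement) from n samples is correct, given c correct samples.

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 20 samples of which 10 pass, pass@1 is 0.5.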

Usage Examples

Basic Usage

# Evaluate a HuggingFace model on all default benchmarks
python scripts/submit_eval_jobs.py \
    --model_name hf-allenai/Llama-3.1-Tulu-3-8B \
    --location allenai/Llama-3.1-Tulu-3-8B \
    --is_tuned \
    --workspace ai2/tulu-3-results

Tulu 3 8B Example

# Full evaluation of Tulu 3 8B with OE-Eval and HuggingFace upload
python scripts/submit_eval_jobs.py \
    --model_name hf-allenai/Llama-3.1-Tulu-3-8B \
    --location allenai/Llama-3.1-Tulu-3-8B \
    --is_tuned \
    --use_hf_tokenizer_template \
    --workspace ai2/tulu-3-results \
    --cluster ai2/jupiter ai2/saturn ai2/ceres ai2/neptune \
    --run_oe_eval_experiments \
    --oe_eval_task_suite OLMO_3 \
    --evaluate_on_weka \
    --upload_to_hf "allenai/tulu-3-evals//results/Llama-3.1-Tulu-3-8B" \
    --hf_upload_experiments alpaca_eval alpaca_eval_2

Quick Iteration Example

# Evaluate on only core benchmarks for fast feedback
python scripts/submit_eval_jobs.py \
    --model_name hf-allenai/Llama-3.1-Tulu-3-8B \
    --location allenai/Llama-3.1-Tulu-3-8B \
    --is_tuned \
    --experiments mmlu_5shot gsm_cot bbh_cot ifeval \
    --workspace ai2/tulu-3-results

Large Model Example (70B)

# Evaluate a 70B model with automatic batch/GPU scaling
python scripts/submit_eval_jobs.py \
    --model_name hf-allenai/Llama-3.1-Tulu-3-70B \
    --location allenai/Llama-3.1-Tulu-3-70B \
    --is_tuned \
    --workspace ai2/tulu-3-results \
    --priority normal
# Batch size automatically reduced by 4x, GPU count automatically doubled for large models

Beaker Dataset Model Example

# Evaluate a model stored as a Beaker dataset with a specific checkpoint subfolder
python scripts/submit_eval_jobs.py \
    --model_name my-tulu3-experiment \
    --location 01HQXGAYGCS6D4ZK51K83CM49Y \
    --beaker_subfolder checkpoint-5000 \
    --is_tuned \
    --workspace ai2/tulu-3-results

OE-Eval Only Example

# Run only OE-Eval experiments, skip Open Instruct native evals
python scripts/submit_eval_jobs.py \
    --model_name hf-allenai/Llama-3.1-Tulu-3-8B \
    --location allenai/Llama-3.1-Tulu-3-8B \
    --is_tuned \
    --skip_oi_evals \
    --run_oe_eval_experiments \
    --oe_eval_task_suite SAFETY_EVAL \
    --workspace ai2/tulu-3-results

Dependencies

| Package | Purpose |
|---------|---------|
| beaker (CLI) | Beaker experiment creation via the beaker experiment create command |
| yaml | Parsing and writing Beaker experiment YAML configurations |
| subprocess | Launching Beaker CLI commands and the OE-Eval wrapper script |
| argparse | CLI argument parsing |
| open_instruct.launch_utils | Cluster constants (WEKA_CLUSTERS) for storage mount decisions |
| open_instruct.utils | General utility functions |
| configs/beaker_configs/default_eval.yaml | Base evaluation task template with default image, datasets, and resource settings |
| scripts/eval/oe-eval.sh | Wrapper script for submitting OE-Eval framework evaluations |
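The final submission step combines these dependencies: the assembled spec is written to configs/beaker_configs/auto_created/ and passed to `beaker experiment create`. This is a hedged sketch of that step; the function name, filename scheme, and dry-run flag are illustrative:

```python
import subprocess

# Illustrative final step: persist the experiment spec as YAML, then submit
# it with the Beaker CLI. dry_run=True only builds the command.

def submit(spec: dict, name: str, workspace: str, dry_run: bool = True) -> list[str]:
    config_path = f"configs/beaker_configs/auto_created/{name}.yaml"
    cmd = ["beaker", "experiment", "create", config_path, "--workspace", workspace]
    if not dry_run:
        import yaml  # pyyaml; deferred so dry runs need no extra dependency
        with open(config_path, "w") as f:
            yaml.dump(spec, f)
        subprocess.run(cmd, check=True)
    return cmd
```

Saving the auto-created YAML alongside the submission is what makes each run reproducible after the fact.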

Related Pages

Implements Principle
