Principle: AllenAI Open Instruct Evaluation Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Model Evaluation, Benchmarking, Post-Training, LLM Quality Assurance |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
The Evaluation Pipeline is a systematic approach to assessing the quality of post-trained language models: it submits parallel benchmark evaluation jobs across a comprehensive suite of tasks covering knowledge, reasoning, code generation, instruction following, safety, and multilingual capability.
Description
After each stage of post-training (SFT, DPO, GRPO), the resulting model checkpoint must be evaluated across a diverse set of benchmarks to determine whether the training improved capabilities without introducing regressions. The Evaluation Pipeline principle addresses this by defining a standardized set of benchmarks, automating the submission of evaluation jobs to a compute cluster, and aggregating results for comparison.
This principle solves several problems:
- Comprehensive Coverage: A single model must be evaluated on 20+ benchmark configurations spanning knowledge (MMLU), mathematical reasoning (GSM8K, MATH), logical reasoning (BBH), code generation (HumanEval, MBPP, EvalPlus), instruction following (IFEval), open-ended generation quality (AlpacaEval), factual accuracy (TruthfulQA), safety (ToxiGen, XSTest), and multilingual understanding (TyDiQA).
- Consistency: Every model is evaluated with identical prompts, few-shot configurations, and scoring methods, enabling fair comparison across training runs.
- Scalability: Evaluation jobs run in parallel on the cluster, so a full evaluation suite that would take days sequentially can complete in hours.
- Integration with OE-Eval: Beyond Open Instruct's native evaluation scripts, the pipeline can also submit evaluations to the OE-Eval framework, which provides additional standardized benchmark suites (OLMO_3, TULU_3_DEV, SAFETY_EVAL, etc.).
Usage
Use this principle when you need to:
- Evaluate a newly trained model checkpoint across the standard Tulu 3 benchmark suite.
- Compare multiple training configurations or hyperparameter sweeps on the same evaluation criteria.
- Generate evaluation results for publication, model cards, or internal review.
- Upload evaluation results to HuggingFace Hub for public reporting.
- Run safety evaluations (ToxiGen, XSTest) to ensure model alignment.
Theoretical Basis
Multi-Dimensional Evaluation
Language model quality cannot be captured by a single metric. The Tulu 3 evaluation philosophy follows the principle that a well-rounded post-trained model must demonstrate competence across multiple orthogonal dimensions:
Knowledge and Comprehension:
- MMLU (Massive Multitask Language Understanding): Measures factual knowledge across 57 academic subjects. Evaluated in both 0-shot and 5-shot configurations to assess both raw knowledge and in-context learning ability.
Mathematical and Logical Reasoning:
- GSM8K (Grade School Math 8K): Tests multi-step arithmetic reasoning. Evaluated in both direct and chain-of-thought (CoT) modes to measure both answer accuracy and reasoning quality.
- MATH: Competition-level mathematics problems, evaluated with chain-of-thought prompting.
- BBH (BIG-Bench Hard): A curated subset of BIG-Bench tasks that are challenging for language models, evaluated in both direct and CoT configurations.
Code Generation:
- HumanEval / EvalPlus: Measures functional code generation ability using pass@k metrics at multiple temperatures (0.1 for precision, 0.8 for diversity).
- MBPP / EvalPlus: An alternative code generation benchmark with different problem distributions.
Instruction Following:
- IFEval (Instruction Following Evaluation): Tests the model's ability to follow specific formatting and content constraints in instructions.
- AlpacaEval / AlpacaEval 2: Measures open-ended instruction following quality using a judge model, with version 2 using a length-controlled evaluation to reduce verbosity bias.
Factual Accuracy and Safety:
- TruthfulQA: Evaluates whether the model generates factually accurate responses and avoids common misconceptions. Uses dedicated truth and informativeness judge models.
- ToxiGen: Measures the model's tendency to generate toxic content across different demographic groups.
- XSTest: Tests the model's ability to refuse harmful requests while remaining helpful for benign ones.
Multilingual Understanding:
- TyDiQA: Evaluates question answering ability across multiple languages, tested both with and without gold context passages to measure open-book and closed-book capabilities.
Few-Shot Configuration Strategy
The choice of few-shot configuration is deliberate for each benchmark:
- 0-shot MMLU: Tests the model's instruction-following ability and internalized knowledge without demonstrations.
- 5-shot MMLU: Tests in-context learning ability with demonstrations.
- 8-shot GSM: Provides sufficient examples for the model to learn the expected answer format.
- 4-shot MATH: Balances context length with demonstration quality for harder problems.
- 1-shot TyDiQA: Minimal demonstrations for multilingual tasks to avoid overwhelming the context with English-centric examples.
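The few-shot strategy above can be sketched as a simple prompt assembler. This is a minimal illustration, not the pipeline's actual prompt-construction code; the function name and Q/A format are hypothetical:

```python
def build_prompt(question: str, demonstrations: list, n_shots: int) -> str:
    """Assemble an n-shot prompt from (question, answer) demonstration
    pairs, prepending up to n_shots worked examples before the target
    question. A 0-shot prompt contains only the target question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in demonstrations[:n_shots]]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```

With `n_shots=0` this degenerates to a bare question, which is exactly why 0-shot MMLU isolates instruction following and internalized knowledge from in-context learning.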
Temperature-Controlled Code Evaluation
Code generation benchmarks are run at two temperatures to capture different aspects of model capability:
- Temperature 0.1: Near-greedy decoding measures the model's best single-attempt code quality (pass@1).
- Temperature 0.8: High-diversity sampling with 20 samples measures the model's coverage of the solution space (pass@5, pass@10, pass@20).
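The pass@k numbers reported by these code benchmarks are typically computed with the standard unbiased estimator (introduced with HumanEval): given n samples of which c pass the tests, estimate the probability that at least one of k drawn samples passes. A self-contained sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    computed as a stable product to avoid large binomials.
    n: total samples, c: samples that pass, k: draw size."""
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k
        # must include at least one passing sample.
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))
```

For example, with 20 samples at temperature 0.8, pass@5, pass@10, and pass@20 are all computed from the same (n=20, c) counts by varying k.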
Chat Template Adaptation
All evaluations use chat-formatted prompts, with the template automatically selected based on the model family:
- Tulu models: Use the Tulu chat format as default.
- OLMo models: Use the OLMo-specific chat format.
- Llama 2 Chat models: Use the Llama 2 system/user/assistant format.
- HuggingFace Tokenizer Template: An override option that uses the tokenizer's built-in chat template, providing maximum compatibility.
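The selection rules above amount to a small dispatch on the model name. The helper below is a hypothetical sketch of that logic (the function name and return labels are illustrative, not the pipeline's actual API):

```python
def select_chat_template(model_name: str,
                         use_hf_tokenizer_template: bool = False) -> str:
    """Pick a chat template family from the model name, mirroring
    the rules described above. The HF tokenizer template override
    takes precedence over all name-based heuristics."""
    if use_hf_tokenizer_template:
        return "hf_tokenizer"
    name = model_name.lower()
    if "olmo" in name:
        return "olmo"
    if "llama-2" in name and "chat" in name:
        return "llama2_chat"
    # Tulu format is the default for tuned models.
    return "tulu"
```

In practice the HF tokenizer override is the safest choice for arbitrary checkpoints, since the template ships with the tokenizer itself.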
Automatic Resource Scaling
Evaluation resource requirements scale with model size. The pipeline applies heuristic-based adjustments:
- Batch size reduction: 13B models halve batch sizes; 30B-72B models quarter batch sizes to fit in GPU memory.
- GPU multiplier: Larger models (30B+) receive additional GPUs, with code evaluation tasks receiving an extra multiplier due to higher memory requirements from sampling.
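These heuristics can be expressed as a small pure function. The thresholds below follow the text; the exact function signature and multipliers are an illustrative assumption, not the pipeline's literal implementation:

```python
def scale_resources(num_params_b: float, base_batch_size: int,
                    base_gpus: int, is_code_eval: bool = False):
    """Heuristic resource scaling by model size:
    13B models halve the batch size; 30B+ models quarter it
    and receive extra GPUs, with code evals doubled again
    because high-temperature sampling inflates memory use."""
    if num_params_b >= 30:
        batch = max(1, base_batch_size // 4)
        gpus = base_gpus * 2
        if is_code_eval:
            gpus *= 2
    elif num_params_b >= 13:
        batch = max(1, base_batch_size // 2)
        gpus = base_gpus
    else:
        batch, gpus = base_batch_size, base_gpus
    return batch, gpus
```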
Practical Guide
Step 1: Identify the Model Checkpoint
After training completes, identify the model checkpoint. It can be specified as:
- A HuggingFace model name prefixed with `hf-` (e.g., `hf-allenai/Llama-3.1-Tulu-3-8B`).
- A Beaker dataset ID for models stored on the Beaker platform.
- A local path on shared storage (e.g., a Weka path like `/weka/oe-adapt-default/...`).
Step 2: Select Benchmarks
Choose which benchmarks to run. The default suite includes all 20+ configurations listed above, but you can select a subset with the --experiments flag for faster iteration:
```shell
# Quick evaluation on core benchmarks
--experiments mmlu_5shot gsm_cot bbh_cot ifeval

# Full evaluation suite (default):
# omit --experiments to run all benchmarks
```
Step 3: Submit Evaluation Jobs
Submit all evaluation jobs in a single command. Each benchmark becomes a separate Beaker task within one experiment, running in parallel:
```shell
python scripts/submit_eval_jobs.py \
    --model_name hf-allenai/Llama-3.1-Tulu-3-8B \
    --location allenai/Llama-3.1-Tulu-3-8B \
    --is_tuned \
    --workspace ai2/tulu-3-results
```
Step 4: Optionally Run OE-Eval Suite
For more comprehensive evaluation using AI2's OE-Eval framework, add the --run_oe_eval_experiments flag. This submits additional evaluations using standardized task suites:
```shell
python scripts/submit_eval_jobs.py \
    --model_name hf-allenai/Llama-3.1-Tulu-3-8B \
    --location allenai/Llama-3.1-Tulu-3-8B \
    --is_tuned \
    --run_oe_eval_experiments \
    --oe_eval_task_suite OLMO_3 \
    --workspace ai2/tulu-3-results
```
Step 5: Upload Results to HuggingFace Hub
To make evaluation results publicly available, use the --upload_to_hf flag with the target HF dataset:
```shell
python scripts/submit_eval_jobs.py \
    --model_name hf-allenai/Llama-3.1-Tulu-3-8B \
    --location allenai/Llama-3.1-Tulu-3-8B \
    --is_tuned \
    --upload_to_hf "allenai/tulu-3-evals//results/Llama-3.1-Tulu-3-8B" \
    --hf_upload_experiments alpaca_eval alpaca_eval_2 \
    --workspace ai2/tulu-3-results
```
Step 6: Analyze and Compare Results
Evaluation results are stored as Beaker experiment outputs. Compare across training runs by collecting metrics from each experiment's output directory.
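A minimal aggregation sketch, assuming each benchmark task writes a `metrics.json` into its own output subdirectory (the directory layout and file name here are assumptions about the Beaker output structure, not a documented contract):

```python
import json
from pathlib import Path

def collect_metrics(results_root: str) -> dict:
    """Walk one subdirectory per benchmark task under results_root,
    read each task's metrics.json, and return {task_name: metrics}
    for side-by-side comparison across training runs."""
    table = {}
    for metrics_file in sorted(Path(results_root).glob("*/metrics.json")):
        table[metrics_file.parent.name] = json.loads(metrics_file.read_text())
    return table
```

Running this over the downloaded outputs of two experiments yields two dicts keyed by the same task names, which can then be diffed to spot regressions.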