Principle: AllenAI Open Instruct Evaluation Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Model Evaluation, Benchmarking, Post-Training, LLM Quality Assurance |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
The Evaluation Pipeline is a systematic approach to assessing the quality of post-trained language models: it submits parallel benchmark evaluation jobs across a comprehensive suite of tasks covering knowledge, reasoning, code generation, instruction following, safety, and multilingual capability.
Description
After each stage of post-training (SFT, DPO, GRPO), the resulting model checkpoint must be evaluated across a diverse set of benchmarks to determine whether the training improved capabilities without introducing regressions. The Evaluation Pipeline principle addresses this by defining a standardized set of benchmarks, automating the submission of evaluation jobs to a compute cluster, and aggregating results for comparison.
This principle solves several problems:
- Comprehensive Coverage: A single model must be evaluated on 20+ benchmark configurations spanning knowledge (MMLU), mathematical reasoning (GSM8K, MATH), logical reasoning (BBH), code generation (HumanEval, MBPP, EvalPlus), instruction following (IFEval), open-ended generation quality (AlpacaEval), factual accuracy (TruthfulQA), safety (ToxiGen, XSTest), and multilingual understanding (TyDiQA).
- Consistency: Every model is evaluated with identical prompts, few-shot configurations, and scoring methods, enabling fair comparison across training runs.
- Scalability: Evaluation jobs run in parallel on the cluster, so a full evaluation suite that would take days sequentially can complete in hours.
- Integration with OE-Eval: Beyond Open Instruct's native evaluation scripts, the pipeline can also submit evaluations to the OE-Eval framework, which provides additional standardized benchmark suites (OLMO_3, TULU_3_DEV, SAFETY_EVAL, etc.).
Usage
Use this principle when you need to:
- Evaluate a newly trained model checkpoint across the standard Tulu 3 benchmark suite.
- Compare multiple training configurations or hyperparameter sweeps on the same evaluation criteria.
- Generate evaluation results for publication, model cards, or internal review.
- Upload evaluation results to HuggingFace Hub for public reporting.
- Run safety evaluations (ToxiGen, XSTest) to ensure model alignment.
Theoretical Basis
Multi-Dimensional Evaluation
Language model quality cannot be captured by a single metric. The Tulu 3 evaluation philosophy follows the principle that a well-rounded post-trained model must demonstrate competence across multiple orthogonal dimensions:
Knowledge and Comprehension:
- MMLU (Massive Multitask Language Understanding): Measures factual knowledge across 57 academic subjects. Evaluated in both 0-shot and 5-shot configurations to assess both raw knowledge and in-context learning ability.
Mathematical and Logical Reasoning:
- GSM8K (Grade School Math 8K): Tests multi-step arithmetic reasoning. Evaluated in both direct and chain-of-thought (CoT) modes to measure both answer accuracy and reasoning quality.
- MATH: Competition-level mathematics problems, evaluated with chain-of-thought prompting.
- BBH (BIG-Bench Hard): A curated subset of BIG-Bench tasks that are challenging for language models, evaluated in both direct and CoT configurations.
Code Generation:
- HumanEval / EvalPlus: Measures functional code generation ability using pass@k metrics at multiple temperatures (0.1 for precision, 0.8 for diversity).
- MBPP / EvalPlus: An alternative code generation benchmark with different problem distributions.
Instruction Following:
- IFEval (Instruction Following Evaluation): Tests the model's ability to follow specific formatting and content constraints in instructions.
- AlpacaEval / AlpacaEval 2: Measures open-ended instruction following quality using a judge model, with version 2 using a length-controlled evaluation to reduce verbosity bias.
Factual Accuracy and Safety:
- TruthfulQA: Evaluates whether the model generates factually accurate responses and avoids common misconceptions. Uses dedicated truth and informativeness judge models.
- ToxiGen: Measures the model's tendency to generate toxic content across different demographic groups.
- XSTest: Tests the model's ability to refuse harmful requests while remaining helpful for benign ones.
Multilingual Understanding:
- TyDiQA: Evaluates question answering ability across multiple languages, tested both with and without gold context passages to measure open-book and closed-book capabilities.
Few-Shot Configuration Strategy
The choice of few-shot configuration is deliberate for each benchmark:
- 0-shot MMLU: Tests the model's instruction-following ability and internalized knowledge without demonstrations.
- 5-shot MMLU: Tests in-context learning ability with demonstrations.
- 8-shot GSM: Provides sufficient examples for the model to learn the expected answer format.
- 4-shot MATH: Balances context length with demonstration quality for harder problems.
- 1-shot TyDiQA: Minimal demonstrations for multilingual tasks to avoid overwhelming the context with English-centric examples.
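The few-shot strategy above can be sketched as a simple prompt assembler. This is a minimal illustration, not the pipeline's actual prompt-construction code; the function name and Q/A format are hypothetical:

```python
def build_prompt(question: str, demonstrations: list, n_shots: int) -> str:
    """Assemble an n-shot prompt from (question, answer) demonstration
    pairs, prepending up to n_shots worked examples before the target
    question. A 0-shot prompt contains only the target question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in demonstrations[:n_shots]]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```

With `n_shots=0` this degenerates to a bare question, which is exactly why 0-shot MMLU isolates instruction following and internalized knowledge from in-context learning.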
Temperature-Controlled Code Evaluation
Code generation benchmarks are run at two temperatures to capture different aspects of model capability:
- Temperature 0.1: Near-greedy decoding measures the model's best single-attempt code quality (pass@1).
- Temperature 0.8: High-diversity sampling with 20 samples measures the model's coverage of the solution space (pass@5, pass@10, pass@20).
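The pass@k numbers reported by these code benchmarks are typically computed with the standard unbiased estimator (introduced with HumanEval): given n samples of which c pass the tests, estimate the probability that at least one of k drawn samples passes. A self-contained sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    computed as a stable product to avoid large binomials.
    n: total samples, c: samples that pass, k: draw size."""
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k
        # must include at least one passing sample.
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))
```

For example, with 20 samples at temperature 0.8, pass@5, pass@10, and pass@20 are all computed from the same (n=20, c) counts by varying k.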
Chat Template Adaptation
All evaluations use chat-formatted prompts, with the template automatically selected based on the model family:
- Tulu models: Use the Tulu chat format as default.
- OLMo models: Use the OLMo-specific chat format.
- Llama 2 Chat models: Use the Llama 2 system/user/assistant format.
- HuggingFace Tokenizer Template: An override option that uses the tokenizer's built-in chat template, providing maximum compatibility.
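The selection rules above amount to a small dispatch on the model name. The helper below is a hypothetical sketch of that logic (the function name and return labels are illustrative, not the pipeline's actual API):

```python
def select_chat_template(model_name: str,
                         use_hf_tokenizer_template: bool = False) -> str:
    """Pick a chat template family from the model name, mirroring
    the rules described above. The HF tokenizer template override
    takes precedence over all name-based heuristics."""
    if use_hf_tokenizer_template:
        return "hf_tokenizer"
    name = model_name.lower()
    if "olmo" in name:
        return "olmo"
    if "llama-2" in name and "chat" in name:
        return "llama2_chat"
    # Tulu format is the default for tuned models.
    return "tulu"
```

In practice the HF tokenizer override is the safest choice for arbitrary checkpoints, since the template ships with the tokenizer itself.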
Automatic Resource Scaling
Evaluation resource requirements scale with model size. The pipeline applies heuristic-based adjustments:
- Batch size reduction: 13B models halve batch sizes; 30B-72B models quarter batch sizes to fit in GPU memory.
- GPU multiplier: Larger models (30B+) receive additional GPUs, with code evaluation tasks receiving an extra multiplier due to higher memory requirements from sampling.
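These heuristics can be expressed as a small pure function. The thresholds below follow the text; the exact function signature and multipliers are an illustrative assumption, not the pipeline's literal implementation:

```python
def scale_resources(num_params_b: float, base_batch_size: int,
                    base_gpus: int, is_code_eval: bool = False):
    """Heuristic resource scaling by model size:
    13B models halve the batch size; 30B+ models quarter it
    and receive extra GPUs, with code evals doubled again
    because high-temperature sampling inflates memory use."""
    if num_params_b >= 30:
        batch = max(1, base_batch_size // 4)
        gpus = base_gpus * 2
        if is_code_eval:
            gpus *= 2
    elif num_params_b >= 13:
        batch = max(1, base_batch_size // 2)
        gpus = base_gpus
    else:
        batch, gpus = base_batch_size, base_gpus
    return batch, gpus
```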
Practical Guide
Step 1: Identify the Model Checkpoint
After training completes, identify the model checkpoint. It can be specified as:
- A HuggingFace model name prefixed with `hf-` (e.g., `hf-allenai/Llama-3.1-Tulu-3-8B`).
- A Beaker dataset ID for models stored on the Beaker platform.
- A local path on shared storage (e.g., a Weka path like `/weka/oe-adapt-default/...`).
Step 2: Select Benchmarks
Choose which benchmarks to run. The default suite includes all 20+ configurations listed above, but you can select a subset with the --experiments flag for faster iteration:
```shell
# Quick evaluation on core benchmarks
--experiments mmlu_5shot gsm_cot bbh_cot ifeval

# Full evaluation suite (default):
# omit --experiments to run all benchmarks
```
Step 3: Submit Evaluation Jobs
Submit all evaluation jobs in a single command. Each benchmark becomes a separate Beaker task within one experiment, running in parallel:
```shell
python scripts/submit_eval_jobs.py \
    --model_name hf-allenai/Llama-3.1-Tulu-3-8B \
    --location allenai/Llama-3.1-Tulu-3-8B \
    --is_tuned \
    --workspace ai2/tulu-3-results
```
Step 4: Optionally Run OE-Eval Suite
For more comprehensive evaluation using AI2's OE-Eval framework, add the --run_oe_eval_experiments flag. This submits additional evaluations using standardized task suites:
```shell
python scripts/submit_eval_jobs.py \
    --model_name hf-allenai/Llama-3.1-Tulu-3-8B \
    --location allenai/Llama-3.1-Tulu-3-8B \
    --is_tuned \
    --run_oe_eval_experiments \
    --oe_eval_task_suite OLMO_3 \
    --workspace ai2/tulu-3-results
```
Step 5: Upload Results to HuggingFace Hub
To make evaluation results publicly available, use the --upload_to_hf flag with the target HF dataset:
```shell
python scripts/submit_eval_jobs.py \
    --model_name hf-allenai/Llama-3.1-Tulu-3-8B \
    --location allenai/Llama-3.1-Tulu-3-8B \
    --is_tuned \
    --upload_to_hf "allenai/tulu-3-evals//results/Llama-3.1-Tulu-3-8B" \
    --hf_upload_experiments alpaca_eval alpaca_eval_2 \
    --workspace ai2/tulu-3-results
```
Step 6: Analyze and Compare Results
Evaluation results are stored as Beaker experiment outputs. Compare across training runs by collecting metrics from each experiment's output directory.
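A minimal aggregation sketch, assuming each benchmark task writes a `metrics.json` into its own output subdirectory (the directory layout and file name here are assumptions about the Beaker output structure, not a documented contract):

```python
import json
from pathlib import Path

def collect_metrics(results_root: str) -> dict:
    """Walk one subdirectory per benchmark task under results_root,
    read each task's metrics.json, and return {task_name: metrics}
    for side-by-side comparison across training runs."""
    table = {}
    for metrics_file in sorted(Path(results_root).glob("*/metrics.json")):
        table[metrics_file.parent.name] = json.loads(metrics_file.read_text())
    return table
```

Running this over the downloaded outputs of two experiments yields two dicts keyed by the same task names, which can then be diffed to spot regressions.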