
Workflow:ContextualAI HALOs Model Evaluation

From Leeroopedia


Knowledge Sources
Domains LLMs, Evaluation, LLM_Ops
Last Updated 2026-02-08 03:00 GMT

Overview

End-to-end process for evaluating an aligned language model on instruction-following benchmarks (AlpacaEval) and standard NLP benchmarks (LM Evaluation Harness).

Description

This workflow evaluates trained models using two complementary approaches. First, the model is sampled using vLLM on AlpacaEval prompts and scored against a reference model (e.g., GPT-4) to measure instruction-following quality. Second, the model is evaluated on multiple-choice and generation benchmarks through the LM Evaluation Harness covering reasoning, knowledge, and safety. An optional metrics summarization script aggregates results across multiple experiments for comparison.

Goals:

  • Measure model quality on instruction-following tasks via AlpacaEval win rates
  • Assess general capabilities via standard benchmarks (ARC, WinoGrande, BBH, GSM8K, etc.)
  • Compare multiple aligned models to identify the best configuration

Scope:

  • From a saved model checkpoint to evaluation metrics and leaderboard scores
  • Covers vLLM sampling, AlpacaEval scoring, LM Eval Harness benchmarks, and metrics summarization

Strategy:

  • vLLM provides fast inference for generating AlpacaEval samples
  • AlpacaEval uses GPT-4 (or GPT-4.1) as an annotator to compute length-controlled win rates
  • LM Eval Harness runs standardized benchmarks for objective comparison
  • The metrics summarizer aggregates results from log files into a structured CSV

Usage

Execute this workflow after any training pipeline (Offline SFT Alignment, Online Iterative Alignment, etc.) to measure the quality of the resulting model. This is also useful for comparing different alignment methods, hyperparameters, or training configurations side by side.

Execution Steps

Step 1: Sample_For_AlpacaEval

Generate responses to AlpacaEval prompts using vLLM for fast inference. The sampling script loads the trained model with tensor parallelism and generates one response per prompt from the AlpacaEval dataset.

What happens:

  • vLLM loads the model checkpoint with tensor parallelism across the specified GPUs
  • The AlpacaEval dataset prompts are loaded via the SFTDataLoader
  • For each prompt, a response is generated with the configured sampling parameters (temperature, top-p, max tokens)
  • Outputs are saved as JSON with instruction and output fields required by AlpacaEval

Key considerations:

  • The --mode alpacaeval flag ensures the output format matches AlpacaEval expectations
  • GPU count for vLLM tensor parallelism should be set based on model size
  • The stop token should match the model's chat template end token
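The sampling step can be sketched in Python. The helper below packs prompts and completions into the JSON schema AlpacaEval expects; the vLLM call itself is shown only as a comment sketch, since the checkpoint path, sampling parameters, and stop token here are illustrative assumptions, not the workflow's actual script.

```python
import json

def to_alpacaeval_records(instructions, outputs, generator):
    """Pair each prompt with its completion in the schema AlpacaEval
    expects: 'instruction' and 'output' fields, plus a 'generator'
    label identifying the model on the leaderboard."""
    return [
        {"instruction": inst, "output": out, "generator": generator}
        for inst, out in zip(instructions, outputs)
    ]

if __name__ == "__main__":
    # In the real step the outputs come from vLLM, roughly
    # (paths and parameters below are assumptions):
    #   from vllm import LLM, SamplingParams
    #   llm = LLM(model="/path/to/checkpoint", tensor_parallel_size=4)
    #   params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=2048,
    #                           stop=["<|im_end|>"])  # must match the chat template
    #   outputs = [o.outputs[0].text for o in llm.generate(instructions, params)]
    instructions = ["What is 2 + 2?"]
    outputs = ["2 + 2 equals 4."]
    records = to_alpacaeval_records(instructions, outputs, "my-model")
    print(json.dumps(records, indent=2))
```

The resulting list can be dumped to a JSON file and passed directly to the scoring step.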

Step 2: Run_AlpacaEval

Score the generated responses using the AlpacaEval evaluation framework. This compares model outputs against a reference model (typically GPT-4) using an LLM judge and computes win rates.

What happens:

  • The alpaca_eval evaluate command processes the sampled JSON file
  • An annotator model (configured via YAML, e.g., GPT-4.1 or GPT-4.1-mini) judges each response pair
  • Length-controlled win rate (LCWR) and raw win rate (WR) are computed
  • Results are added to the AlpacaEval leaderboard
  • An OpenAI API key is required for the judge model
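A minimal way to drive the scoring step from Python, assuming the alpaca_eval package is installed. The file name and annotator config name are illustrative placeholders; the actual run (commented out) calls the OpenAI API and so needs a valid key.

```python
def build_alpacaeval_cmd(model_outputs_json, annotators_config):
    """Assemble the alpaca_eval scoring command. annotators_config
    names a YAML annotator configuration known to the alpaca_eval
    package, e.g. one backed by GPT-4.1 or GPT-4.1-mini."""
    return [
        "alpaca_eval", "evaluate",
        "--model_outputs", model_outputs_json,
        "--annotators_config", annotators_config,
    ]

if __name__ == "__main__":
    # Both arguments below are illustrative, not guaranteed defaults.
    cmd = build_alpacaeval_cmd("alpacaeval_samples.json", "alpaca_eval_gpt4")
    print(" ".join(cmd))
    # To actually score (requires OPENAI_API_KEY for the judge):
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```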

Step 3: Run_LM_Eval_Harness

Evaluate the model on standard NLP benchmarks using the LM Evaluation Harness framework. This runs the model on multiple-choice tasks (ARC, WinoGrande, MMLU), reasoning tasks (BBH, GSM8K), code generation (HumanEval, MBPP), and safety/factuality tasks (TruthfulQA, ToxiGen, IFEval).

What happens:

  • The lm_eval command loads the model with HuggingFace's parallelize=True for multi-GPU inference
  • Each benchmark task is run with the configured batch size
  • Results include accuracy, exact match, or task-specific metrics
  • Output is logged to stdout or a file for later aggregation

Key considerations:

  • Some tasks require --confirm_run_unsafe_code (e.g., HumanEval)
  • GSM8K with chain-of-thought may have issues with auto batch size; use a fixed size
  • Gated datasets (e.g., GPQA) require HuggingFace Hub authentication
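The harness invocation and the considerations above can be assembled like this. The model path and task names are placeholders, and the fixed batch size chosen for GSM8K is an arbitrary assumption rather than a value from the workflow.

```python
def build_lm_eval_cmd(model_path, tasks, batch_size="auto", output_path=None):
    """Assemble an LM Evaluation Harness invocation. parallelize=True
    asks HuggingFace to shard the model across all visible GPUs."""
    # GSM8K with chain-of-thought can misbehave with batch_size="auto",
    # so fall back to a fixed size (8 here is an arbitrary choice).
    if any("gsm8k" in t for t in tasks) and batch_size == "auto":
        batch_size = 8
    cmd = [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_path},parallelize=True",
        "--tasks", ",".join(tasks),
        "--batch_size", str(batch_size),
    ]
    if any(t in ("humaneval", "mbpp") for t in tasks):
        cmd.append("--confirm_run_unsafe_code")  # code tasks execute model output
    if output_path is not None:
        cmd += ["--output_path", output_path]  # log results for later aggregation
    return cmd
```

For gated datasets such as GPQA, the environment additionally needs a HuggingFace Hub token before this command is run.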

Step 4: Summarize_Metrics

Aggregate evaluation results from multiple experiments using the metrics summarization script. This parses log files from both AlpacaEval and LM Eval Harness runs, extracting key metrics into a structured CSV for easy comparison.

What happens:

  • The summarization script scans log files for evaluation results
  • AlpacaEval metrics (LCWR, WR) are extracted from leaderboard output
  • LM Eval metrics are extracted from the harness output format
  • All metrics are combined into a single CSV with one row per model
  • Results can be sorted and compared across different training configurations
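A minimal summarizer sketch. The log formats assumed here — an AlpacaEval line carrying length_controlled_winrate/win_rate values, and the harness's pipe-delimited results table — are illustrative assumptions; the workflow's actual script may parse different shapes.

```python
import csv
import io
import re

# Assumed log shapes (illustrative, not the workflow's actual format):
#   AlpacaEval:  "... length_controlled_winrate: 32.50 win_rate: 28.10 ..."
#   lm_eval:     "|arc_challenge| 1|none| 0|acc|^|0.5512|+-|0.0145|"
ALPACA_RE = re.compile(r"length_controlled_winrate[:=\s]+([\d.]+).*?win_rate[:=\s]+([\d.]+)")
HARNESS_RE = re.compile(r"\|\s*(\w+)\s*\|.*?\|\s*acc\s*\|[^|]*\|\s*([\d.]+)\s*\|")

def summarize(model_name, log_text):
    """Extract AlpacaEval and harness metrics from one model's logs
    into a flat dict (one CSV row)."""
    row = {"model": model_name}
    m = ALPACA_RE.search(log_text)
    if m:
        row["alpacaeval_lcwr"] = float(m.group(1))
        row["alpacaeval_wr"] = float(m.group(2))
    for task, acc in HARNESS_RE.findall(log_text):
        row[task] = float(acc)
    return row

def rows_to_csv(rows):
    """Combine per-model rows into a CSV string, 'model' column first,
    blank cells where a model is missing a metric."""
    fields = sorted({k for r in rows for k in r}, key=lambda k: (k != "model", k))
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Sorting the resulting rows by a chosen column (e.g. LCWR) then gives the side-by-side comparison across training configurations.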

Execution Diagram

GitHub URL

Workflow Repository