
Workflow:ContextualAI HALOs Model Evaluation

From Leeroopedia


Knowledge Sources
Domains LLMs, Evaluation, LLM_Ops
Last Updated 2026-02-08 03:00 GMT

Overview

End-to-end process for evaluating an aligned language model on instruction-following benchmarks (AlpacaEval) and standard NLP benchmarks (LM Evaluation Harness).

Description

This workflow evaluates trained models using two complementary approaches. First, the model is sampled using vLLM on AlpacaEval prompts and scored against a reference model (e.g., GPT-4) to measure instruction-following quality. Second, the model is evaluated on multiple-choice and generation benchmarks through the LM Evaluation Harness covering reasoning, knowledge, and safety. An optional metrics summarization script aggregates results across multiple experiments for comparison.

Goals:

  • Measure model quality on instruction-following tasks via AlpacaEval win rates
  • Assess general capabilities via standard benchmarks (ARC, WinoGrande, BBH, GSM8K, etc.)
  • Compare multiple aligned models to identify the best configuration

Scope:

  • From a saved model checkpoint to evaluation metrics and leaderboard scores
  • Covers vLLM sampling, AlpacaEval scoring, LM Eval Harness benchmarks, and metrics summarization

Strategy:

  • vLLM provides fast inference for generating AlpacaEval samples
  • AlpacaEval uses GPT-4 (or GPT-4.1) as an annotator to compute length-controlled win rates
  • LM Eval Harness runs standardized benchmarks for objective comparison
  • The metrics summarizer aggregates results from log files into a structured CSV

Usage

Execute this workflow after any training pipeline (Offline SFT Alignment, Online Iterative Alignment, etc.) to measure the quality of the resulting model. This is also useful for comparing different alignment methods, hyperparameters, or training configurations side by side.

Execution Steps

Step 1: Sample_For_AlpacaEval

Generate responses to AlpacaEval prompts using vLLM for fast inference. The sampling script loads the trained model with tensor parallelism and generates one response per prompt from the AlpacaEval dataset.

What happens:

  • vLLM loads the model checkpoint with tensor parallelism across the specified GPUs
  • The AlpacaEval dataset prompts are loaded via the SFTDataLoader
  • For each prompt, a response is generated with the configured sampling parameters (temperature, top-p, max tokens)
  • Outputs are saved as JSON with instruction and output fields required by AlpacaEval

Key considerations:

  • The --mode alpacaeval flag ensures the output format matches AlpacaEval expectations
  • GPU count for vLLM tensor parallelism should be set based on model size
  • The stop token should match the model's chat template end token
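The sampling step can be sketched in Python. The helper below packs prompts and completions into the JSON schema AlpacaEval expects; the vLLM call itself is shown only as a comment sketch, since the checkpoint path, sampling parameters, and stop token here are illustrative assumptions, not the workflow's actual script.

```python
import json

def to_alpacaeval_records(instructions, outputs, generator):
    """Pair each prompt with its completion in the schema AlpacaEval
    expects: 'instruction' and 'output' fields, plus a 'generator'
    label identifying the model on the leaderboard."""
    return [
        {"instruction": inst, "output": out, "generator": generator}
        for inst, out in zip(instructions, outputs)
    ]

if __name__ == "__main__":
    # In the real step the outputs come from vLLM, roughly
    # (paths and parameters below are assumptions):
    #   from vllm import LLM, SamplingParams
    #   llm = LLM(model="/path/to/checkpoint", tensor_parallel_size=4)
    #   params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=2048,
    #                           stop=["<|im_end|>"])  # must match the chat template
    #   outputs = [o.outputs[0].text for o in llm.generate(instructions, params)]
    instructions = ["What is 2 + 2?"]
    outputs = ["2 + 2 equals 4."]
    records = to_alpacaeval_records(instructions, outputs, "my-model")
    print(json.dumps(records, indent=2))
```

The resulting list can be dumped to a JSON file and passed directly to the scoring step.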

Step 2: Run_AlpacaEval

Score the generated responses using the AlpacaEval evaluation framework. This compares model outputs against a reference model (typically GPT-4) using an LLM judge and computes win rates.

What happens:

  • The alpaca_eval evaluate command processes the sampled JSON file
  • An annotator model (configured via YAML, e.g., GPT-4.1 or GPT-4.1-mini) judges each response pair
  • Length-controlled win rate (LCWR) and raw win rate (WR) are computed
  • Results are added to the AlpacaEval leaderboard
  • An OpenAI API key is required for the judge model
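A minimal way to drive the scoring step from Python, assuming the alpaca_eval package is installed. The file name and annotator config name are illustrative placeholders; the actual run (commented out) calls the OpenAI API and so needs a valid key.

```python
def build_alpacaeval_cmd(model_outputs_json, annotators_config):
    """Assemble the alpaca_eval scoring command. annotators_config
    names a YAML annotator configuration known to the alpaca_eval
    package, e.g. one backed by GPT-4.1 or GPT-4.1-mini."""
    return [
        "alpaca_eval", "evaluate",
        "--model_outputs", model_outputs_json,
        "--annotators_config", annotators_config,
    ]

if __name__ == "__main__":
    # Both arguments below are illustrative, not guaranteed defaults.
    cmd = build_alpacaeval_cmd("alpacaeval_samples.json", "alpaca_eval_gpt4")
    print(" ".join(cmd))
    # To actually score (requires OPENAI_API_KEY for the judge):
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```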

Step 3: Run_LM_Eval_Harness

Evaluate the model on standard NLP benchmarks using the LM Evaluation Harness framework. This runs the model on multiple-choice tasks (ARC, WinoGrande, MMLU), reasoning tasks (BBH, GSM8K), code generation (HumanEval, MBPP), and safety/factuality tasks (TruthfulQA, ToxiGen, IFEval).

What happens:

  • The lm_eval command loads the model with HuggingFace's parallelize=True for multi-GPU inference
  • Each benchmark task is run with the configured batch size
  • Results include accuracy, exact match, or task-specific metrics
  • Output is logged to stdout or a file for later aggregation

Key considerations:

  • Some tasks require --confirm_run_unsafe_code (e.g., HumanEval)
  • GSM8K with chain-of-thought may have issues with auto batch size; use a fixed size
  • Gated datasets (e.g., GPQA) require HuggingFace Hub authentication
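The harness invocation and the considerations above can be assembled like this. The model path and task names are placeholders, and the fixed batch size chosen for GSM8K is an arbitrary assumption rather than a value from the workflow.

```python
def build_lm_eval_cmd(model_path, tasks, batch_size="auto", output_path=None):
    """Assemble an LM Evaluation Harness invocation. parallelize=True
    asks HuggingFace to shard the model across all visible GPUs."""
    # GSM8K with chain-of-thought can misbehave with batch_size="auto",
    # so fall back to a fixed size (8 here is an arbitrary choice).
    if any("gsm8k" in t for t in tasks) and batch_size == "auto":
        batch_size = 8
    cmd = [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_path},parallelize=True",
        "--tasks", ",".join(tasks),
        "--batch_size", str(batch_size),
    ]
    if any(t in ("humaneval", "mbpp") for t in tasks):
        cmd.append("--confirm_run_unsafe_code")  # code tasks execute model output
    if output_path is not None:
        cmd += ["--output_path", output_path]  # log results for later aggregation
    return cmd
```

For gated datasets such as GPQA, the environment additionally needs a HuggingFace Hub token before this command is run.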

Step 4: Summarize_Metrics

Aggregate evaluation results from multiple experiments using the metrics summarization script. This parses log files from both AlpacaEval and LM Eval Harness runs, extracting key metrics into a structured CSV for easy comparison.

What happens:

  • The summarization script scans log files for evaluation results
  • AlpacaEval metrics (LCWR, WR) are extracted from leaderboard output
  • LM Eval metrics are extracted from the harness output format
  • All metrics are combined into a single CSV with one row per model
  • Results can be sorted and compared across different training configurations
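A minimal summarizer sketch. The log formats assumed here — an AlpacaEval line carrying length_controlled_winrate/win_rate values, and the harness's pipe-delimited results table — are illustrative assumptions; the workflow's actual script may parse different shapes.

```python
import csv
import io
import re

# Assumed log shapes (illustrative, not the workflow's actual format):
#   AlpacaEval:  "... length_controlled_winrate: 32.50 win_rate: 28.10 ..."
#   lm_eval:     "|arc_challenge| 1|none| 0|acc|^|0.5512|+-|0.0145|"
ALPACA_RE = re.compile(r"length_controlled_winrate[:=\s]+([\d.]+).*?win_rate[:=\s]+([\d.]+)")
HARNESS_RE = re.compile(r"\|\s*(\w+)\s*\|.*?\|\s*acc\s*\|[^|]*\|\s*([\d.]+)\s*\|")

def summarize(model_name, log_text):
    """Extract AlpacaEval and harness metrics from one model's logs
    into a flat dict (one CSV row)."""
    row = {"model": model_name}
    m = ALPACA_RE.search(log_text)
    if m:
        row["alpacaeval_lcwr"] = float(m.group(1))
        row["alpacaeval_wr"] = float(m.group(2))
    for task, acc in HARNESS_RE.findall(log_text):
        row[task] = float(acc)
    return row

def rows_to_csv(rows):
    """Combine per-model rows into a CSV string, 'model' column first,
    blank cells where a model is missing a metric."""
    fields = sorted({k for r in rows for k in r}, key=lambda k: (k != "model", k))
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Sorting the resulting rows by a chosen column (e.g. LCWR) then gives the side-by-side comparison across training configurations.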

Execution Diagram

GitHub URL

Workflow Repository